carolzhou / multiPhATE2

multiPhATE with comparative genomics

Error while running gene calling-consensus_1_geneCall' has zero length #3

Closed ShailNair closed 4 years ago

ShailNair commented 4 years ago

Hi, I got an error when consensus or superset is used as the primary gene caller after the gene calling is completed. I have attached the terminal output here. Similarly, the phate_runPipeline.log says "WARNING: User has selected genome type as phage, but primary gene-call file as consensus/superset.cgc"

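Attachments: log.txt (https://github.com/carolzhou/multiPhATE2/files/5310243/log.txt), phate_runPipeline.log (https://github.com/carolzhou/multiPhATE2/files/5310287/phate_runPipeline.log), sii.multiPhate.config.txt (https://github.com/carolzhou/multiPhATE2/files/5310295/sii.multiPhate.config.txt)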

carolzhou commented 4 years ago

The warning message happens whenever you specify that the organism is phage, but select a gene caller (or subset) that is not Phanotate, the only phage-specific gene caller. This warning can be ignored. I will need more information to troubleshoot why the consensus file had zero length. Can you send me the genome fasta file?

From: Shail notifications@github.com Reply-To: carolzhou/multiPhATE2 reply@reply.github.com Date: Thursday, October 1, 2020 at 12:49 AM To: carolzhou/multiPhATE2 multiPhATE2@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: [carolzhou/multiPhATE2] Error while running gene calling-consensus_1_geneCall' has zero length (#3)

ShailNair commented 4 years ago

ok, thanks. Here is the file. https://drive.google.com/file/d/1qErM8a-yp3PUksjSd7rS5O7G3XSVVuu8/view?usp=sharing

carolzhou commented 4 years ago

Go to the PipelineInput/ directory and run your fasta file through the cleaner script, capturing the output in a new filename (‘_c’ stands for “clean”):

$ python ../Utility/cleanHeaders.py myGenomeFile.fasta > myGenomeFile_c.fasta

Then, modify your config file to reflect the new genome fasta filename. Rerun multiPhATE2 and you should get correct gene-call outputs. It is always potentially problematic to have non-alphanumeric characters or spaces in fasta headers, as 3rd party codes may balk when encountering these characters, or 3rd party codes may truncate strings after spaces.
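
As an illustration of the kind of header cleanup described above, here is a minimal Python sketch. It is not the code of Utility/cleanHeaders.py, whose exact rules are not shown in this thread; the function name clean_fasta_headers and the rule of replacing every run of characters other than letters, digits, and underscores with a single underscore are assumptions made for the example.

```python
import re

def clean_fasta_headers(lines):
    """Yield FASTA lines with each header reduced to letters, digits, and underscores.

    Hypothetical helper for illustration; not part of multiPhATE2.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Replace each run of other characters (spaces, pipes, dots, ...) with "_"
            name = re.sub(r"[^A-Za-z0-9_]+", "_", line[1:]).strip("_")
            yield ">" + name
        else:
            # Sequence lines pass through unchanged
            yield line

if __name__ == "__main__":
    example = [">my phage|strain 1.2 (draft)\n", "ATGCGT\n", "TTAGGC\n"]
    for cleaned in clean_fasta_headers(example):
        print(cleaned)
```

With the example input, the header becomes >my_phage_strain_1_2_draft, which contains no spaces for a downstream tool to truncate at.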

Let me know if this does not solve the problem.

ShailNair commented 4 years ago

Thanks, that solved the issue. Next time I should be careful with the headers in the fasta file. Gene calling and translation now run successfully. Now I am facing issues with the database paths. I have downloaded and formatted all the databases except NR and KEGG. As mentioned, we need to provide the full path/filename for each database in the configuration file. But some databases, like RefSeq, consist of many files. So which file path should I assign (or is it just the respective directory path)?

Similarly, when running the gene blast I get an error with the VOGs database. Here is the error (related to the file path, I guess):

BLAST Database error: No alias or index file found for nucleotide database [/home/ahail/Documents/multiphate2/Databases/VOGs/vog.genes.all.fa] in search path [/home/ahail/Documents/multiphate2::]

The NCBI virus genome blast is going well.

Attached files:

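sii.multiPhate.config.txt (https://github.com/carolzhou/multiPhATE2/files/5315980/sii.multiPhate.config.txt), Terminal_output.txt (https://github.com/carolzhou/multiPhATE2/files/5315981/Terminal_output.txt), database_directory_listing.txt (https://github.com/carolzhou/multiPhATE2/files/5315985/database_directory_listing.txt)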

Sorry for all the silly troubles I am making 😛

Thanks for your prompt replies and all the hard work.

carolzhou commented 4 years ago

Normally if you have a segmented database, you name it using the generic root. For example, the location of NR would be something like this:

/Home/userMe/multiphateDir/Databases/NR/nr

Refseq Protein would be something like this:

/Home/userMe/multiphateDir/Databases/Refseq/Protein/refseq_protein

But Phantome would be:

/Home/userMe/multiphateDir/Databases/Phantome/Phantome_Phage_genes.faa

Regarding the VOGs and pVOGs, these databases need to be preprocessed by multiPhATE2 code in order to prepare the fasta headers to contain the VOG identifiers, so that hits can be matched to their annotations. The dbPrep_getDBs.py script will download and preprocess the files for you. Then, you need to specify the locations referring to the preprocessed database files:

/Home/userMe/multiphateDir/Databases/VOGs/vog.genes.tagged.all.fa
/Home/userMe/multiphateDir/Databases/VOGs/vog.proteins.tagged.all.fa

Using the raw vog fasta files will not work.

If you download files using dbPrep_getDBs.py, the script will also compute and output a listing of the local path/filenames for each of the downloaded (and preprocessed) files, which you can copy/paste into your configuration file.
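
A path given to blast can be checked before a run by looking for the alias or index file that the "No alias or index file found" error refers to. The following is a minimal sketch, assuming the usual BLAST+ file naming in which the extension is appended to the database name (.nin or .nal for nucleotide databases, .pin or .pal for protein databases). The function name blast_db_is_indexed is hypothetical and not part of multiPhATE2.

```python
import tempfile
from pathlib import Path

# Extensions of the alias (.nal/.pal) and index (.nin/.pin) files written by BLAST+
NUCLEOTIDE_EXTS = (".nal", ".nin")
PROTEIN_EXTS = (".pal", ".pin")

def blast_db_is_indexed(db_path, dbtype):
    """Return True if an alias or index file exists for the given database name.

    dbtype is "nucl" or "prot". Hypothetical helper for illustration only.
    """
    exts = NUCLEOTIDE_EXTS if dbtype == "nucl" else PROTEIN_EXTS
    # The extension is appended to the full name, e.g. vog.genes.tagged.all.fa.nin
    return any(Path(str(db_path) + ext).exists() for ext in exts)

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "refseq_protein"
        print(blast_db_is_indexed(root, "prot"))   # nothing formatted yet
        Path(str(root) + ".pal").touch()           # simulate a segmented database's alias file
        print(blast_db_is_indexed(root, "prot"))
```

For a segmented database such as NR, the check passes on the generic root (nr) because the alias file carries that name, which matches the naming advice above.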

I hope that helps.

-Carol

ShailNair commented 4 years ago

That's strange; I actually used dbPrep_getDBs.py (though I had to modify some lines to get the Swissprot and RefSeq databases). Let me run it again to format the VOGs database and check. Thanks.

ShailNair commented 4 years ago

Hi, I get the error "Failed to open target sequence database" when running with phmmer and jackhmmer enabled (= true). When phmmer and jackhmmer are turned off (= false), the pipeline completes successfully. I used dbPrep_getDBs.py to download and format all databases. Attached are the terminal output and config file.

Terminal_output.txt (https://github.com/carolzhou/multiPhATE2/files/5369765/Terminal_output.txt), sii.multiPhate.config.txt (https://github.com/carolzhou/multiPhATE2/files/5369769/sii.multiPhate.config.txt)

carolzhou commented 4 years ago

Hi Shail,

I see the problem that generates the error: BLAST Database error: No alias or index file found for nucleotide database [/home/ahail/Documents/multiphate2/Databases/VOGs/vog.genes.all.fa] in search path [/home/ahail/Documents/multiphate2::]. I’ll get back to you soon about how to address this.

Thanks, -Carol

carolzhou commented 4 years ago

The vog gene database index error should be fixed now. Please download the new multiPhate.py file.

Regarding phmmer/jackhmmer, these programs search protein databases. Which database(s) were you searching against? Are you seeing an error message in this regard?

ShailNair commented 4 years ago

Hi, yes, the database issue has been fixed, thanks. Regarding phmmer/jackhmmer, here is the error log: hmer_multiphate_terminotput.txt (https://github.com/carolzhou/multiPhATE2/files/5374812/hmer_multiphate_terminotput.txt)

It fails to search against NR, Swissprot, and refseq_protein, though blastp can search the same databases successfully.

carolzhou commented 4 years ago

Both phmmer and jackhmmer search sequence databases, as opposed to blast, which searches blast-formatted databases. Therefore, in order for either phmmer or jackhmmer to work with NR or Refseq Protein, the actual fasta sequences must be present in the respective directories. I did not include downloading of the NR and Refseq Protein sequence databases in the dbPrep_getDBs.py script, but these should indeed be added. In most cases, users do not download the sequences for these databases because they are huge. But for phmmer and jackhmmer, unfortunately, it is necessary, if one wishes to search against these databases. In the meantime, if you happen to have the fasta sequences for these databases on hand, then copy them over to your multiPhATE2/Databases/ subdirectories, or move the blast-formatted file to the same directory containing the actual sequences. Otherwise, you will need to manually download the fasta sequences for NR and Refseq Protein (it may take a long time).
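
If only the blast-formatted database is on hand, one possible route, which is not suggested anywhere in this thread, is BLAST+'s blastdbcmd, which can write the sequences of a formatted database back out as fasta. The following is a minimal sketch, assuming blastdbcmd is installed and on the PATH. The helper names and the paths in the usage comment are hypothetical, and for NR the output will be very large.

```python
import subprocess

def blastdbcmd_dump_args(db_root, out_fasta, dbtype="prot"):
    """Build the blastdbcmd argument list that writes every database entry as fasta."""
    return [
        "blastdbcmd",
        "-db", db_root,       # database root name, e.g. .../Refseq/Protein/refseq_protein
        "-dbtype", dbtype,    # "prot" or "nucl"
        "-entry", "all",      # every sequence in the database
        "-outfmt", "%f",      # fasta output
        "-out", out_fasta,
    ]

def dump_blast_db(db_root, out_fasta, dbtype="prot"):
    """Run blastdbcmd; raises subprocess.CalledProcessError if the tool reports a failure."""
    subprocess.run(blastdbcmd_dump_args(db_root, out_fasta, dbtype), check=True)

# Example call (hypothetical paths):
# dump_blast_db("Databases/Swissprot/swissprot", "Databases/Swissprot/swissprot.fa")
```

Which filename multiPhATE2 expects for the resulting fasta file should be taken from the naming conventions discussed in this thread rather than from this sketch.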

Regarding Swissprot, dbPrep_getDBs.py uses blast+’s downloader to download the blast-formatted Swissprot database, but that downloader does not fetch the actual sequences, so the script downloads the Swissprot sequences from a different source. You can rename the downloaded Swissprot sequence fasta file from “uniprot_sprot.fasta” to “swissprot.fa” (to match the name of the blast-formatted file). (I will modify the dbPrep_getDBs.py script to do this renaming automatically.) Then, the phmmer and jackhmmer searches against Swissprot should work. Let me know if that does not solve the problem.

I hope that solves these issues for you; let me know if you find any other potential improvements. Thanks for using multiPhATE2.

ShailNair commented 4 years ago

Yes, I see the problem now. I guess it should work with the fasta sequences, which unfortunately I do not have on hand right now. Maybe I can use the HMMER online tool (https://www.ebi.ac.uk/Tools/hmmer/search/phmmer) for that. Thanks for the help.

carolzhou commented 4 years ago

For phmmer and jackhmmer to run against the Swissprot database, in your configuration file you need to provide the path/filename as .../swissprot.fa. As written (swissprot), phmmer and jackhmmer will not recognize the database. I checked the dbPrep_getDBs.py script, and it already does the renaming for Swissprot.
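
Whether a configured path names an actual sequence file can be checked before a run. The following is a minimal sketch, assuming (as described above) that the hmm search programs need the path to point at a plain fasta file; looks_like_fasta is a hypothetical helper and not part of multiPhATE2.

```python
import tempfile
from pathlib import Path

def looks_like_fasta(path):
    """Return True if path is an existing file whose first non-blank line starts with '>'."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("r", errors="replace") as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file or only blank lines

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fasta = Path(tmp) / "swissprot.fa"
        fasta.write_text(">sp|EXAMPLE|demo\nMKT\n")
        print(looks_like_fasta(fasta))                    # the fasta file itself
        print(looks_like_fasta(Path(tmp) / "swissprot"))  # bare root name, no such file
```

In this demo the bare root name fails the check because no file of that name exists, which mirrors the .../swissprot versus .../swissprot.fa distinction described above.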

ShailNair commented 4 years ago

Thanks, but will it then still work with blast? Currently, with .../swissprot, blast is working smoothly.

carolzhou commented 4 years ago

In my testing, it seems that blast works in either case, but the hmm search programs only work if the extension is included on the file name. But do let me know if your observation is different.

carolzhou commented 4 years ago

I uploaded a wrong file. Please be sure you have the latest version of SequenceAnnotation/phate_blast.py.

ShailNair commented 4 years ago

OK, I will update the script and run again. Is it just the SequenceAnnotation/phate_blast.py file, or do I need to update others too?