arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0
81 stars 41 forks source link

Issue with mcl analysis (step 1) #17

Closed claudiacln87 closed 5 years ago

claudiacln87 commented 5 years ago

Hi Arthur, I am trying to use wgd tool to highly detect whole genome duplication in a black fungus, but I got an error during the step 1 (blast+ and mcl analysis) and I cannot get a .mcl file to proceed with analysis (I also tried with other genome sequences but got same error) Thank you very much in advance. I am using this command:

source activate wgd
module load ncbi-blast
module load mcl 
export LC_ALL=en_US.utf8
wgd mcl --cds --mcl -s Friedmanniomyces_endolithicus_CCFEE_5311.cds.fasta -o wgd__mcl_blast_output_Friedmanniomyces -n 4

This is the error:

2019-05-06 06:22:28: INFO       makeblastdb: 2.2.30+
Package: blast 2.2.30, build Oct 27 2014 16:58:06
2019-05-06 06:22:28: INFO       blastp: 2.2.30+
Package: blast 2.2.30, build Oct 27 2014 16:58:06
2019-05-06 06:22:28: INFO       mcl 14-137
Copyright (c) 1999-2014, Stijn van Dongen. mcl comes with NO WARRANTY
to the extent permitted by law. You may redistribute copies of mcl under
the terms of the GNU General Public License.
2019-05-06 06:22:28: INFO       Output directory: wgd__mcl_blast_output_Friedmanniomyces does not exist, will make it.
2019-05-06 06:22:28: INFO       CDS sequences provided, will first translate.
Sequence length != multiple of 3 for B0A54_02501-T1!
Invalid codon  CT in B0A54_02501-T1
Sequence length != multiple of 3 for 625353)]!
Invalid codon   T in 625353)]
Sequence length != multiple of 3 for 279254)]!
Invalid codon  AA in 279254)]
....
|#########################################################################################################| Elapsed Time: 0:00:21 Time:  0:00:21
2019-05-06 06:22:51: WARNING    There were 146 warnings during translation
2019-05-06 06:22:51: INFO       Writing blastdb sequences to db.fasta.
2019-05-06 06:22:51: INFO       Writing query sequences to query.fasta.
2019-05-06 06:22:51: INFO       Performing all-vs.-all Blastp (this might take a while)
2019-05-06 06:22:51: INFO       Making Blastdb

Building a new DB, current time: 05/06/2019 06:22:51
New DB name:   wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta
New DB title:  wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B

volume: wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta

file: wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.pin
file: wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.phr
file: wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.psq

BLAST Database creation error: FASTA-Reader: No residues given
2019-05-06 06:22:52: INFO       Running Blastp
2019-05-06 06:22:52: INFO       blastp -db wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta -query wgd__mcl_blast_output_Friedmanniomyces/37522ff94b818a.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 4 -out wgd__mcl_blast_output_Friedmanniomyces/Friedmanniomyces_endolithicus_CCFEE_5311.cds.fasta.blast.tsv
2019-05-06 06:22:52: INFO       All versus all Blastp done
rm: cannot remove ‘wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.phr’: No such file or directory
rm: cannot remove ‘wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.pin’: No such file or directory
rm: cannot remove ‘wgd__mcl_blast_output_Friedmanniomyces/37522ff93b3a44.db.fasta.psq’: No such file or directory
2019-05-06 06:22:52: INFO       Blast done
2019-05-06 06:22:52: INFO       Performing MCL clustering (inflation factor = 2.0)
2019-05-06 06:22:52: INFO       Started MCL clustering (mcl)
2019-05-06 06:22:52: INFO       Done
arzwa commented 5 years ago

Hi, I believe there must be something in your input fasta files that breaks the database creation step. For example here someone reported a similar problem, seemingly due to having empty records/blank lines in the fasta file. Note that sequences of length < 3 will also result in 'empty' sequences since the CDS file is translated in wgd. In general, I would recommend to filter the fasta file so that you only retain valid coding sequences before starting analyses using wgd.

Let me know if this helps, Arthur

claudiacln87 commented 5 years ago

Hi Arthur, I tried with your suggestion. Now the .log file says that, I am not sure if it is working (unless that step requires a while):

2019-05-06 09:51:53: INFO   makeblastdb: 2.2.30+
Package: blast 2.2.30, build Oct 27 2014 16:58:06
2019-05-06 09:51:54: INFO   blastp: 2.2.30+
Package: blast 2.2.30, build Oct 27 2014 16:58:06
2019-05-06 09:51:54: INFO   mcl 14-137
Copyright (c) 1999-2014, Stijn van Dongen. mcl comes with NO WARRANTY
to the extent permitted by law. You may redistribute copies of mcl under
the terms of the GNU General Public License.
2019-05-06 09:51:54: INFO   Output directory: wgd__mcl_blast_output_Frie_2 does not exist, will make it.
2019-05-06 09:51:54: INFO   Writing blastdb sequences to db.fasta.
2019-05-06 09:51:54: INFO   Writing query sequences to query.fasta.
2019-05-06 09:51:54: INFO   Performing all-vs.-all Blastp (this might take a while)
2019-05-06 09:51:54: INFO   Making Blastdb

Building a new DB, current time: 05/06/2019 09:51:54
New DB name:   wgd__mcl_blast_output_Frie_2/37524d2d3fc116.db.fasta
New DB title:  wgd__mcl_blast_output_Frie_2/37524d2d3fc116.db.fasta
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 18027 sequences in 1.24904 seconds.
2019-05-06 09:51:55: INFO   Running Blastp
2019-05-06 09:51:55: INFO   blastp -db wgd__mcl_blast_output_Frie_2/37524d2d3fc116.db.fasta -query wgd__mcl_blast_output_Frie_2/37524d2d46ce94.query.fasta -evalue 1e-10 -outfmt 6 -num_threads 4 -out wgd__mcl_blast_output_Frie_2/F.endolithicus.cds.fasta.blast.tsv

Thanks, Claudia

arzwa commented 5 years ago

Hi Claudia, yes this step can take a while (it is an all-versus.-all blastp search, blasting each gene to the full set of all genes). Hopely the analyses run without further troubles.

Best, Arthur