arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

wgd mcl error #19

Closed qiuxx221 closed 5 years ago

qiuxx221 commented 5 years ago

Hi,

I am running the command

wgd mcl --cds -s NR_Error_corrected_K19_rename.fasta -o ./ -n 8

using PacBio isoform sequencing data, but for some reason the clustering step didn't produce an mcl file. Do you know what the problem is?

Part of the error msg is below,

wgd mcl --cds -s NR_Error_corrected_K19_rename.fasta -o ./ -n 8
2019-07-16 13:40:11: INFO   makeblastdb: 2.9.0+
 Package: blast 2.9.0, build Mar 11 2019 15:20:05
2019-07-16 13:40:11: INFO   blastp: 2.9.0+
 Package: blast 2.9.0, build Mar 11 2019 15:20:05
2019-07-16 13:40:11: INFO   CDS sequences provided, will first translate.
Invalid codon gaa in transcript/0
Sequence length != multiple of 3 for transcript/6137!
Invalid codon gTC in transcript/6137
Invalid codon agt in transcript/4090
Sequence length != multiple of 3 for transcript/2047!
Invalid codon ggc in transcript/2047
.
.
.
Ignoring sequence 'lcl|38592' as it has no sequence data
Ignoring sequence 'lcl|38593' as it has no sequence data
Ignoring sequence 'lcl|38594' as it has no sequence data
Ignoring sequence 'lcl|38595' as it has no sequence data

In the end, I have the blast-blast tsv file only...

arzwa commented 5 years ago

First of all, using wgd only makes sense for CDS (coding DNA) sequences, so make sure that the data you provide consists of nice strings of DNA that can be translated into proteins. Secondly, you might have some issues with your sequence IDs. In general, it's best to avoid pipe characters (|) in sequence IDs. Also note that everything after the first space is ignored.
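To illustrate the sequence ID advice, here is a minimal sketch of a header clean-up step that could be run before calling wgd. It is not part of wgd: the function name `clean_fasta_ids` is hypothetical, and replacing `|` with `_` is an arbitrary choice made for this example. It keeps only the text before the first whitespace in each header and refuses to continue if that produces an empty or duplicate ID.

```python
def clean_fasta_ids(lines):
    """Yield FASTA lines with simplified headers (hypothetical helper, not wgd code).

    Header lines keep only the ID before the first whitespace, with pipe
    characters replaced by underscores. Sequence lines pass through unchanged.
    """
    seen = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Everything after the first whitespace is dropped.
            parts = line[1:].split(None, 1)
            seq_id = parts[0].replace("|", "_") if parts else ""
            if not seq_id:
                raise ValueError("empty sequence ID in header: %r" % line)
            if seq_id in seen:
                raise ValueError("duplicate sequence ID after clean-up: %s" % seq_id)
            seen.add(seq_id)
            yield ">" + seq_id
        else:
            yield line


if __name__ == "__main__":
    example = [">lcl|38592 some description\n", "ATGGAATAA\n"]
    for out_line in clean_fasta_ids(example):
        print(out_line)
```

Checking for duplicates matters because two headers that differ only after the first space would otherwise collapse to the same ID.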

qiuxx221 commented 5 years ago

Thanks for your reply. Regarding the coding DNA sequence: I am using a de novo transcriptome. Does that mean I should run TransDecoder first to find out which sequences encode proteins? Does it matter whether a sequence has a full ORF, or is it OK for it to be 5' partial?

Thanks!

arzwa commented 5 years ago

I don't have a lot of experience with analyzing transcriptomes, so I'm afraid I can't be of much help here, but yes, you absolutely need a protein-coding DNA sequence, since Ks is a distance defined at the codon level, and wgd uses codon-level alignments and codon models as implemented in codeml to compute Ks distances. I guess you can provide a 5' partial ORF, since wgd will translate codon by codon from the start of the sequence and will stop at the first stop codon or when the sequence ends (ignoring the last nucleotides if the sequence length is not a multiple of 3).
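The translation behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not wgd's actual code: the name `translate_cds` is hypothetical, the standard genetic code is assumed, and uppercasing the input is a choice made for this sketch only. It can be used to check in advance which sequences in a FASTA file would translate cleanly.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(seq):
    """Translate codon by codon from position 0 (hypothetical helper).

    Stops at the first stop codon or at the end of the sequence, and ignores
    trailing nucleotides when the length is not a multiple of 3.
    """
    seq = seq.upper()  # assumption of this sketch, not confirmed wgd behaviour
    usable = len(seq) - len(seq) % 3
    protein = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None:
            # Covers ambiguous bases such as N.
            raise ValueError("invalid codon %s at position %d" % (codon, i))
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(translate_cds("ATGGAATAAGGG"))  # stops at the TAA stop codon
    print(translate_cds("ATGGAAGT"))      # trailing GT is ignored
```

Because translation always starts at the first nucleotide, a 5' partial ORF only works if the sequence is already in frame; a transcript with a leading UTR or a frame offset would be translated in the wrong frame.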