arzwa / wgd

Python package and CLI for whole-genome duplication related analyses. This package is deprecated in favor of https://github.com/heche-psb/wgd.
http://wgd.readthedocs.io/en/latest/
GNU General Public License v3.0

No .ovo.tsv file produced after one_v_one mcl step #10

Closed. BiodivGenomic closed this issue 5 years ago

BiodivGenomic commented 5 years ago

Hi, sorry to raise another issue, I'm testing each function of wgd extensively ;-) When I run wgd mcl for the paranome, I get a file "xxxx.blast.tsv.mcl" that I use for the next step (ksd). If I want to get the one-to-one orthologs among different species, I don't get this file, right? I tried between my species and Amborella, and I got a file "xxxxxx.ovo.tsv" that I used in step two. But when I tried with other species, I only got the "xxxxx.blast.tsv" file, which I cannot use in the second step without getting an error. I didn't get any error message during the first step, just no ovo.tsv. Do you know what could be the cause of this, and how to modify the .blast.tsv file to use it in the second step (I suspect it only needs minor edits to be usable)? Thanks a lot in advance!

arzwa commented 5 years ago

Hi, no problem, I obviously want the program to be as bug-free as possible, so please do report any issues you run into! Indeed, the xxxx.ovo.tsv file (ovo for 'one vs. one') is the file with one-to-one orthologs. Normally you should always get such a file when using wgd mcl with the --one_v_one flag and two different fasta files as sequences (e.g. -s fasta1,fasta2). As far as I can see, the only case where the described behavior could happen is when no reciprocal best hits are found (I should modify the code to give a warning in that case). That could easily happen if you used small test fasta files, or the same file twice (a hit of a gene to itself is not recorded).
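
To illustrate the "no reciprocal best hits" case, here is a minimal standalone sketch. It is not wgd's actual code. It assumes the .blast.tsv file is in standard 12-column BLAST tabular format (query in column 1, subject in column 2, bit score in column 12) and contains only hits between the two sequence sets; the function name `reciprocal_best_hits` is hypothetical. Self hits are skipped, and a warning is raised when no pair is found.

```python
import csv
import warnings


def reciprocal_best_hits(blast_lines):
    """Return a sorted list of (gene_a, gene_b) reciprocal best hit pairs.

    Assumes 12-column BLAST tabular lines: query, subject, ..., bit score.
    This is an illustrative sketch, not the wgd implementation.
    """
    best = {}  # query ID -> (bit score, subject ID) of its best hit so far
    for row in csv.reader(blast_lines, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        query, subject, score = row[0], row[1], float(row[11])
        if query == subject:
            continue  # a hit of a gene to itself is not recorded
        if query not in best or score > best[query][0]:
            best[query] = (score, subject)

    pairs = set()
    for query, (_, subject) in best.items():
        # Reciprocal: the best hit of my best hit is me.
        if subject in best and best[subject][1] == query:
            pairs.add(tuple(sorted((query, subject))))

    if not pairs:
        warnings.warn("no reciprocal best hits found; nothing to write")
    return sorted(pairs)
```

Calling it on the lines of a file, e.g. `reciprocal_best_hits(open(path))` with your own `path`, returns an empty list (plus a warning) in the situations described above, which would explain a missing ovo.tsv file without any error message.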

Another possible problem (although I'm not sure whether it still is one, I should have a look) is that some gene IDs contain a pipe character (|). In general, these should be avoided when using wgd.

BiodivGenomic commented 5 years ago

Indeed, the pipe character was likely causing the issue. Maybe either add a warning when it's found in the input file, or remove it while formatting the input files (just some ideas for a future update)? :-)
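
As a stopgap until something like that exists, the FASTA files can be cleaned before running wgd. The sketch below is a hypothetical helper, not part of wgd: it assumes the gene ID is the first whitespace-delimited token of each header line, replaces every pipe in that ID with an underscore (an arbitrary choice), and warns about how many IDs were changed.

```python
import warnings


def sanitize_fasta_ids(fasta_lines, replacement="_"):
    """Return FASTA lines with '|' replaced in the gene ID of each header.

    Hypothetical pre-processing helper; the gene ID is assumed to be the
    first whitespace-delimited token after '>'.
    """
    cleaned, changed = [], []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:].split(maxsplit=1)
            gene_id = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            if "|" in gene_id:
                changed.append(gene_id)
                gene_id = gene_id.replace("|", replacement)
            # Rebuild the header, keeping any description unchanged.
            line = ">" + gene_id + (" " + description if description else "")
        cleaned.append(line)

    if changed:
        warnings.warn(
            f"replaced '|' in {len(changed)} gene ID(s), e.g. {changed[0]!r}"
        )
    return cleaned
```

Note that replacing characters could in principle make two different IDs identical, so it is worth checking that the cleaned IDs are still unique before continuing.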