apetkau / orthomcl-pipeline

Automates running of OrthoMCL software from http://orthomcl.org/common/downloads/software/v2.0/

Is there a problem with my input file? #29

Closed jingydz closed 5 years ago

jingydz commented 5 years ago

=Stage 1: Validate Files =
Validating mfilter.fasta ... 47599 sequences
Error: file /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/ceshi2_in/bfilter.fasta contains a sequence (TRINITY_DN17801_c2_g3_i1.p2) containing non-protein alphabet (dna) at /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/bin/../scripts/orthomcl-pipeline.pl line 357, line 50938.
Validating bfilter.fasta ...

The above is the error reported during my run. I looked at the sequence and didn't see anything wrong with it, but after I deleted that sequence the pipeline ran to completion. Although I got the output files, I'd like to ask why Stage 1 reported this error.

apetkau commented 5 years ago

My best guess as to why you got the error is that TRINITY_DN17801_c2_g3_i1.p2 is using a DNA alphabet instead of an amino acid alphabet (or at least, this is what BioPerl detected). OrthoMCL requires protein sequences as a string of amino acids. I cannot really give much more information than this without seeing that particular sequence.

jingydz commented 5 years ago

Terribly sorry, I forgot to upload the sequence that caused the error:

101995 >TRINITY_DN17801_c2_g3_i1.p2 type:internal len:105 gc:universal TRINITY_DN17801_c2_g3_i1:3-314(+)
101996 GCGYYSGGSGGGSSCGGGSSGGGSSCGGGGGGSYGGGSSCGGGGGSGGGVKYSGGGGSSCGGGYSGGGGSSCGGGYSGGGGGSSCGGGSSGGGSSCGGGGGSGG

Two sets of protein sequences have this problem. Someone sent me this test file and asked me to help her test it, so I do not know whether the file itself contains errors. To keep the program running properly, I deleted these two lines. Is that a problem?

apetkau commented 5 years ago

Okay, I think I have an idea of what's going on. The BioPerl automatic detection of the sequence alphabet is confused because the sequence data in that record could be read as either DNA or protein.
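To illustrate why that record is ambiguous, here is a minimal sketch in Python. It is not BioPerl's actual detection heuristic, only a set-membership check, and the function name `possible_alphabets` is made up for this example. It shows that every letter in the reported sequence is both a valid IUPAC nucleotide code (Y, S, V and K are ambiguity codes) and a valid amino acid code.

```python
# IUPAC nucleotide codes, including ambiguity codes (e.g. Y = C or T, S = G or C).
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")
# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def possible_alphabets(seq):
    """Return the alphabets whose letters cover every character in seq.

    Hypothetical helper for illustration; BioPerl uses its own heuristic.
    """
    letters = set(seq.upper())
    result = []
    if letters <= IUPAC_NUCLEOTIDE:
        result.append("dna")
    if letters <= AMINO_ACIDS:
        result.append("protein")
    return result


# The sequence of TRINITY_DN17801_c2_g3_i1.p2 as posted in this thread.
reported = (
    "GCGYYSGGSGGGSSCGGGSSGGGSSCGGGGGGSYGGGSSCGGGGGSGGGVKYSGGGGSSCGGGYSGGGG"
    "SSCGGGYSGGGGGSSCGGGSSGGGSSCGGGGGSGG"
)

print(sorted(set(reported)))         # ['C', 'G', 'K', 'S', 'V', 'Y']
print(possible_alphabets(reported))  # ['dna', 'protein']
```

Because the record uses only six distinct letters and all of them are valid in both alphabets, a detector that has nothing but the sequence to go on cannot tell the two apart.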

I've added a small fix for this in https://github.com/apetkau/orthomcl-pipeline/pull/30. I am wondering if you can test this out to make sure it fixes your problem? The new code should be in branch fix-invalid-alphabet.

jingydz commented 5 years ago

Hi, I'm so sorry for the delay in replying. My teacher's server had some problems and I was busy with my studies a few weeks ago. Thank you very much again, it fixed my problem. 😭😭😭

But I have another problem. Someone asked me to run a set of data for her, but it went wrong in step 9. The error message is as follows:

Stage 8 took 3532.28 minutes done
=Stage 9: Parse Blast Results=
cat /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_results/blast_results.* > /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/all.fasta
/data/users/zhangjingjing/OrthoMCL/orthomclSoftware-v2.0.9/bin/orthomclBlastParser "/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/all.fasta" "/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/compliant_fasta" 1>/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/similarSequences.txt 2>/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/log/9.parseBlast.log
Error executing command: /data/users/zhangjingjing/OrthoMCL/orthomclSoftware-v2.0.9/bin/orthomclBlastParser "/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/all.fasta" "/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/compliant_fasta" 1>/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/similarSequences.txt 2>/data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/log/9.parseBlast.log. See logs /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/blast_load/similarSequences.txt and /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/log/9.parseBlast.log

And I checked the error log:

[root@GenEngine 20190423_out]# cat /data/users/zhangjingjing/OrthoMCL/orthomcl-pipeline/20190423_out/log/9.parseBlast.log
acquiring genes from b.fasta
acquiring genes from f.fasta
acquiring genes from m.fasta
acquiring genes from musfinalpep.fasta
couldn't find taxon for gene 'musfinalpep|ENSMU' at /data/users/zhangjingjing/OrthoMCL/orthomclSoftware-v2.0.9/bin/orthomclBlastParser line 105, line 29512073.
[root@GenEngine 20190423_out]#

Sorry to bother you again, but I am only a sophomore who started studying bioinformatics a few months ago, so I don't have much background knowledge. Can you help me with this problem? Also, do I have to run everything again from step 1? Since it takes so long, can I run it from step 9 only? I would really appreciate a reply.

apetkau commented 5 years ago

No problem.

What does the file musfinalpep.fasta look like? It may be the case that the FASTA sequence entries in this file are not formatted correctly for OrthoMCL. If possible, could you send me the file? You can email it to me if you wish.
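One way to get a first look at the headers is a small script like the sketch below. It makes two assumptions that are not confirmed anywhere in this thread: that each identifier should have the shape `taxon|protein_id` with no whitespace, and that identifiers should be unique within a file. The helper name `check_headers` is hypothetical, and the sample IDs are invented apart from `musfinalpep|ENSMU`, which is taken from the log above.

```python
import re
from collections import Counter

# Assumed identifier shape: "taxon|protein_id", no whitespace, exactly one pipe.
ID_PATTERN = re.compile(r"^[^|\s]+\|[^|\s]+$")


def check_headers(lines):
    """Scan FASTA lines and return (problems, duplicates).

    problems   -- (line_number, header) pairs whose first field does not
                  match the assumed "taxon|protein_id" shape
    duplicates -- sorted identifiers that occur on more than one header
    """
    ids = []
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].rstrip("\n")
        fields = header.split()
        first = fields[0] if fields else ""
        if not ID_PATTERN.match(first):
            problems.append((number, header))
        ids.append(first)
    duplicates = sorted(i for i, count in Counter(ids).items() if count > 1)
    return problems, duplicates


# Invented sample: two records share one ID and one header lacks the pipe.
sample = [
    ">musfinalpep|ENSMU\n", "MKV\n",
    ">musfinalpep|ENSMU\n", "MKV\n",
    ">bad header\n", "AAA\n",
]
problems, duplicates = check_headers(sample)
print(problems)    # [(5, 'bad header')]
print(duplicates)  # ['musfinalpep|ENSMU']
```

To run it on a real file, pass an open file handle, for example `check_headers(open("musfinalpep.fasta"))`. Any reported problems or duplicates would be a starting point for checking the formatting, not a diagnosis of the step 9 error.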

And no, there is no way to run just from step 9.