apetkau / microbial-informatics-2014

Microbial Whole Genome Sequence data analysis labs for 2014

problem in stage 9 parseBlast.log #3

Open sujanpau opened 4 years ago

sujanpau commented 4 years ago

Hello!!

I have converted the protein sequences into the OrthoMCL readable format using bioawk tool. My program is running but while at the stage 9 it shows me that the parseBlast.log command failed to execute. Is there any way we can solve this problem?

Thank you very much in advance!!

apetkau commented 4 years ago

Hello @sujanpau. Could you paste the specific error message you are getting here?

sujanpau commented 4 years ago

=Stage 9: Parse Blast Results=
cat /home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_results/blast_results.* > /home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/all.fasta
/home/mohammad/Documents/orthomcl/orthomclsoftware-custom/bin/orthomclBlastParser "/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/all.fasta" "/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/compliant_fasta" 1>/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/similarSequences.txt 2>/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/log/9.parseBlast.log
Error executing command: /home/mohammad/Documents/orthomcl/orthomclsoftware-custom/bin/orthomclBlastParser "/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/all.fasta" "/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/compliant_fasta" 1>/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/similarSequences.txt 2>/home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/log/9.parseBlast.log. See logs /home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/blast_load/similarSequences.txt and /home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/log/9.parseBlast.log

apetkau commented 4 years ago

Thanks. Could you also post the output of /home/mohammad/Documents/orthomcl/orthomcl-pipeline/orthomcl-output-small12/log/9.parseBlast.log.

Also, were these your own files, or from the tutorial? And are you using your own machine, or the tutorial virtual machine?

sujanpau commented 4 years ago

These are files from the NCBI database. I am using my own Linux machine, which has two operating systems installed. The log file says:

acquiring genes from Po82.fasta
acquiring genes from GMI1000.faa
'GMI1000.faa' is not in 'taxon.fasta' format

I am a beginner and I do not have much knowledge of, or experience with, Linux or programming.

I appreciate your consideration.

sujanpau commented 4 years ago

When I keep the taxon name the same for the nucleotide and protein sequences, I cannot give both files the .fasta extension, because the two file names would then be identical. When I give the two files different names and keep the .fasta extension on both, the program says the taxon names are not the same and the process is aborted.

Is there anything I can do to the input files to make it work?

Thank you very much!!

apetkau commented 4 years ago

Hello @sujanpau. Sorry for the late response.

OrthoMCL requires only the amino acid files and they must end with .fasta. I think what you'll have to do is make a new directory with only the amino acid sequence files so that they can all be renamed to end in .fasta and then run OrthoMCL on that directory. I hope that makes sense?
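As a rough sketch of that suggestion, the shell function below copies protein files into a fresh directory and gives each copy a .fasta extension. It assumes the protein files are the ones ending in .faa (as GMI1000.faa does in the log above); the function name prepare_protein_dir and the directory names in the usage line are made up for illustration and are not part of OrthoMCL or the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of OrthoMCL): copy every *.faa file from a
# source directory into a new directory, renaming each copy to *.fasta.
# Assumption: the amino acid files are the ones ending in .faa.
prepare_protein_dir() {
  local src="$1" dest="$2"
  local f base
  mkdir -p "$dest"
  for f in "$src"/*.faa; do
    [ -e "$f" ] || continue            # nothing matched the pattern
    base="$(basename "$f" .faa)"       # e.g. GMI1000.faa -> GMI1000
    cp "$f" "$dest/$base.fasta"        # originals are left untouched
  done
}
```

It could be called as prepare_protein_dir ncbi_downloads orthomcl_input (both directory names are placeholders), after which OrthoMCL would be pointed at the new directory. Copying rather than renaming in place keeps the nucleotide files and the original downloads as they were.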

sujanpau commented 4 years ago

Thank you very much for the reply. I am trying to find the protein IDs of the core genes shared by the genomes I am comparing. As you can see in the example below, OrthoMCL only tells me that the number of core genes is 40; it specifies the unique genes for both genomes but not the core genes. Could you please help me with this? Is there perhaps a command to get exactly the core genes? Thank you very much in advance. I look forward to hearing from you. I wish you an excellent weekend and I hope you had a happy Thanksgiving.

Number of genes seen in the following genomes:

CnebraskensisA6096: 2733
CphaseoliLPPA982_pCP: 114

Total genes seen: 2847

'Core' gene sets that is contained: 2 genomes has 40 genes

apetkau commented 4 years ago

You're welcome.

The core gene set can be extracted from the groups.txt file produced by OrthoMCL (https://github.com/apetkau/microbial-informatics-2014/tree/master/labs/orthomcl#step-7-looking-at-the-results).

Here you will see a list of closely related genes, one group per line. The core gene set consists of every line in which every genome you are investigating appears. You can extract these lines with grep. For example:

grep 'CnebraskensisA6096' groups.txt | grep 'CphaseoliLPPA982_pCP'

This should give you all the lines where both CnebraskensisA6096 and CphaseoliLPPA982_pCP appear, which should be your core gene set.
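With more than two genomes, the same idea can be wrapped in a small shell function that applies one grep per genome name. This is only a sketch: the name core_groups is made up, and it assumes each line of groups.txt looks like "group: taxon|gene taxon|gene ...", with a space before every taxon name and a "|" after it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the lines of a groups file that mention
# every genome given on the command line.
# Usage: core_groups GROUPS_FILE GENOME [GENOME ...]
# Assumption: members are written as " taxon|gene" (space, name, pipe).
core_groups() {
  local file="$1"
  shift
  local result taxon
  result="$(cat "$file")"
  for taxon in "$@"; do
    # -F treats the pattern as a fixed string; the leading space and
    # trailing "|" stop one genome name matching inside a longer one.
    result="$(printf '%s\n' "$result" | grep -F " ${taxon}|")"
  done
  [ -n "$result" ] && printf '%s\n' "$result"
}
```

For the two genomes here this would be called as core_groups groups.txt CnebraskensisA6096 CphaseoliLPPA982_pCP, and piping the result to wc -l gives the number of core groups, which can be compared with the 40 reported above.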