GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0

tama parser with diamond result not picking the top hit #108

Closed by olechnwin 10 months ago

olechnwin commented 1 year ago

Hi Richard,

I am hoping you can help with this. I'm trying to run diamond instead of blastp, because blastp takes a long time to run even after splitting the fasta file into 10k fasta files. The problem is that tama parser doesn't seem to pick up the top hits, even though the results from diamond and blastp look similar to me. I cannot figure out why tama parser doesn't work on the diamond result. I hope you can give me some insight.

These are the blastp result and the parsed blastp result: blastp_ensembl_rslt.txt, blastp_ensembl_rslt_parsed.txt

These are the diamond result and the parsed diamond result: diamond_rslt.txt, diamond_rslt_parsed.txt

As shown in blastp_ensembl_rslt_parsed.txt, tama parser was able to pick out the full_match, while in diamond_rslt_parsed.txt it somehow picked out the bad_match instead. There was no full_match anywhere in that file.

Thank you in advance for any insights you can give me. Cen

GenomeRIK commented 10 months ago

Hi Cen,

Sorry for the very delayed response!

I have never really tried using the parser on Diamond outputs, but the parser was quite difficult to build because it had to handle the nuanced formatting of the blastp output without messing it up. I suspect Diamond uses slightly different space characters, which is throwing the parser off.
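One way to test that suspicion is to count the unusual whitespace and control characters in each output file and see where the two differ. The following is only a minimal diagnostic sketch, not part of TAMA: the function names `whitespace_profile` and `compare_profiles` are hypothetical, and it assumes both outputs can be read as text.

```python
import unicodedata
from collections import Counter


def whitespace_profile(text):
    """Count whitespace/control characters other than plain space and newline."""
    counts = Counter()
    for ch in text:
        if ch in (" ", "\n"):
            continue  # ordinary separators, not interesting here
        if ch.isspace() or unicodedata.category(ch).startswith("C"):
            # Control characters such as tab have no Unicode name,
            # so fall back to their repr (for example '\t').
            counts[unicodedata.name(ch, repr(ch))] += 1
    return dict(counts)


def compare_profiles(blastp_text, diamond_text):
    """Return {character: (blastp_count, diamond_count)} where the counts differ."""
    a = whitespace_profile(blastp_text)
    b = whitespace_profile(diamond_text)
    return {
        name: (a.get(name, 0), b.get(name, 0))
        for name in sorted(set(a) | set(b))
        if a.get(name, 0) != b.get(name, 0)
    }
```

To use it, read blastp_ensembl_rslt.txt and diamond_rslt.txt with `open(path, newline="")` so that carriage returns are preserved, then pass the two strings to `compare_profiles`. An empty result would suggest the difference lies elsewhere, for example in the number of spaces used for alignment or in the section headers, rather than in the kind of whitespace character.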

Unfortunately, I have not been able to code in a while because I started my own company, so I do not have time to build a parser for Diamond. It might be worth checking whether anyone has built a Diamond-to-blastp converter, or contacting the Diamond developers to ask how their output deviates from standard blastp formatting.

Sorry I could not be of more help.

Thank you, Richard