billzt / MiFish

This is the command line version of MiFish pipeline. It can also be used with any other eDNA meta-barcoding primers
https://mitofish.aori.u-tokyo.ac.jp/mifish/
GNU General Public License v3.0
15 stars 3 forks source link

error parsing the blastxml in mifish/core/pipeline.py #2

Closed dwheelerau closed 1 year ago

dwheelerau commented 1 year ago

Hi There. I was getting no hit results back and I noticed that the percent identities and #miss-matches being reported in "haploids with low identities" tab on the output taxonomy spreadsheet didnt make any sense.

Looking at pipeline.py line 256-261

# /core/pipeline.py
for alignment in blast_record.alignments:
    hsp = alignment.hsps[0]
    aln_len = alignment.length
    identity = hsp.identities/aln_len
    if identity >= blast_identity/100:
        good_alns.append(alignment)

For me aln_len is reporting the length of the hit record in the database, not the HSP overlap length. This means the identity number is really much smaller than it should be. I fixed it by assigning aln_len to hsp.align_length (see below).

#/core/pipeline.py
for alignment in blast_record.alignments:
    hsp = alignment.hsps[0]
    aln_len = hsp.align_length #alignment.length
    identity = hsp.identities/aln_len
    if identity >= blast_identity/100:
        good_alns.append(alignment)

Now I get correct reporting on the identity because it is dividing by the HSP length and not the hit record length.

This also needs to be fixed on lines 266 and 289 (moving it below the hsp assignment which occurs on line 269 and 292, respectively)

billzt commented 1 year ago

Hi @dwheelerau , fixed. Thank you!