error parsing the blastxml in mifish/core/pipeline.py

Hi there. I was getting no hits back, and I noticed that the percent identities and mismatch counts reported in the "haploids with low identities" tab of the output taxonomy spreadsheet didn't make any sense.

Looking at pipeline.py, lines 256-261:

# /core/pipeline.py
for alignment in blast_record.alignments:
    hsp = alignment.hsps[0]
    aln_len = alignment.length
    identity = hsp.identities/aln_len
    if identity >= blast_identity/100:
        good_alns.append(alignment)

For me, aln_len ends up holding the length of the hit record in the database, not the length of the HSP alignment. Dividing the identities by that number makes the computed identity much smaller than it should be. I fixed it by setting aln_len to hsp.align_length (see below).

# /core/pipeline.py
for alignment in blast_record.alignments:
    hsp = alignment.hsps[0]
    aln_len = hsp.align_length  # was: alignment.length (length of the hit record)
    identity = hsp.identities/aln_len
    if identity >= blast_identity/100:
        good_alns.append(alignment)

Now I get correct reporting on the identity because it is dividing by the HSP length and not the hit record length.
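
To make the size of the error concrete, here is a small sketch with made-up numbers. FakeAlignment and FakeHSP are hypothetical stand-ins that only mimic the three attributes used above (alignment.length, hsp.align_length, hsp.identities); the lengths and the 97% threshold are illustrative assumptions, not values from my run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeHSP:
    """Hypothetical stand-in for a parsed HSP."""
    identities: int    # number of identical positions in the HSP
    align_length: int  # length of the HSP alignment


@dataclass
class FakeAlignment:
    """Hypothetical stand-in for a parsed hit (alignment) record."""
    length: int        # full length of the hit sequence in the database
    hsps: List[FakeHSP]


# Assumed example: a 170 bp amplicon hitting a 16,500 bp database record.
alignment = FakeAlignment(
    length=16500,
    hsps=[FakeHSP(identities=168, align_length=170)],
)
hsp = alignment.hsps[0]
blast_identity = 97  # percent, assumed threshold

# Original calculation: divides by the hit record length.
identity_vs_hit = hsp.identities / alignment.length   # about 0.010

# Fixed calculation: divides by the HSP alignment length.
identity_vs_hsp = hsp.identities / hsp.align_length   # about 0.988

passes_old = identity_vs_hit >= blast_identity / 100
passes_new = identity_vs_hsp >= blast_identity / 100
print(passes_old, passes_new)
```

With these numbers a near-perfect match is rejected by the original calculation and accepted by the fixed one, which matches the "no hits" symptom I was seeing.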

The same fix is needed on lines 266 and 289. In both places the aln_len assignment has to move below the hsp assignment, which happens on lines 269 and 292 respectively.
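
Since the same calculation appears in three places, one option would be to factor it into a helper so the fix only lives in one spot. This is just a sketch of the idea, not a patch against the actual file: hsp_identity and filter_by_identity are names I made up, and the SimpleNamespace objects stand in for the parsed BLAST records.

```python
from types import SimpleNamespace


def hsp_identity(alignment):
    """Fractional identity of the first HSP, relative to the HSP length."""
    hsp = alignment.hsps[0]
    return hsp.identities / hsp.align_length


def filter_by_identity(alignments, blast_identity):
    """Keep alignments whose first HSP meets the percent identity threshold."""
    threshold = blast_identity / 100
    return [aln for aln in alignments if hsp_identity(aln) >= threshold]


# Hypothetical stand-in records with made-up numbers.
good = SimpleNamespace(
    length=16500,
    hsps=[SimpleNamespace(identities=168, align_length=170)],
)
poor = SimpleNamespace(
    length=16500,
    hsps=[SimpleNamespace(identities=150, align_length=170)],
)

kept = filter_by_identity([good, poor], blast_identity=97)
print(len(kept))
```

Each of the three loops could then call the helper instead of recomputing aln_len itself.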
