chanzuckerberg / czid-workflows

Portable WDL workflows for CZ ID production pipelines
https://czid.org/
MIT License

Calculation of bit score and E-value from minimap2 output, as well as count data query #373

Open · charlesfoster opened this issue 3 months ago

charlesfoster commented 3 months ago

Hi there,

I've been looking under the hood a bit to try to figure out how the pipeline assigns counts from reads to different taxonomic ranks. Necessarily, this has also involved looking at how the bit score and E-value for hits are calculated based on the output of minimap2.

I came across the following class:

import math


class QualityCalculations:
    def __init__(self, genome_size=832400799511):
        self._k = 0.1
        self._lambda = 1.58
        self.genome_size = genome_size

    def calc_bitscore(self, alen, nonmatch):
        score = alen - 2 * nonmatch
        return (score * self._lambda - math.log(self._k)) / math.log(2.0)

    def calc_evalue(self, alen, nonmatch):
        # we want to keep -self._lambda * score negative otherwise we could start
        #   getting overflow errors. So we don't let score go below 0.
        score = max(0, alen - 2 * nonmatch)
        return self._k * alen * self.genome_size * math.exp(-self._lambda * score)

    def calc_gap_openings(self, cigar):
        go = 0
        for char in cigar:
            if char == "I" or char == "D":
                go += 1
        return go
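
For reference, here's how I understand these functions being applied to a single PAF record, reusing the QualityCalculations class quoted above. This is purely my own sketch of the calculation; the field choices (alignment block length from column 11, non-matches as column 11 minus column 10) are my guesses, not necessarily what the actual conversion script does:

# One made-up PAF record: 150 aligned columns, 148 matching bases, no gaps
paf_line = "read1\t150\t0\t150\t+\tref1\t160000\t1000\t1150\t148\t150\t60\tNM:i:2\tAS:i:140\tcg:Z:150M"
fields = paf_line.split("\t")

matches = int(fields[9])     # PAF column 10: number of residue matches
alen = int(fields[10])       # PAF column 11: alignment block length
nonmatch = alen - matches    # 2 in this example

qc = QualityCalculations()
qc.calc_bitscore(alen, nonmatch)   # (146 * 1.58 - ln 0.1) / ln 2  ->  ~336 bits
qc.calc_evalue(alen, nonmatch)     # 0.1 * 150 * 832400799511 * exp(-1.58 * 146)
qc.calc_gap_openings("150M")       # 0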

A few questions related to this: (1) Why is the raw score input to the bit score calculation taken to be the aligned length minus 2 × the number of non-matching bases (nonmatch)? Couldn't the alignment score (AS) from minimap2 be used for this purpose? For that matter, couldn't AS itself be used for ranking the hits? (https://github.com/lh3/minimap2/issues/141#issuecomment-378684214)
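
To illustrate what I mean in (1), something like the following would take the score straight from minimap2's AS:i: tag instead of recomputing it from the alignment length and non-matches. The AS tag is standard minimap2 output, but using it this way is just my suggestion, not something the pipeline currently does:

import math

def alignment_score_from_tags(paf_fields):
    """Return the AS:i: value from a PAF record's optional tags, or None."""
    for tag in paf_fields[12:]:
        name, _type, value = tag.split(":", 2)
        if name == "AS":
            return int(value)
    return None

fields = "read1\t150\t0\t150\t+\tref1\t160000\t1000\t1150\t148\t150\t60\tNM:i:2\tAS:i:140".split("\t")
score = alignment_score_from_tags(fields)                    # 140, straight from minimap2
bitscore = (score * 1.58 - math.log(0.1)) / math.log(2.0)    # same lambda/K as the class above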

(2) How were the values for lambda, k and "genome size" (database size?) chosen?
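
For context, my understanding is that the two functions above follow the standard Karlin-Altschul formulas, $S' = (\lambda S - \ln K)/\ln 2$ and $E = K m n e^{-\lambda S}$, with the query length $m$ effectively replaced by the alignment length and the database length $n$ by genome_size, so I'm mainly curious whether these particular constants were fitted for this scoring scheme or taken from elsewhere.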

Related, could you please explain briefly how, after converting the PAF-format file to m8 format, the results are interrogated to determine the optimal taxon assignment for each read? (see: https://github.com/chanzuckerberg/czid-workflows/blob/a7c54be041e9fbd4cfdd9fac99226e6b4f6f3bfd/lib/idseq-dag/idseq_dag/util/m8.py). I tried to follow the code but got a little stuck. Would I be correct in guessing that, for each read, the hits are scanned to find those with annotations at the species level, and the most common species among them is then taken to be the best assignment? And how are bit scores etc. fed into this process?
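
To make that guess concrete, this is roughly the per-read logic I had in mind while reading m8.py. It's entirely my speculation, with hypothetical helper names, so please correct me where it diverges from the real code:

from collections import Counter

# "hits" would be the m8 records for a single read; "species_of" a lookup from
# subject accession to species-level taxid (both names are hypothetical).
def guess_best_species(hits, species_of):
    species = [species_of(hit.subject_accession) for hit in hits]
    species = [s for s in species if s is not None]
    if not species:
        return None  # presumably fall back to genus or higher ranks?
    # most common species among the hits; I'm unsure whether bit score or
    # E-value acts as a pre-filter or tie-breaker here
    return Counter(species).most_common(1)[0][0]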

While comparing counts across pipelines, I found a greater count for one genus of interest with CZ ID (run through the web app). As you can see here, the genus Roseolovirus had a count of 9356 reads (based on a search against the nt database):

[screenshot: czid_count_query]

I clicked to download the reads associated with this genus: [screenshot]

But there are only 4183 reads in the file:

$ grep -c ">" sample1_roseolovirus-hits.fasta
4183

What explains this discrepancy (9356 vs 4183)? Am I correct in guessing that it comes from the generate_taxon_count_json_from_m8() function, where the counts of duplicate reads are added back in?
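
For what it's worth, my mental model is something like the sketch below, where each unique (representative) read in the downloaded FASTA is weighted by the size of its duplicate cluster from the dedup step. The names here are hypothetical, and I'm only guessing that this is where the expansion happens:

# My guess at how 4183 unique reads could expand back to a count of 9356.
def taxon_read_count(unique_read_ids, cluster_size):
    """Sum duplicate-cluster sizes over the unique reads assigned to a taxon."""
    return sum(cluster_size.get(read_id, 1) for read_id in unique_read_ids)

# e.g. taxon_read_count(ids_from_fasta, cluster_size) == 9356
# even though len(ids_from_fasta) == 4183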

Thanks for your time!