Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

How to get >=q30 cluster count per lane #226

Closed yelekley closed 4 years ago

yelekley commented 4 years ago

Hello, please suggest how to get the >=Q30 cluster count per lane. Thank you.

ezralanglois commented 4 years ago

Just checking, what sequencer is this for?

Also, InterOp supports C++, Python and C#. Which language do you want to use to get the metric?

yelekley commented 4 years ago

NovaSeq, MiSeq and NextSeq. I prefer Python. Thanks.

ezralanglois commented 4 years ago

Q30 is defined on the basecalls, not the clusters. A metric like PF would be on the clusters.

So, I assume you want the total number of base calls (e.g. cluster_count*cycle_count).

Here is an example that will allow you to load the summary metrics (follow this up until In [8]:): https://github.com/Illumina/interop/blob/044fbc86ec32b3c079af37ad57f5ceae273d1c5c/docs/src/Tutorial_01_Intro.ipynb

    for read_index in range(summary.size()):
        for lane_index in range(summary.lane_count()):
            fraction_gt_q30 = summary.at(read_index).at(lane_index).percent_gt_q30().mean()/100
            yield_g = summary.at(read_index).at(lane_index).yield_g().mean()
            bases_gt_q30 = fraction_gt_q30*yield_g*1e9
            lane_number = summary.at(read_index).at(lane_index).lane()

bases_gt_q30 gives you the total number of called bases that are >= Q30, and lane_number gives you the corresponding lane.
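
The arithmetic above can be tried without the library. This is a minimal sketch, assuming the %Q30 and yield values are already available as plain numbers for each read and lane. The function name bases_ge_q30_per_lane and the sample values are hypothetical and not part of InterOp. It also sums across reads to give one total per lane, which the loop above does not do.

```python
from collections import defaultdict


def bases_ge_q30_per_lane(rows):
    """Sum >=Q30 base counts per lane.

    rows: iterable of (lane_number, percent_gt_q30, yield_g) tuples, one per
    read/lane pair. percent_gt_q30 is a percentage (0-100) and yield_g is in
    gigabases; both are assumed to be plain floats.
    """
    totals = defaultdict(float)
    for lane_number, percent_gt_q30, yield_g in rows:
        fraction = percent_gt_q30 / 100.0
        # Convert gigabases to bases, then keep only the >=Q30 fraction.
        totals[lane_number] += fraction * yield_g * 1e9
    return dict(totals)


# Made-up values for illustration only: two reads on each of two lanes.
example_rows = [
    (1, 90.0, 10.0),
    (2, 80.0, 10.0),
    (1, 85.0, 10.0),
    (2, 75.0, 10.0),
]
per_lane = bases_ge_q30_per_lane(example_rows)
```

With real data, each tuple would be filled from lane(), percent_gt_q30() and yield_g() inside the nested loop.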

yelekley commented 4 years ago

Thank you so much for your help. Now I just need to get the ChipResultsSummary yield that is listed in Bustard. I found an example of how to parse the Tile Metric binary file and got the total clusterCountPF and clusterCountRaw for the run. I just need to figure out how to get the yield metric so that I can avoid parsing the BustardSummary.xml file. Is it accessible through the yield_g function? Thanks.

ezralanglois commented 4 years ago

Yes

yelekley commented 4 years ago

This returns an error saying there is no attribute 'mean':

    fraction_gt_q30 = summary_lane.at(read_index).at(lane_index).percent_gt_q30().mean()/100
    AttributeError: 'float' object has no attribute 'mean'

Thanks.

yelekley commented 4 years ago

Another question, if you don't mind. When I calculate the total yield and the projected yield for the run, the projected yield is very close to the value in BustardSummary under the chip results summary, while the sum of yield_g*1e9 is always less than that value. So is it the projected yield that is reported in Bustard, and not the actual yield? Thanks. Here is the code:

tyield = 0  # total yield so far, in bases
pyield = 0  # projected yield, in bases
for read_index in range(summary.size()):
    for lane_index in range(summary.lane_count()):
        # yield_g() and projected_yield_g() are in gigabases, hence the 1e9
        yield_g = summary.at(read_index).at(lane_index).yield_g()
        tyield += yield_g * 1e9
        yield_p = summary.at(read_index).at(lane_index).projected_yield_g()
        pyield += yield_p * 1e9
print(tyield)
print(pyield)

nudpa commented 4 years ago

@yelekley %Q30 is stored as the aggregate across all tiles in the lane directly (in other words, we already calculate a weighted average across tiles when you call .percent_gt_q30()), so it should work to just remove the .mean() from your expression above.
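
If one loop has to cope with both plain floats and objects that expose a mean() method, a small helper can hide the difference. This is only a sketch: as_scalar and the FakeStat stand-in are hypothetical names, and it assumes that some summary accessors return such an object, which this thread does not confirm.

```python
def as_scalar(value):
    """Return a float from a plain number or from an object with mean()."""
    mean = getattr(value, "mean", None)
    if callable(mean):
        # Statistic-like object: use its mean.
        return float(mean())
    # Already a plain number, as percent_gt_q30() is reported to be here.
    return float(value)


class FakeStat:
    """Hypothetical stand-in for a statistic object, for illustration only."""

    def __init__(self, mean_value):
        self._mean_value = mean_value

    def mean(self):
        return self._mean_value


from_float = as_scalar(91.5)
from_stat = as_scalar(FakeStat(91.5))
```

Both calls give the same number, so the expression dividing by 100 would work in either case.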

As for yield vs. projected yield: yield represents the estimated number of non-N bases that have been processed so far, while projected yield represents the expected number of non-N bases by the end of the run. If a run completes successfully, both should converge to the same value. If the run is still in progress, yield will be less than projected yield.
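
The convergence described above suggests a simple check on whether a run looks finished. This is a rough sketch, not an InterOp feature: run_looks_complete is a hypothetical helper and the 1% relative tolerance is an arbitrary assumption.

```python
import math


def run_looks_complete(yield_g, projected_yield_g, rel_tol=0.01):
    """Return True when yield has converged to the projected yield.

    Both arguments are in gigabases. rel_tol is an assumed tolerance, chosen
    for illustration only.
    """
    return math.isclose(yield_g, projected_yield_g, rel_tol=rel_tol)


# Made-up values: a finished run and a run that is still in progress.
finished = run_looks_complete(100.0, 100.2)
in_progress = run_looks_complete(60.0, 100.0)
```

If the check fails for a run that is known to be complete, the gap between the two values is worth investigating before comparing either one against BustardSummary.xml.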