Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

Interop reads PF does not match demux output #250

Closed · rschmieder closed this issue 3 years ago

rschmieder commented 3 years ago

Using the python library, I can use the following to parse the number of PF (passing filter) reads from the InterOp files:

from interop import py_interop_run_metrics, py_interop_summary

# Load the InterOp files for the run and build the run summary.
run_metrics = py_interop_run_metrics.run_metrics()
run_metrics.read(source_dir)
summary = py_interop_summary.run_summary()
py_interop_summary.summarize_run_metrics(run_metrics, summary)

# Sum the PF reads of the first read (readNo = 0) across all lanes.
lanes = summary.lane_count()
readNo = 0
total_reads_pf = 0
for lane in range(lanes):
    read = summary.at(readNo).at(lane)
    total_reads_pf += read.reads_pf()

The resulting number for total_reads_pf does not match the number of reads in the Demultiplex_Stats.csv output file. Sometimes it's higher, sometimes it's lower. The overall difference is small.

Is this expected because the value is stored as a float (based on http://illumina.github.io/interop/classillumina_1_1interop_1_1model_1_1summary_1_1stat__summary.html) and the difference is caused by rounding errors?
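For reference, here is a quick check (using numpy, which is not part of the interop library) of how coarse float32 precision is at these magnitudes; the count below is an illustrative value of about 1.4 billion reads:

```python
import numpy as np

# An illustrative PF read count near 1.4 billion, around the
# magnitude of a full run.
exact_count = 1_403_945_256

# Stored as a 32-bit float, the count is rounded to the nearest
# representable value; near 1.4e9 adjacent float32 values are 128 apart.
as_float32 = np.float32(exact_count)

print(float(as_float32))              # 1403945216.0 (40 reads lost to rounding)
print(float(np.spacing(as_float32)))  # 128.0 (gap to the next float32 value)
```

So any float32 aggregate at this scale can silently drift by up to ~64 reads per rounding step, which is consistent with a small mismatch in either direction.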

ezralanglois commented 3 years ago

Have you checked the PF cluster count? I would expect that to match.

The percentage could be off due to rounding errors.

rschmieder commented 3 years ago

The value for reads_pf is a number, not a percentage. Do you have a code example on how to get the PF cluster count you suggested? I only see cluster_count_pf() for read and that is of type metric_stat, which doesn't seem to provide a count or number (only mean, median, stddev).

ezralanglois commented 3 years ago

Ah, apologies. I had it in my head that that was the percentage, not the count (we should really have that in the name). The count should match. Let me look into it.

ezralanglois commented 3 years ago

Ok, yes, you are correct. metric_stat just provides the mean per lane (and other, less useful stats), so summing it up only gives you the sum of the per-lane means, which won't be the same as the total cluster count.
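To see why summing per-lane means is not the same as summing every tile, here is a toy example with made-up per-tile counts:

```python
# Hypothetical per-lane, per-tile PF cluster counts (made-up numbers).
lanes = {
    1: [100.0, 110.0, 120.0],  # tiles in lane 1
    2: [90.0, 95.0, 100.0],    # tiles in lane 2
}

# Summing the per-lane means (what the metric_stat-based loop effectively does):
sum_of_means = sum(sum(tiles) / len(tiles) for tiles in lanes.values())

# Summing every tile directly (what the demux output reflects):
total = sum(sum(tiles) for tiles in lanes.values())

print(sum_of_means)  # 205.0
print(total)         # 615.0
```

The two quantities only coincide in the degenerate case of one tile per lane.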

Here is one way to get the PF cluster count you are looking for:

from interop import py_interop_run_metrics

# Load the InterOp files for the run.
run_metrics = py_interop_run_metrics.run_metrics()
run_metrics.read(source_dir)

# Sum the PF cluster count over every tile in the tile metrics.
pf_cluster_count = 0
tile_metric_set = run_metrics.tile_metric_set()
for i in range(tile_metric_set.size()):
    pf_cluster_count += tile_metric_set.at(i).cluster_count_pf()
print(pf_cluster_count)

There are some other ways, but they require considerably more code and are only interesting if you want to get many metrics.

rschmieder commented 3 years ago

Thank you, that example generates numbers matching the demux output. Any idea why reads_pf generates a different value? Are there other metrics that should be parsed using tile_metric_set instead of run_summary?

ezralanglois commented 3 years ago

Sorry, I have not dug around in this part of the code base for a while. Those values should match.

This is a bug. I just reproduced it for a NextSeq2k run. We can fix this.

1403945256.0
1403945152.0

rschmieder commented 3 years ago

👍