Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

`summary(run_metrics, 'Lane')` : Cluster vs Read count #267

Closed sklages closed 3 years ago

sklages commented 3 years ago
```python
import pandas as pd
import interop as iop

run_folder = '/path/to/run_folder'  # S1 flowcell, SR100

run_metrics = iop.read(run_folder)

ar = iop.summary(run_metrics, 'Lane')
df = pd.DataFrame(ar)
df[df.ReadNumber == 1]
```

I have a question regarding:

Cluster Count            float32
Cluster Count Pf         float32
Reads                    float32
Reads Pf                 float32

The above snippet gives me the following values (lane 1 / read 1 as an example):

Cluster Count            4091904.0
Cluster Count Pf         3447110.5
Reads                    1.276674e+09
Reads Pf                 1.075498e+09

Where do the Cluster values come from? I thought the actual "cluster count" equals the "raw read count", so I'd expect Cluster Count Pf == Reads Pf. Where am I wrong?

ezralanglois commented 3 years ago

We have to be careful with Reads. In some versions of InterOp, it is scaled by the number of reads in the run (generally 2x the cluster count). We fixed this issue in recent versions of InterOp, but if you go back far enough, you will see that extra scaling.

Also, the summary logic sums over all the tiles in a lane. The S1 flowcell has 312 tiles per lane.

If your Cluster Count above comes from a single row in the imaging table, then it corresponds to a single tile, and 312 * Cluster Count should equal Reads.
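As a quick sanity check, the numbers from the original post can be plugged into that relation. This is only a sketch of the arithmetic, not InterOp code: the constant and helper names (`TILES_PER_LANE`, `consistent`) are made up for illustration, and the tolerance is an assumption chosen because the posted Reads values are rounded to seven significant digits.

```python
import math

# Values copied from the original post (lane 1, read 1).
cluster_count = 4091904.0
cluster_count_pf = 3447110.5
reads = 1.276674e+09
reads_pf = 1.075498e+09

# Tiles per lane on an S1 flowcell, as stated in the reply above.
TILES_PER_LANE = 312


def consistent(per_tile_value, lane_total, tiles=TILES_PER_LANE, rel_tol=1e-6):
    """Return True if tiles * per_tile_value matches lane_total.

    rel_tol is an assumed tolerance that allows for the rounding
    in the posted lane totals.
    """
    return math.isclose(tiles * per_tile_value, lane_total, rel_tol=rel_tol)


scaled_cluster_count = TILES_PER_LANE * cluster_count        # 1276674048.0
scaled_cluster_count_pf = TILES_PER_LANE * cluster_count_pf  # 1075498476.0

print(scaled_cluster_count, consistent(cluster_count, reads))
print(scaled_cluster_count_pf, consistent(cluster_count_pf, reads_pf))
```

Both checks pass for the posted values, so the figures are at least consistent with Cluster Count being a per-tile value and Reads being the total over 312 tiles. The check says nothing about which table a value was actually read from.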

sklages commented 3 years ago

ha, okay. Thanks for the info, now the numbers make sense :-)

sklages commented 3 years ago

@ezralanglois .. just to avoid confusion:

> If your Cluster Count above comes from a single row in the imaging table

It is the result of `summary(run_metrics, 'Lane')` for the first read (1).