Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

InterOp-summary not returning a properly formatted csv #337

Closed nick-youngblut closed 3 months ago

nick-youngblut commented 6 months ago

summary --csv=1 produces a file that is CSV-like but includes two extra rows before the table. They seem unnecessary, especially since they make the CSV non-standard:

# Version: vX.X.X
InterOp
[the actual table]

ezralanglois commented 6 months ago

CSV parsers tend to be fairly flexible: most support skipping leading rows and ignoring lines that start with a comment character.

For example: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

Plus, there are many tools on both Linux and Windows that can remove these additional lines, for example sed and awk on Linux and Get-Content on Windows.
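
As a rough illustration of the "skip the extra lines" approach, here is a minimal sketch using only the Python standard library. It assumes the layout reported in this issue: a "#" comment line, one line holding the run name, then a comma-separated table that ends at the first line without a comma. The function name read_first_table and the embedded sample are illustrative and not part of InterOp.

```python
import csv
import io

# Sample in the layout reported in this issue (assumed, not a format spec).
SAMPLE = """\
# Version: v1.3.1
101010_M99999_0000_000000000-LBLRG
Level,Yield,Projected Yield,Aligned,Error Rate,Intensity C1,%>=Q30,% Occupied
Read 1,0.43,0.43,0.00,nan,400,98.37,nan
Total,0.59,0.59,0.00,nan,764,94.78,nan

Read 1
"""


def read_first_table(handle):
    """Return the first comma-separated table as a list of dicts."""
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("#"):
            continue  # comment line such as "# Version: ..."
        if "," not in line:
            if rows:
                break  # a blank line or the next title ends the table
            continue  # preamble line such as the run name
        rows.append(line)
    # The first collected line is the header; the rest are data rows.
    return list(csv.DictReader(rows))


table = read_first_table(io.StringIO(SAMPLE))
print(table[-1]["Level"], table[-1]["Yield"])
```

All values are returned as strings, so entries such as nan still need converting by the caller.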

nick-youngblut commented 6 months ago

@ezralanglois parsing the following "csv" created via --csv=1 is not trivial:

# Version: v1.3.1
101010_M99999_0000_000000000-LBLRG
Level,Yield,Projected Yield,Aligned,Error Rate,Intensity C1,%>=Q30,% Occupied
Read 1,0.43,0.43,0.00,nan,400,98.37,nan
Read 2 (I),0.16,0.16,0.00,nan,1128,85.04,nan
Non-indexed,0.43,0.43,0.00,nan,400,98.37,nan
Total,0.59,0.59,0.00,nan,764,94.78,nan

Read 1
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),% Occupied,Intensity C1
1,-,38,1000 +/- 28,94.26 +/- 1.12,0.215 / 0.004,nan / nan,nan / nan,24.12,22.73,98.37,0.43,0,0.00 +/- 0.00,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,400 +/- 38
1,1,19,1016 +/- 26,93.75 +/- 1.00,0.216 / 0.002,nan / nan,nan / nan,12.24,11.47,98.30,0.22,-,0.00 +/- 0.00,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,434 +/- 21
1,2,19,983 +/- 19,94.78 +/- 1.00,0.213 / 0.005,nan / nan,nan / nan,11.88,11.26,98.44,0.21,-,0.00 +/- 0.00,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,366 +/- 8
Read 2 (I)
Lane,Surface,Tiles,Density,Cluster PF,Legacy Phasing/Prephasing Rate,Phasing  slope/offset,Prephasing slope/offset,Reads,Reads PF,%>=Q30,Yield,Cycles Error,Aligned,Error,Error (35),Error (75),Error (100),% Occupied,Intensity C1
1,-,38,1000 +/- 28,94.26 +/- 1.12,0.000 / 0.000,nan / nan,nan / nan,24.12,22.73,85.04,0.16,0,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,1128 +/- 121
1,1,19,1016 +/- 26,93.75 +/- 1.00,nan / nan,nan / nan,nan / nan,12.24,11.47,85.23,0.08,-,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,1242 +/- 37
1,2,19,983 +/- 19,94.78 +/- 1.00,nan / nan,nan / nan,nan / nan,11.88,11.26,84.85,0.08,-,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,nan +/- nan,1013 +/- 33
Extracted: 28
Called: 28
Scored: 28

It would be helpful to at least know all of the possible "section" titles, such as Read 1 and Read 2 (I).

ezralanglois commented 6 months ago

The section titles depend on the read structure in RunInfo.xml. For example, if Read 2 is an index read, its title has (I) appended; the title of a genomic read does not.
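
Because the titles vary with the run, one option is to detect sections structurally instead of matching known names. The sketch below is an assumption based only on the output pasted in this issue: it treats any non-empty line without a comma as a title and every line with a comma as a table row. The function name split_sections is hypothetical, and the sample is abbreviated to the first few columns.

```python
import csv

# Abbreviated sample based on the output pasted in this issue.
SAMPLE = """\
# Version: v1.3.1
101010_M99999_0000_000000000-LBLRG
Level,Yield,Projected Yield
Read 1,0.43,0.43
Read 2 (I),0.16,0.16

Read 1
Lane,Surface,Tiles,Density
1,-,38,1000 +/- 28
1,1,19,1016 +/- 26
Read 2 (I)
Lane,Surface,Tiles,Density
1,-,38,1000 +/- 28
Extracted: 28
"""


def split_sections(lines):
    """Group table rows under the most recent title line.

    Assumes titles contain no commas and are unique within the file.
    """
    raw_sections = {}
    title = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# Version" comments
        if "," in line:
            raw_sections.setdefault(title, []).append(line)
        else:
            title = line  # e.g. the run name, "Read 1", "Read 2 (I)"
    # The first row of each section is its header.
    return {t: list(csv.DictReader(rows)) for t, rows in raw_sections.items()}


sections = split_sections(SAMPLE.splitlines())
print(list(sections))
print(sections["Read 1"][1]["Density"])
```

In this sketch the run-level table ends up keyed by the run name, and trailing lines such as "Extracted: 28" are dropped because no rows follow them. Cells like "1000 +/- 28" stay as strings and would need splitting separately.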

Alternatively, you can generate your own custom CSV with Python like so:

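import sys

import interop.core as ic
import pandas as pd

# Command-line arguments: the run folder path, then the output CSV path
run_folder = sys.argv[1]
output_csv_file = sys.argv[2]

# Summarize the run's InterOp metrics and write them out as a CSV file
df = pd.DataFrame(ic.summary(run_folder))
df.to_csv(output_csv_file)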

The different levels of summary and other options are described here: https://github.com/Illumina/interop/blob/520e8ab8a5a3f3d5fa44ab4d32643fb7c6da0b30/src/ext/python/core.py#L217