Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

What files/directories are actually needed as input? #333

Closed: nick-youngblut closed this issue 7 months ago

nick-youngblut commented 7 months ago

Most of the docs/tutorials demonstrate specifying the entire run folder as input.

However, for pipelines such as Nextflow, it is much more efficient to specify only the input files that are actually required.

Given that the package is labeled "interop", one would assume that only the InterOp directory (or just the main .bin files in that directory, such as SummaryRunMetricsOut.bin) is needed, but it is not clear how to read just the InterOp directory or specific .bin files with the Python wrapper.

For instance, if I use the summary.py example (after updating to Python 3), I get:

2023-12-09 21:06:24,486 - Skipping - cannot read RunInfo.xml:  - No format found to parse ErrorMetricsOut.bin with version: 6 of 3
/io/./interop/io/metric_stream.h::read_metrics (111)

However, the input directory that I specified contains:

|-- InterOp
`-- RunInfo.xml

The RunInfo.xml is present, and all *.bin files are in the InterOp directory:

InterOp/AlignmentMetricsOut.bin
InterOp/BasecallingMetricsOut.bin
InterOp/CorrectedIntMetricsOut.bin
InterOp/EmpiricalPhasingMetricsOut.bin
InterOp/ErrorMetricsOut.bin
InterOp/EventMetricsOut.bin
InterOp/ExtendedTileMetricsOut.bin
InterOp/ExtractionMetricsOut.bin
InterOp/FWHMGridMetricsOut.bin
InterOp/ImageMetricsOut.bin
InterOp/InsertSizeMetricsOut.bin
InterOp/OpticalMetricsOut.bin
InterOp/OpticalModelMetricsOut.bin
InterOp/PFGridMetricsOut.bin
InterOp/QMetrics2030Out.bin
InterOp/QMetricsByLaneOut.bin
InterOp/QMetricsOut.bin
InterOp/RawFWHMGridMetricsOut.bin
InterOp/ReconstructionMetricsOut.bin
InterOp/SummaryRunMetricsOut.bin
InterOp/SweepMetricsOut.bin
InterOp/TileMetricsOut.bin

nick-youngblut commented 7 months ago

Moreover, summary.py generates nothing for MiSeq runs. If I add:

        print(f"Summary size: {summary.size()}")
        print(f"Summary lane count: {summary.lane_count()}")
        print(f"Summary surface count: {summary.surface_count()}")

I get:

Summary size: 0
Summary lane count: 0
Summary surface count: 0

The MiSeq run folder that I'm using contains all output files for a successful MiSeq run.

The RunInfo.xml file shows that the counts should be 1 and not 0:

<?xml version="1.0"?>
<RunInfo xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" Version="2">
  <Run Id="XXX" Number="45">
    <Flowcell>XXX</Flowcell>
    <Instrument>XXX</Instrument>
    <Date>231205</Date>
    <Reads>
      <Read NumCycles="151" Number="1" IsIndexedRead="N" />
      <Read NumCycles="151" Number="2" IsIndexedRead="N" />
    </Reads>
    <FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
  </Run>
</RunInfo>
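For comparison with the zero counts above, here is a minimal sketch that reads the expected counts straight from RunInfo.xml using only the Python standard library. It does not use the interop package. The function name `read_flowcell_layout` is hypothetical, and the sketch assumes the `Run/FlowcellLayout` element layout shown in the file above.

```python
import xml.etree.ElementTree as ET


def read_flowcell_layout(run_info_xml: str) -> dict:
    """Return the *Count attributes of <FlowcellLayout> as integers.

    Hypothetical helper: assumes the RunInfo.xml layout shown in this thread.
    """
    root = ET.fromstring(run_info_xml)
    layout = root.find("./Run/FlowcellLayout")
    if layout is None:
        raise ValueError("RunInfo.xml has no Run/FlowcellLayout element")
    # Only convert attributes named like LaneCount, SurfaceCount, and so on.
    return {
        name: int(value)
        for name, value in layout.attrib.items()
        if name.endswith("Count")
    }


# Trimmed copy of the RunInfo.xml posted above.
example = """<?xml version="1.0"?>
<RunInfo Version="2">
  <Run Id="XXX" Number="45">
    <FlowcellLayout LaneCount="1" SurfaceCount="1" SwathCount="1" TileCount="2" />
  </Run>
</RunInfo>"""

layout = read_flowcell_layout(example)
print(layout["LaneCount"], layout["SurfaceCount"])
```

If these values are non-zero while the summary reports zero, the run folder itself is probably fine and the metrics were simply not loaded.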

ezralanglois commented 7 months ago

Addressing the first concern: the error message reported by the Python code linked below is incorrect. It should have said just "Skipping" and not "Skipping - cannot read RunInfo.xml".

The second part of the error is the important bit: "No format found to parse ErrorMetricsOut.bin with version: 6 of 3".

This means that you are trying to parse version 6 of ErrorMetricsOut.bin with a version of the interop library that supports only up to version 3. Upgrading the interop library will address this issue.

https://github.com/Illumina/interop/blob/b3a1089759f3a6b3dd437eb147f75a7ffc1db9b6/src/examples/python/summary.py#L34-L37
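To see which format version each file in a run actually uses, a small standard-library sketch like the following may help. It assumes that the first byte of each InterOp .bin file holds the format version, which is how the InterOp binary format documentation describes the header. The function name `interop_format_versions` is hypothetical and not part of the interop package.

```python
from pathlib import Path


def interop_format_versions(interop_dir):
    """Map each *.bin file name in interop_dir to its format version.

    Assumption: byte 0 of an InterOp file is the format version.
    """
    versions = {}
    for path in sorted(Path(interop_dir).glob("*.bin")):
        with path.open("rb") as handle:
            first = handle.read(1)
        # An empty file has no header; record None rather than guess.
        versions[path.name] = first[0] if first else None
    return versions
```

Calling `interop_format_versions("InterOp")` on the run folder from this issue would be expected to report 6 for ErrorMetricsOut.bin, matching the "version: 6 of 3" part of the error.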

ezralanglois commented 7 months ago

The second issue sounds like a bug. It may be specific to the older version of interop you are using (see the previous issue), or it may still be present in the library. I will need to investigate this.

ezralanglois commented 7 months ago

I cannot reproduce this issue with a local MiSeq run and the latest version of the library.

nick-youngblut commented 7 months ago

> Upgrading the interop library will address this issue.

$ pip install interop==1.3.0
ERROR: Could not find a version that satisfies the requirement interop==1.3.0 (from versions: 1.1.18, 1.1.19, 1.1.21, 1.1.22, 1.1.23)

I'm using Ubuntu 22.04 and Python 3.9.19. My Python environment:

Package    Version
---------- -------
numpy      1.26.2
pip        23.3.1
setuptools 68.2.2
wheel      0.41.3

Based on the setup.py.in file, it seems like version 1.3.0 should be compatible with my environment.

Note: Installing interop v1.3.0 via bioconda doesn't install the Python package.
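A quick way to confirm which interop release (if any) is visible to the current interpreter is sketched below, using `importlib.metadata` from the standard library. The helper names are hypothetical. The 1.3.0 threshold is only the release attempted above, not a confirmed minimum for ErrorMetricsOut.bin version 6.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution="interop"):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


def version_tuple(text):
    """Turn '1.1.23' into (1, 1, 23); only the first three numbers are used."""
    return tuple(int(number) for number in re.findall(r"\d+", text)[:3])


found = installed_version()
if found is None:
    print("interop is not installed in this environment")
else:
    # 1.3.0 is the release tried in this thread, used here as an assumption.
    print(f"interop {found}; at least 1.3.0: {version_tuple(found) >= (1, 3, 0)}")
```

This is useful in the bioconda case in particular, where the command-line tools may be present while the Python package is not.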

ezralanglois commented 7 months ago

Looks like there is a bug when building the Python 3.9 wheel for manylinux. That wheel is missing from PyPI.

I will have a PR out to fix that.

As for bioconda, we don't support that and I don't know much about it.

As for the list of files, the InterOp files listed on this site plus the RunInfo.xml are required.

You can load individual files, but we don't document that route and I don't recommend it.
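For pipeline tools such as Nextflow, one way to act on the file list above is to stage a reduced run folder that contains only RunInfo.xml and the InterOp .bin files, then pass that folder to the library as usual. The following is a minimal sketch under that assumption. The function name `stage_minimal_run_folder` is hypothetical, and it copies files rather than symlinking to stay portable.

```python
import shutil
from pathlib import Path


def stage_minimal_run_folder(run_folder, staged_folder):
    """Copy RunInfo.xml and InterOp/*.bin into staged_folder.

    Returns the relative paths that were copied, in order.
    """
    source = Path(run_folder)
    target = Path(staged_folder)

    run_info = source / "RunInfo.xml"
    if not run_info.is_file():
        raise FileNotFoundError(f"missing {run_info}")

    # Recreate the layout the library expects: RunInfo.xml next to InterOp/.
    (target / "InterOp").mkdir(parents=True, exist_ok=True)
    shutil.copy2(run_info, target / "RunInfo.xml")
    copied = ["RunInfo.xml"]

    for bin_file in sorted((source / "InterOp").glob("*.bin")):
        shutil.copy2(bin_file, target / "InterOp" / bin_file.name)
        copied.append(f"InterOp/{bin_file.name}")
    return copied
```

The staged folder keeps the same relative layout as a full run folder, so code that expects a run folder path should not need to change. Whether a given analysis needs additional files beyond these is not covered by this sketch.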