Illumina / interop

C++ Library to parse Illumina InterOp files
http://illumina.github.io/interop/index.html
GNU General Public License v3.0

interop imaging table header doesn't match data #257

Closed matthdsm closed 3 years ago

matthdsm commented 3 years ago

Hi,

I'm trying to parse the output from the imaging table application, but the output header fields don't seem to match the data.

Example:

# Version: v3.0.35-src
# Run Folder: 210311_A00785_0173_BHVYTNDSXY
Lane,Tile,Cycle,Read,Cycle Within Read,Density(k/mm2),Density Pf(k/mm2),Cluster Count (k),Cluster Count Pf (k),% Pass Filter,% Aligned,Legacy Phasing Rate,Legacy Prephasing Rate,Error Rate,%>= Q20,%>= Q30,P90<RED;GREEN>,% No Calls,% Base<A;C;G;T>,Fwhm<RED;GREEN>,Corrected<A;C;G;T>,Called<A;C;G;T>,Signal To Noise,Phasing Weight,Prephasing Weight,Phasing Slope,Phasing Offset,Prephasing Slope,Prephasing Offset,Minimum Contrast<RED;GREEN>,Maximum Contrast<RED;GREEN>,Surface,Swath,Tile Number,Cluster Count Occupied (k),% Occupied
1,1101,1,1,1,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,98.15000153,94.79000092,1510,1143,0,32.70000076,21.70000076,25.10000038,20.5,1.649999976,1.460000038,nan,nan,nan,nan,nan,nan,nan,nan,nan,1.25,0.75,0.07599999756,1.098999977,0.05999999866,0.6719999909,231,213,648,478,1,1,1,3622.5,88.5
1,1101,2,1,2,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,98.47000122,95.47000122,1480,1216,0,28.79999924,22.10000038,22.39999962,26.70000076,1.649999976,1.419999957,nan,nan,nan,nan,nan,nan,nan,nan,nan,1.5,0.5,0.07599999756,1.098999977,0.05999999866,0.6719999909,218,214,611,497,1,1,1,3622.5,88.5
1,1101,3,1,3,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,98.63999939,95.83999634,1444,1240,0,27.10000038,23.39999962,22.20000076,27.29999924,1.629999995,1.429999948,nan,nan,nan,nan,nan,nan,nan,nan,nan,2.5,1.5,0.07599999756,1.098999977,0.05999999866,0.6719999909,215,216,597,500,1,1,1,3622.5,88.5
1,1101,4,1,4,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,98.83999634,96.29000092,1432,1244,0,26.60000038,22.39999962,22.5,28.5,1.639999986,1.419999957,nan,nan,nan,nan,nan,nan,nan,nan,nan,1.25,0.75,0.07599999756,1.098999977,0.05999999866,0.6719999909,213,215,592,501,1,1,1,3622.5,88.5
1,1101,5,1,5,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,98.80000305,96.19999695,1417,1234,0,26.20000076,23.29999924,23,27.5,1.639999986,1.440000057,nan,nan,nan,nan,nan,nan,nan,nan,nan,1,0.5,0.07599999756,1.098999977,0.05999999866,0.6719999909,212,213,590,498,1,1,1,3622.5,88.5
1,1101,6,1,6,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,99.08999634,96.48000336,1420,1226,0,26.60000038,23.39999962,23,27.10000038,1.639999986,1.450000048,nan,nan,nan,nan,nan,nan,nan,nan,nan,2,1,0.07599999756,1.098999977,0.05999999866,0.6719999909,212,212,584,497,1,1,1,3622.5,88.5
1,1101,7,1,7,2961.300049,2487,4091.899902,3436.600098,84,nan,0.06499999762,0.0719999969,nan,99.02999878,96.43000031,1411,1224,0,27.20000076,22.89999962,23.39999962,26.39999962,1.639999986,1.429999948,nan,nan,nan,nan,nan,nan,nan,nan,nan,1.5,0.5,0.07599999756,1.098999977,0.05999999866,0.6719999909,212,212,582,495,1,1,1,3622.5,88.5

It seems to me that there are more data columns than header fields.

Any idea what's going wrong? We're using the latest version of the interop executables, compiled and installed by conda.

Thanks, Matthias

ezralanglois commented 3 years ago

Hi Matthias, you need to split the header by ',' and by ';'. For example, P90<RED;GREEN> is two columns, not one.
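A quick way to confirm this is to count the fields. The sketch below uses the header and the first data row quoted in the original post; `expanded_width` is a hypothetical helper name, not part of the interop tools. It assumes that every grouped field ends in `<...>` with `;`-separated sub-headers.

```python
import re

# Header line and first data row, copied from the output quoted above.
HEADER = (
    "Lane,Tile,Cycle,Read,Cycle Within Read,Density(k/mm2),Density Pf(k/mm2),"
    "Cluster Count (k),Cluster Count Pf (k),% Pass Filter,% Aligned,"
    "Legacy Phasing Rate,Legacy Prephasing Rate,Error Rate,%>= Q20,%>= Q30,"
    "P90<RED;GREEN>,% No Calls,% Base<A;C;G;T>,Fwhm<RED;GREEN>,"
    "Corrected<A;C;G;T>,Called<A;C;G;T>,Signal To Noise,Phasing Weight,"
    "Prephasing Weight,Phasing Slope,Phasing Offset,Prephasing Slope,"
    "Prephasing Offset,Minimum Contrast<RED;GREEN>,Maximum Contrast<RED;GREEN>,"
    "Surface,Swath,Tile Number,Cluster Count Occupied (k),% Occupied"
)
ROW = (
    "1,1101,1,1,1,2961.300049,2487,4091.899902,3436.600098,84,nan,"
    "0.06499999762,0.0719999969,nan,98.15000153,94.79000092,1510,1143,0,"
    "32.70000076,21.70000076,25.10000038,20.5,1.649999976,1.460000038,"
    "nan,nan,nan,nan,nan,nan,nan,nan,nan,1.25,0.75,0.07599999756,"
    "1.098999977,0.05999999866,0.6719999909,231,213,648,478,1,1,1,3622.5,88.5"
)


def expanded_width(field):
    """Return how many data columns one header field covers."""
    match = re.search(r"<([^<>]*)>$", field)
    return len(match.group(1).split(";")) if match else 1


header_fields = HEADER.split(",")
data_values = ROW.split(",")
total = sum(expanded_width(field) for field in header_fields)

# Comma-split header fields, expanded column count, data values in the row
print(len(header_fields), total, len(data_values))
```

Once the `<...>` groups are expanded, the header accounts for exactly as many columns as the data row contains.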

matthdsm commented 3 years ago

Hi Robert, Thanks for the reply. So you mean the header has other delimiters than the data? Isn't that confusing? I'd get it if the corresponding data was also ; delimited, but I think this creates a weird situation.

Matthias

sklages commented 3 years ago

@matthdsm .. this is only in the header ..

@ezralanglois yes, that's what I do when reading the output into a dataframe. But this leads to non-unique columns:

% Base<A;C;G;T>
Corrected<A;C;G;T>
Called<A;C;G;T>

.. all result in columns like C, G and T>.

As a feature request: can this be changed to %A,%C,%G,%T and corA,corC,corG,corT and calA,calC,calG,calT or something similar?

ezralanglois commented 3 years ago

@sklages We could. Is there any reason why you want to parse this output rather than use the Python binding directly?

Originally, we had it like so: % Base A, % Base C, % Base G, % Base T. We put the <> around the sub-columns so you would know whether a column contained a sub-header, rather than having to use something like if c.startswith('% Base'): get_sub_column(c), or try to find some arbitrary shared header.

As a workaround for now, you can do the following:

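def parse_header(col):
    # Plain column without a sub-header: return it as a one-element list
    if col.find('<') == -1: return [col]
    header, subheader = col.split('<')
    # Here you can denote a header/subheader combination as you want, I used a '/'
    return [header + "/" + h.replace('>', '') for h in subheader.split(';')]

with open(filename, 'r') as fin:
    # Skip the leading '#' comment lines to reach the header row
    line = next(l for l in fin if not l.startswith('#'))
    header = [name for c in line.rstrip('\n').split(',') for name in parse_header(c)]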

Then with pandas you can just skip the header line (and the '#' comment lines above it) and pass in the column names you need.
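The workaround can be sketched end to end with pandas as follows. This is only an illustration: `expand_header` and `read_imaging_table` are hypothetical helper names, and `SAMPLE` is a trimmed excerpt of the output quoted in the original post (a few columns, two rows) so that the example stays short. It assumes the header is the first line that does not start with `#`.

```python
import io

import pandas as pd

# Trimmed excerpt of the imaging_table output quoted earlier in this thread.
SAMPLE = """\
# Version: v3.0.35-src
# Run Folder: 210311_A00785_0173_BHVYTNDSXY
Lane,Tile,Cycle,P90<RED;GREEN>,% Base<A;C;G;T>,% Occupied
1,1101,1,1510,1143,32.70000076,21.70000076,25.10000038,20.5,88.5
1,1101,2,1480,1216,28.79999924,22.10000038,22.39999962,26.70000076,88.5
"""


def expand_header(field):
    """Expand 'Name<A;B>' into ['Name/A', 'Name/B']; plain fields pass through."""
    if not field.endswith(">") or "<" not in field:
        return [field]
    name, subs = field[:-1].split("<", 1)
    return [f"{name}/{sub}" for sub in subs.split(";")]


def read_imaging_table(text):
    """Build a DataFrame with one unique column name per data column."""
    lines = text.splitlines()
    # The header is the first line that is not a '#' comment.
    header_idx = next(i for i, line in enumerate(lines) if not line.startswith("#"))
    columns = [
        name
        for field in lines[header_idx].split(",")
        for name in expand_header(field)
    ]
    body = "\n".join(lines[header_idx + 1:])
    return pd.read_csv(io.StringIO(body), header=None, names=columns)


df = read_imaging_table(SAMPLE)
print(df[["Cycle", "% Base/A", "% Occupied"]])
```

Prefixing each sub-header with its parent name keeps columns such as `% Base/A`, `Corrected/A` and `Called/A` distinct, which avoids the duplicate `C`, `G`, `T>` names that a plain split on `;` produces.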

We can leave this issue open as a feature request if you really need a new header. We really put this in for people that use languages that are not covered, e.g. Matlab or Perl.

sklages commented 3 years ago

@ezralanglois Thanks for your quick and informative response.

I used imaging_table and summary with Perl - before that I parsed the InterOp files directly, also in Perl :-)

Now I am getting myself familiar with Python .. well, the docs of the Python bindings are - umm - quite "complicated", hard to read / work with for someone who is not an expert (yet). Some things are not possible out of the box - see #256 ..

I read the output of imaging_table directly from Python, fix some column names and create a pandas dataframe with only the dozen or so columns I am interested in (via io.StringIO). That works quite well and is not much code.

I just create two or three plots from that data, including %PF vs %Occ for NovaSeq runs.

For other parts I will definitely try/use the python bindings..

Just my 2p, but it took me some time to spot the ; among ~50 columns of tabular data as the explanation for my initial problems parsing the output (I simply didn't expect it). Dealing with that always requires extra work ("fixing" the headers). My suggestion would be to provide a "real" comma-separated header, producing standard CSV output without surprises ;-)

Nevertheless ... nice piece of software, thanks for that :-)

matthdsm commented 3 years ago

Solved using python API, thanks!