Closed matthdsm closed 3 years ago
Hi Matthias,
You need to split by ',' and by ';'. For example, P90<RED;GREEN> is two columns, not one.
Hi Robert,
Thanks for the reply.
So you mean the header has other delimiters than the data? Isn't that confusing?
I'd get it if the corresponding data were also ';'-delimited, but I think this creates a weird situation.
Matthias
@matthdsm .. this is only in the header ..
@ezralanglois yes, that's what I do when reading the output into a dataframe. But this leads to non-unique columns:
% Base<A;C;G;T>
Corrected<A;C;G;T>
Called<A;C;G;T>
.. all result in columns like C, G and T>.
As a feature request: can this be changed to %A,%C,%G,%T and corA,corC,corG,corT and calA,calC,calG,calT or something similar?
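The name collision is easy to reproduce. A minimal sketch, assuming a made-up header line in the style quoted above (the Lane and Tile fields are placeholders, not taken from real output):

```python
from collections import Counter
import re

# Hypothetical header line in the style quoted above (not real output).
header_line = "Lane,Tile,% Base<A;C;G;T>,Corrected<A;C;G;T>,Called<A;C;G;T>"

# Naive approach: split on both ',' and ';'.
naive = re.split(r"[,;]", header_line)

# Names that occur more than once cannot serve as unique column labels.
duplicates = sorted(name for name, n in Counter(naive).items() if n > 1)
print(duplicates)
```

Only the first sub-column of each group keeps its prefix, so the remaining ones collapse onto the same three names.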
@sklages We could. Is there any reason why you want to parse this output rather than use the Python binding directly?
Originally, we had it like so: % Base A, % Base C, % Base G, % Base T. We put the <> around it so you would know whether it contained a sub-header, rather than having to use something like if c.startswith('% Base'): get_sub_column(c), or try to find some arbitrary shared header.
As a workaround for now, you can do the following:
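    def parse_header(col):
        """Expand 'Name<A;B>' into ['Name/A', 'Name/B']; a plain column becomes ['Name']."""
        col = col.strip()
        if '<' not in col:
            return [col]
        header, subheader = col.split('<')
        # Here you can denote a header/subheader combination as you want, I used a '/'
        return [header + "/" + h for h in subheader.rstrip('>').split(';')]

    with open(filename, 'r') as fin:
        # Flatten the per-column lists into a single list of column names
        header = [name for c in fin.readline().split(',') for name in parse_header(c)]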
Then with pandas you can just skip the first line and pass in the columns you need.
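A rough sketch of that last step, written with only the standard library so it runs without extra dependencies. The sample text and the helper name expand_header are made up for illustration, and it assumes the header is the first line. With pandas, the equivalent would presumably be to pass the expanded names to read_csv via names= together with skiprows=1.

```python
import csv
import io

def expand_header(line):
    """Expand 'Name<A;B>' fields into 'Name/A', 'Name/B'; keep plain fields."""
    names = []
    for col in line.strip().split(','):
        if '<' in col:
            head, subs = col.split('<')
            names.extend(head + '/' + s for s in subs.rstrip('>').split(';'))
        else:
            names.append(col)
    return names

# Hypothetical two-line sample, not real imaging_table output.
sample = "Lane,Tile,P90<RED;GREEN>\n1,1101,3500,4200\n"

lines = io.StringIO(sample)
names = expand_header(lines.readline())   # consume the header line
rows = [dict(zip(names, r)) for r in csv.reader(lines)]
print(rows[0]["P90/GREEN"])
```

Once the header is expanded, every data field lines up with exactly one name.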
We can leave this issue open as a feature request if you really need a new header. We really put this in for people that use languages that are not covered, e.g. Matlab or Perl.
@ezralanglois Thanks for your quick and informative response.
I used imaging_table and summary with Perl - before that I parsed the InterOp files directly, also in Perl :-)
Now I am getting familiar with Python .. well, the docs of the Python bindings are - umm - quite "complicated", hard to read / work with for someone who is not an expert (yet). Some things are not possible out of the box, see #256 ..
I read the output of imaging_table directly from Python, fix some column names and create a pandas dataframe with only the dozen or so columns I am interested in (via io.StringIO). That works quite well and is not much code.
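Roughly along these lines, as a sketch only. It assumes the executable is on PATH and takes the run folder as its only argument, and that the header is the first line of the output. The column names in the sample are placeholders, and the standard csv module stands in for pandas to keep it dependency-free.

```python
import csv
import io
import re
import subprocess

def run_imaging_table(run_folder, exe="imaging_table"):
    """Return the tool's stdout as text.

    Assumption: `exe` is on PATH and accepts the run folder as its only argument.
    """
    result = subprocess.run([exe, run_folder], capture_output=True, text=True, check=True)
    return result.stdout

def fix_header(line):
    """Rewrite 'Name<A;B>' as 'Name/A,Name/B' so the header is plain CSV."""
    def expand(m):
        return ",".join(f"{m.group(1)}/{s}" for s in m.group(2).split(";"))
    return re.sub(r"([^,<]+)<([^>]*)>", expand, line)

def load_columns(text, wanted):
    """Return one dict per data row, restricted to the `wanted` columns."""
    header, _, body = text.partition("\n")
    reader = csv.DictReader(io.StringIO(fix_header(header) + "\n" + body))
    return [{k: row[k] for k in wanted} for row in reader]

# Hypothetical sample text standing in for the tool's output.
sample = "Lane,Tile,% Pass Filter,% Occupied,P90<RED;GREEN>\n1,1101,78.5,92.1,3500,4200\n"
print(load_columns(sample, ["% Pass Filter", "% Occupied", "P90/RED"]))
```

In real use, the text would come from run_imaging_table(run_folder) instead of the sample string.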
I just create two or three plots from that data, including %PF vs %Occ for NovaSeq runs.
For other parts I will definitely try/use the python bindings..
Just my 2p, but it took me some time to find the ';' in ~50 columns of tabular data as the explanation for my initial problems parsing the output (I simply didn't expect that). Dealing with that always requires extra work ("fixing" the headers). My suggestion would be to provide a "real" comma-separated header, to create a standard CSV output without surprises ;-)
Nevertheless ... nice piece of software, thanks for that :-)
Solved using python API, thanks!
Hi,
I'm trying to parse the output from the imaging table application, but the output header fields don't seem to match the data.
example
It seems to me like there's more data columns than headers.
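A quick way to check the mismatch, sketched with made-up lines rather than real output:

```python
# Hypothetical header and data lines, not real output.
header_line = "Lane,Tile,P90<RED;GREEN>"
data_line = "1,1101,3500,4200"

n_header = len(header_line.split(","))
n_data = len(data_line.split(","))
# Each 'Name<A;B;...>' field contributes one extra column per ';' it contains.
n_expected = n_header + header_line.count(";")

print(n_header, n_data, n_expected)
```

If the data count equals the header count plus the number of ';' characters, the extra columns come from grouped sub-headers rather than from a broken file.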
Any idea what's going wrong? We're using the latest version of the interop executables, compiled and installed by conda.
Thanks Matthias