eyurtsev / fcsparser

A python parser for reading fcs files supporting FCS 2.0, 3.0, 3.1
MIT License
73 stars 43 forks source link

Allow $NEXTDATA to be $ENDDATA+1 #2

Closed meihuisu closed 7 years ago

meihuisu commented 8 years ago

FCS files generated with Millipore have $NEXTDATA equals to $ENDDATA+1 instead of 0s

eyurtsev commented 8 years ago

PR to address issue: https://github.com/eyurtsev/fcsparser/issues/1

the specification (http://murphylab.web.cmu.edu/FCSAPI/FCS3.html) indicates that there should be a second data set that starts at position $NEXTDATA

the PR doesn't handle the 2nd data set.. do you know whether this is OK? (what's the 2nd data set encoding?) the best solution would be to try and parse that dataset as well -- although maybe not worth the time?

However, at the very least a warning should be logged to let the user know that there's a 2nd data set that has been ignored by the parser (import logging, logger = logging.getLogger(name), logger.warning(' '))

eyurtsev commented 8 years ago

could you attach an example fcs file with the $NEXTDATA variable ? :)

meihuisu commented 8 years ago

The following dataset should be similar to the first dataset. I plan to introduce a dataset_start marker so when a file is open it will seek there before extracting

eyurtsev commented 8 years ago

this PR needs the logger before it's ready for merging

I'd like to avoid changes that allow the parser to fail completely silently

meihuisu commented 8 years ago

Actually, it should not be merged. I am waiting for a final FCS test file before finalize the new code I am working on. I think the best solution is to make a branch from the master and then add this change and the new handling of multiple datasets FCS file as a separate entity. This pull request would make master fcsreader incorrect in term of FCS spec. What do you think?

----- Original Message ----- From: "Eugene Yurtsev" notifications@github.com To: "eyurtsev/fcsparser" fcsparser@noreply.github.com Cc: "meihuisu" mei@isi.edu Sent: Thursday, February 25, 2016 11:01:32 AM Subject: Re: [fcsparser] Allow $NEXTDATA to be $ENDDATA+1 (#2)

this PR needs the logger before it's ready for merging

I'd like to avoid changes that allow the parser to fail completely silently


Reply to this email directly or view it on GitHub: https://github.com/eyurtsev/fcsparser/pull/2#issuecomment-188931459

meihuisu commented 8 years ago

It looks like the multi-dataset FCS file I have received is more like a concatenation of multiple single dataset FCS files.

eyurtsev commented 8 years ago

I am not sure if I understand the reason for a 2nd branch. If the change has to break backwards compatibility, we could just add another function that handles reading of multiple fcs data sets.

(e.g., parse_multi_fcs or something along these lines)

although if the multi-dataset is a concatenation of multiple single datasets perhaps a single dataset should be returned?

meihuisu commented 8 years ago

What I have to work with is a multi-dataset FCS (almost) file that has incorrect header data and NEXTDATA entry because it is essentially many single dataset FCS (almost again) that got concatenated together. In addition, several of their numerical meta data values are improperly padded with space instead of '0' as specified in the 3.1 FCS standard.

In order to work with this data, I have altered fcsreader's NEXTDATA check as proposed by the pull request and also added a new variable, dataset_start, that tracks the start of next dataset. With these changes, I am able to call fcsreader repeatly and retrieve one dataset at a time.

In addition to parsing this master datafile, I am tasked to generate proper single dataset FCS files to be used for future reference. This includes going into the text content and fix all those improper padding and then write out various pieces from fcsreader's inner. fcsreader is very well written and so it was simple to trace and insert new code to do the writeout with minimum impact on the original code.

I have all these in working order with some final cleanup to be done but I think I have deviate too far to fold it in. Perhaps just the NEXTDATA and dataset_start ?

----- Original Message ----- From: "Eugene Yurtsev" notifications@github.com To: "eyurtsev/fcsparser" fcsparser@noreply.github.com Cc: "meihuisu" mei@isi.edu Sent: Wednesday, March 2, 2016 6:09:01 AM Subject: Re: [fcsparser] Allow $NEXTDATA to be $ENDDATA+1 (#2)

I am not sure if I understand the reason for a 2nd branch. If the change has to break backwards compatibility, we could just add another function that handles reading of multiple fcs data sets.

(e.g., parse_multi_fcs or something along these lines)

although if the multi-dataset is a concatenation of multiple single datasets perhaps a single dataset should be returned?


Reply to this email directly or view it on GitHub: https://github.com/eyurtsev/fcsparser/pull/2#issuecomment-191250307

eyurtsev commented 8 years ago

i'm not sure if I understand the logic exactly, so it'll be helpful to see some version of the full code

calling parse_fcs repeatedly should always yield the same result (a function should be stateless)

meihuisu commented 8 years ago

This is how the fcsreader is called to parse the multiple dataset FCS(almost) I have,

!/usr/bin/env python

import os import sys

import fcsparser path='usc4/exp_021916kv.EP5.FCS'

def get_one_fcs(track_nextdata) : meta, data = fcsparser.parse(path, dataset_start=track_nextdata, reformat_meta=False) print "sampleID:",meta['GTI$SAMPLEID']

if (meta['$NEXTDATA'] == 0): return -1; else: return track_nextdata+ meta['$NEXTDATA']

############ main ############################ file_size = os.path.getsize(path)

print "calling to extract just the first one.." get_one_fcs(0)

print "calling to extract all of them.." n=0 while (n != -1 and n < file_size): n=get_one_fcs(n)

----- Original Message ----- From: "Eugene Yurtsev" notifications@github.com To: "eyurtsev/fcsparser" fcsparser@noreply.github.com Cc: "meihuisu" mei@isi.edu Sent: Tuesday, March 8, 2016 2:53:48 PM Subject: Re: [fcsparser] Allow $NEXTDATA to be $ENDDATA+1 (#2)

i'm not sure if I understand the logic exactly, so it'll be helpful to see some version of the full code

calling parse_fcs repeatedly should always yield the same result (a function should be stateless)


Reply to this email directly or view it on GitHub: https://github.com/eyurtsev/fcsparser/pull/2#issuecomment-194005509

meihuisu commented 8 years ago

I have converted the ParserFeatureNotImplementedError on $NEXTDATA exceptions to logging.

prubbens commented 8 years ago

Extracting the data from the Flowcap II challenge from the flowrepository gives the same error. How do I solve this?

eyurtsev commented 8 years ago

If you're interested, you can work on this PR to implement support for dataset_num in the API. Otherwise you try finding a way of extracting the data using a different fcs importer (maybe in R?) and re-exporting it into a better formatted fcs file(s).

eyurtsev commented 7 years ago

Any interested in continuing work on this?

meihuisu commented 7 years ago

I actually have a working version but unfortunately I also added a fcswriter to it. It turns out there are some fcs files that are generated properly by the instrument but does not conform to the fcs spec completely. I had to rewrite them into proper fcs files and hence the fcswriter part.

----- Original Message ----- From: "Eugene Yurtsev" notifications@github.com To: "eyurtsev/fcsparser" fcsparser@noreply.github.com Cc: "meihuisu" mei@isi.edu, "Author" author@noreply.github.com Sent: Monday, April 24, 2017 8:14:52 PM Subject: Re: [eyurtsev/fcsparser] Allow $NEXTDATA to be $ENDDATA+1 (#2)

Any interested in continuing work on this?

-- You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub: https://github.com/eyurtsev/fcsparser/pull/2#issuecomment-296882580

eyurtsev commented 7 years ago

k i'll close the pull request for now, but feel free to file a PR if you've managed to fix some of those issues or would like to add an fcs writer