eyurtsev / fcsparser

A python parser for reading fcs files supporting FCS 2.0, 3.0, 3.1
MIT License

Unusual multi-FCS files #24

Open photocyte opened 3 years ago

photocyte commented 3 years ago

Hi there,

I've come across FCS files (from the Luminex Muse) that implement multi-FCS by simply concatenating single FCS files together. This was my solution to split them:

import glob

files = glob.glob("ADM_*.VIA.FCS")
for f in files:
    with open(f, "rb") as handle:
        data = handle.read()
    # Some FCS files are just literal concatenations of single FCS files; this splits them.
    split_data = data.split(b"FCS3.0")
    # split_data[0] is whatever precedes the first marker (empty if the file starts with it), so skip it.
    for s in range(1, len(split_data)):
        with open(f + "_" + str(s) + ".FCS", "wb") as out:
            out.write(b"FCS3.0" + split_data[s])

Once these multi-FCS files are split, fcsparser works perfectly, as far as I can tell. But it might be nice for the library to be able to detect these files by default! See attached for an example FCS: ADM_09SEP2020_181310.VIA.FCS.zip
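
As a rough idea of what such detection could look like, here is a minimal sketch that scans raw bytes for the FCS3.0 marker and reports where each embedded data set starts. The names find_fcs_starts and split_concatenated are hypothetical and not part of fcsparser, and the approach assumes the marker never appears by chance inside the binary DATA segment, which is not guaranteed.

```python
FCS_MAGIC = b"FCS3.0"  # version marker assumed to open each concatenated file


def find_fcs_starts(data, magic=FCS_MAGIC):
    """Return every byte offset at which `magic` occurs in `data`."""
    starts = []
    pos = data.find(magic)
    while pos != -1:
        starts.append(pos)
        pos = data.find(magic, pos + len(magic))
    return starts


def split_concatenated(data, magic=FCS_MAGIC):
    """Split `data` into chunks, each beginning at one occurrence of `magic`."""
    starts = find_fcs_starts(data, magic)
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]
```

More than one offset from find_fcs_starts would suggest a concatenated file. Unlike bytes.split, this keeps the marker attached to each chunk, so nothing has to be re-added when writing the pieces out.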

maaikesangster commented 2 years ago

Hello,

I had the same issue, thank you for this solution! I am using the cytoflow package for my parsing and analysis of FC data and I wanted to raise the issue with them as well. Do you mind if I use your example?

photocyte commented 2 years ago

@maaikesangster Please feel free

bpteague commented 2 years ago

Do note that fcsparser supports choosing which dataset in the file to parse out. You can use the data_set keyword argument to the FCSParser constructor. It's 0-indexed -- so data_set = 0 is the first data set, data_set = 1 is the second, etc.
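
A minimal sketch of looping over data sets with that keyword, assuming the caller already knows how many data sets the file holds. The helper parse_data_sets and its parse argument are hypothetical; the only fcsparser call used is fcsparser.parse(path, data_set=...), which returns a (meta, data) pair.

```python
def parse_data_sets(path, n_data_sets, parse=None):
    """Parse data sets 0 .. n_data_sets-1 from one file and return (meta, data) pairs.

    `parse` defaults to fcsparser.parse; it is a parameter only so the loop
    can be exercised without a real FCS file (an assumption of this sketch).
    """
    if parse is None:
        import fcsparser  # imported lazily so the helper loads without the package

        parse = fcsparser.parse

    results = []
    for index in range(n_data_sets):
        # data_set is 0-indexed: 0 is the first data set in the file.
        meta, data = parse(path, data_set=index)
        results.append((meta, data))
    return results
```

For example, parse_data_sets("ADM_12JAN2022_112816.VIA.FCS", 4) would attempt to read all four data sets of a four-way concatenated file.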

bpteague commented 2 years ago

(And @maaikesangster , cytoflow exposes the same functionality in ImportOp)

photocyte commented 2 years ago

For me, for a file with 4x concatenated FCS files, this works for data_set=0 and data_set=1, but for data_set=2 and data_set=3, it fails:

meta , data = fcsparser.parse(f,data_set=2)
Encountered an illegal utf-8 byte in the header.
 Illegal utf-8 characters will be ignored.
'utf-8' codec can't decode byte 0x8c in position 0: invalid start byte
20220112_files/ADM_12JAN2022_112816.VIA.FCS

All 4 files can be opened successfully when first separated via this approach (https://github.com/eyurtsev/fcsparser/issues/24#issue-697426922). Happy to share all 5 files (original + 4 split) if desired.

edit: here is the full error message

~/miniconda3/lib/python3.9/site-packages/fcsparser/api.py in parse(path, meta_data_only, compensate, channel_naming, reformat_meta, data_set, dtype)
    538     read_data = not meta_data_only
    539 
--> 540     fcs_parser = FCSParser(path, read_data=read_data, channel_naming=channel_naming,
    541                            data_set=data_set)
    542 

~/miniconda3/lib/python3.9/site-packages/fcsparser/api.py in __init__(self, path, read_data, channel_naming, data_set)
    105         if path:
    106             with open(path, 'rb') as f:
--> 107                 self.load_file(f, data_set=data_set, read_data=read_data)
    108 
    109     def load_file(self, file_handle, data_set=0, read_data=True):

~/miniconda3/lib/python3.9/site-packages/fcsparser/api.py in load_file(self, file_handle, data_set, read_data)
    117         while data_segments <= data_set:
    118             self.read_header(file_handle, nextdata_offset)
--> 119             self.read_text(file_handle)
    120             if '$NEXTDATA' in self.annotation:
    121                 data_segments += 1

~/miniconda3/lib/python3.9/site-packages/fcsparser/api.py in read_text(self, file_handle)
    215         #####
    216         # Parse the TEXT segment of the FCS file into a python dictionary
--> 217         delimiter = raw_text[0]
    218 
    219         if raw_text[-1] != delimiter:

IndexError: string index out of range

It seems data_set is looking to split on the string $NEXTDATA, whereas the example FCS files I've uploaded are just whole separate files that are concatenated, so they are instead separated by the FCS start bytes FCS3.0.

bpteague commented 2 years ago

@photocyte I'd love to add it to my collection of weird FCS files (: And if I can figure out the fix, I'll submit a pull request to @eyurtsev .

photocyte commented 2 years ago

Thanks @bpteague ! See linked zip file below. That has the _1,_2,_3,_4 split off FCS files, plus the original FCS file ADM_12JAN2022_112816.VIA.FCS.

20220112_files.zip

I also realized I previously uploaded a file here (https://github.com/eyurtsev/fcsparser/issues/24#issue-697426922) that should show the same phenomenon, but it may not have been split out yet.

bpteague commented 2 years ago

@photocyte Thanks for the file. I found the problem, and the fix is easy. In fcsparser.api, on line 125, replace

nextdata_offset = self.annotation['$NEXTDATA']

with

nextdata_offset += self.annotation['$NEXTDATA']

@eyurtsev, I'll put together a test case and a PR.
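
To illustrate why the one-character change matters, here is a toy sketch, assuming each data set's $NEXTDATA value is a byte offset relative to the start of that data set and that 0 means no further data set follows. The function names and the sizes are made up for illustration and are not taken from fcsparser or from the attached files.

```python
def accumulated_offsets(nextdata_values):
    """Absolute start of each data set when relative offsets are summed (the `+=` fix)."""
    offsets = [0]
    for relative in nextdata_values:
        if relative == 0:  # assumed convention: 0 marks the last data set
            break
        offsets.append(offsets[-1] + relative)
    return offsets


def overwritten_offsets(nextdata_values):
    """Start offsets when each relative value replaces the previous one (the `=` behaviour)."""
    offsets = [0]
    for relative in nextdata_values:
        if relative == 0:
            break
        offsets.append(relative)
    return offsets


# Four concatenated data sets of 5000, 3000, 4000 and some final size (made-up numbers).
nextdata = [5000, 3000, 4000, 0]
print(accumulated_offsets(nextdata))   # [0, 5000, 8000, 12000]
print(overwritten_offsets(nextdata))   # [0, 5000, 3000, 4000]
```

The two versions agree for the first two data sets and diverge from the third onward, which would be consistent with data_set=0 and data_set=1 working while data_set=2 and data_set=3 land in the middle of unrelated bytes.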