Handling for invalid UTF8 characters in headers

natl commented 7 years ago

I'm working with FCS2 files from a CellQuest Pro 6 cytometer, and I've found that the machine writes the byte \xaa, which is not valid UTF-8, into the header next to the machine name. I currently work around this by stripping the offending byte before reading the file with FCSparser, but would you be open to changing line 166 in api.py from

 raw_text = raw_text.decode('utf-8')

to

 raw_text = raw_text.decode('utf-8', errors='ignore')

Let me know if you'd like me to submit a pull request for the change. -Nathanael
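For readers who want to see the difference between the two calls, here is a minimal sketch. The header fragment is made up for illustration; only the \xaa byte comes from the report above.

```python
# Hypothetical header fragment: only the stray \xaa byte is taken from the
# report above, the surrounding text is invented for the example.
raw_text = b"$CYT/Example Cytometer\xaa/"

# Strict decoding (the default) raises on the first undecodable byte.
try:
    raw_text.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as e:
    strict_ok = False
    bad_position = e.start  # index of the first undecodable byte

# errors='ignore' silently drops the undecodable byte instead of raising.
cleaned = raw_text.decode("utf-8", errors="ignore")
```

Note that errors='ignore' discards the byte without any notice, so the caller cannot tell that the header was altered.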

eyurtsev commented 7 years ago

A PR would be appreciated. :)

Maybe do the following:

import warnings

try:
  raw_text = raw_text.decode('utf-8')
except UnicodeDecodeError as e:
  # Fall back to dropping the undecodable bytes, but let the user know
  raw_text = raw_text.decode('utf-8', errors='ignore')
  warnings.warn('Header contains invalid UTF-8 bytes; ignoring them: {}'.format(e))
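As a rough sketch of how this fallback could be exercised in isolation, the same idea can be wrapped in a small function. The name decode_header and the warning text are assumptions made for this example; they are not taken from fcsparser's api.py.

```python
import warnings


def decode_header(raw_bytes):
    """Decode header bytes as UTF-8, dropping invalid bytes with a warning.

    Hypothetical helper for illustration; not part of fcsparser.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError as e:
        # Surface the problem instead of silently altering the header.
        warnings.warn(
            "Header contains bytes that are not valid UTF-8; "
            "ignoring them. Details: {}".format(e)
        )
        return raw_bytes.decode("utf-8", errors="ignore")


# Usage: record warnings so the fallback path can be observed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    text = decode_header(b"machine\xaa name")
```

Valid input takes the strict path and produces no warning, so files without stray bytes behave exactly as before.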

natl commented 7 years ago

Good point. I'll have a PR for you in a day or two. Thanks.

eyurtsev commented 7 years ago

Thanks! Sorry for the delay in merging.

eyurtsev / fcsparser

Handling for invalid UTF8 characters in headers #7