broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License

Skip corrupted site files #79

Closed gwaybio closed 2 years ago

gwaybio commented 3 years ago

In a recent run, we observed the following error:

Now processing spots for XXXX-Well2-15...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Now processing spots for XXXX-Well2-16...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Traceback (most recent call last):
  File "recipe/0.preprocess-sites/1.process-spots.py", line 156, in <module>
    foci_df = pd.read_csv(foci_file)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 2036, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1011 fields in line 3, saw 1021

We need to add code to skip these sites, instead of completely erroring out.
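As a rough illustration of what skipping could look like, here is a minimal sketch that wraps the `pd.read_csv` call from the traceback. The helper name `read_site_csv` is hypothetical and not part of the recipe; the only assumption is that the failure surfaces as `pandas.errors.ParserError`, as it does above.

```python
import pandas as pd


def read_site_csv(path):
    """Return the parsed DataFrame, or None if pandas cannot tokenize the file.

    Hypothetical helper for illustration; not part of the recipe.
    """
    try:
        return pd.read_csv(path)
    except pd.errors.ParserError as err:
        # Rows with an unexpected number of fields (as in the traceback) land here.
        print(f"Skipping corrupted file {path}: {err}")
        return None
```

In the per-site loop, the caller would check for `None` and `continue` to the next site instead of crashing.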

ErinWeisbart commented 2 years ago

I'm replicating that run. It proceeded past the site that errored for Greg without a problem, but errored ~200 sites later instead. This suggests to me that the error is sporadic rather than tied to a particular file, so upon error we should first try re-downloading and re-parsing the Image.csv once, and skip the site only if it fails a second time.
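The retry-once-then-skip idea could be sketched roughly as follows. Everything here is an assumption for illustration: `load_with_one_retry`, `load`, and `refetch` are hypothetical names, and how a file actually gets re-downloaded is left to the caller. With pandas, `load` would be `pd.read_csv`; the default `errors=(ValueError,)` also covers `pandas.errors.ParserError`, which is a `ValueError` subclass.

```python
def load_with_one_retry(load, refetch, path, errors=(ValueError,)):
    """Try load(path); on failure call refetch(path) and try exactly once more.

    Returns the loaded object, or None if the second attempt also fails.
    All names are hypothetical; this is a sketch, not the recipe's code.
    """
    for attempt in (1, 2):
        try:
            return load(path)
        except errors as err:
            if attempt == 1:
                # First failure: fetch a fresh copy before the second attempt.
                print(f"Parse failed for {path} ({err}); re-downloading and retrying")
                refetch(path)
            else:
                # Second failure: give up on this site.
                print(f"Parse failed twice for {path}; skipping site")
    return None
```

A `None` return tells the caller to skip the site; any exception type not listed in `errors` still propagates as usual.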

If we are going to skip sites, I suggest we set a maximum number of sites that may be skipped and error out once that number is exceeded. It would be a bummer if the whole weld proceeded but only half the sites actually made it through processing.
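One possible shape for such a limit is a small counter that raises once too many sites have been skipped. The class names and the `max_skipped` parameter below are hypothetical, and the threshold value itself would need to be chosen per experiment.

```python
class TooManySkippedSites(RuntimeError):
    """Raised when more sites were skipped than the configured limit allows."""


class SkipBudget:
    """Track skipped sites and fail loudly once the limit is exceeded.

    Hypothetical helper for illustration; not part of the recipe.
    """

    def __init__(self, max_skipped):
        self.max_skipped = max_skipped
        self.skipped = []

    def record(self, site):
        # Remember the site so the final error (or a log) can list what was lost.
        self.skipped.append(site)
        if len(self.skipped) > self.max_skipped:
            raise TooManySkippedSites(
                f"{len(self.skipped)} sites skipped (limit {self.max_skipped}): "
                f"{self.skipped}"
            )
```

The processing loop would call `budget.record(site)` each time a site is skipped, so a run with widespread corruption stops early instead of silently finishing with most sites missing.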