Weld errored at aggregation

gwaybio commented 3 years ago

We've seen this error before; it happens when the machine is not large enough to write out the full single-cell file per plate.

Here is the error where it halted:

Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well2-25...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well1-53...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well6-21...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well1-17...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well6-3...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well2-61...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well2-7...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well3-33...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well5-4...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well5-31...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well2-43...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well1-71...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well1-8...
Building single file for dataset ALLBATCHES___ALLPLATES___ALLWELLS; combining single cells from site: CP257F-Well3-15...
Traceback (most recent call last):
  File "recipe/1.generate-profiles/1.aggregate.py", line 77, in <module>
    ), "Error! The single cell file does not exist! Check 0.merge-single-cells.py"
AssertionError: Error! The single cell file does not exist! Check 0.merge-single-cells.py
Now normalizing gene...with operation: standardize for spilt ALLBATCHES___ALLPLATES___ALLWELLS
Traceback (most recent call last):
  File "recipe/1.generate-profiles/2.normalize.py", line 90, in <module>
    df = read_csvs_with_chunksize(file_to_normalize)
  File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/CP257-HeLa-WG/recipe/scripts/io_utils.py", line 28, in read_csvs_with_chunksize
    with pd.read_csv(filename, chunksize=chunksize, **kwargs) as reader:
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1368, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/common.py", line 594, in get_handle
    **compression_args,
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'data/1.profiles/20210422_6W_CP257/profiles/20210422_6W_CP257_gene_ALLBATCHES___ALLPLATES___ALLWELLS.csv.gz'
Now performing feature selection for gene...with operations: ['variance_threshold', 'correlation_threshold', 'drop_na_columns', 'blocklist', 'drop_outliers'] for spilt ALLBATCHES___ALLPLATES___ALLWELLS
Traceback (most recent call last):
  File "recipe/1.generate-profiles/3.feature-select.py", line 93, in <module>
    df = read_csvs_with_chunksize(file_to_feature_select)
  File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/CP257-HeLa-WG/recipe/scripts/io_utils.py", line 28, in read_csvs_with_chunksize
    with pd.read_csv(filename, chunksize=chunksize, **kwargs) as reader:
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1867, in __init__
    self._open_handles(src, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1368, in _open_handles
    storage_options=kwds.get("storage_options", None),
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/common.py", line 594, in get_handle
    **compression_args,
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/gzip.py", line 163, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: 'data/1.profiles/20210422_6W_CP257/profiles/20210422_6W_CP257_gene_normalized_ALLBATCHES___ALLPLATES___ALLWELLS.csv.gz'
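
Note that the two FileNotFoundError tracebacks are downstream of the first AssertionError: aggregation never wrote its output, so normalization and feature selection had nothing to read. A minimal sketch of a guard that reports missing inputs before a step starts is below. The helper name `missing_inputs` is hypothetical and is not part of the recipe.

```python
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[str]) -> List[str]:
    """Return the entries of `paths` that are not existing files.

    Hypothetical helper: call it at the top of a step so a failed
    upstream step surfaces as one clear message instead of a long
    pandas/gzip traceback.
    """
    return [p for p in paths if not Path(p).is_file()]
```

For example, 2.normalize.py could call `missing_inputs([file_to_normalize])` and exit early with the returned list if it is not empty.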

gwaybio commented 3 years ago

I've overcome this issue by generating plate-level profiles. However, a new error appeared:

Now normalizing gene...with operation: standardize for spilt ALLBATCHES___CP257A___ALLWELLS
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:847: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:689: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
Now normalizing guide...with operation: standardize for spilt ALLBATCHES___CP257A___ALLWELLS
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:847: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:689: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
Now normalizing single_cell...with operation: standardize for spilt ALLBATCHES___CP257A___ALLWELLS
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:847: RuntimeWarning: invalid value encountered in true_divide
  updated_mean = (last_sum + new_sum) / updated_sample_count
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/sklearn/utils/extmath.py:689: RuntimeWarning: Degrees of freedom <= 0 for slice.
  result = op(x, *args, **kwargs)
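
Warnings like these are typically emitted when a mean or variance is computed over a column that has no non-missing values, so the sample count is zero. Whether that is the cause here is an assumption; nothing in this thread confirms it. A minimal sketch for listing such columns before standardizing, using a plain dict of columns and a hypothetical helper name:

```python
import math
from typing import Dict, List, Optional, Sequence


def columns_without_values(
    table: Dict[str, Sequence[Optional[float]]]
) -> List[str]:
    """Return names of columns whose values are all None or NaN.

    Hypothetical helper. A column with zero rows is also reported,
    since it has nothing to average either.
    """
    return [
        name
        for name, values in table.items()
        if all(v is None or math.isnan(v) for v in values)
    ]
```

With a pandas DataFrame `df`, `df.columns[df.isna().all()]` selects the same columns.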

MerajRamezani commented 3 years ago

@gwaygenomics I ran normalization and feature selection on the plate-level profiles only, after aggregation. After a successful test on my local computer, I ran the last two steps of the recipe (normalization and feature selection) for the rest of the files on AWS, after removing "- single_cell" from levels in the options.yaml config file.

ErinWeisbart commented 2 years ago

@gwaygenomics Am I understanding "I've overcome this issue by generating plate-level profiles" to say that you were never able to run 1./1.aggregate without splitting the data further because it takes too much memory?

If I need to make the same split while processing a different batch, do I change config/experiment.yaml to the following?

  split:
    qc:
      batches: false
      plates: false
      wells: false
    profile:
      batches: false
      plates: true
      wells: false

And then, @MerajRamezani, you're saying that after aggregation by plate, for 1./2.normalize and 1./3.feature-select you set config/options.yaml to

    levels:
      - gene
      - guide

You're saying that was necessary to avoid the error Greg mentioned above?

ErinWeisbart commented 2 years ago

@gwaygenomics @MerajRamezani can you take a look at this so I can get un-stuck? Thanks!

MerajRamezani commented 2 years ago

@ErinWeisbart My understanding was that the weld process was failing while normalizing the single-cell profiles. My guess is that normalizing profiles from all cells in a plate overloaded the memory. Having single-cell profiles normalized at the plate level is useful, but it is not essential to start with. So I took the single-cell profiles in one csv.gz (at plate level), aggregated them at both the gene and guide levels, and then ran normalization at the plate level followed by feature selection.
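
If memory is the bottleneck, one option is to aggregate without loading the whole plate at once. The sketch below streams a gzipped CSV row by row and keeps only running sums and counts per group. It is an illustration under assumptions: it computes a mean (the recipe's actual aggregation operation is not stated in this thread), and the function name and any column names passed to it are hypothetical.

```python
import csv
import gzip
from collections import defaultdict
from typing import Dict, List


def streaming_group_mean(
    path: str, group_col: str, feature_cols: List[str]
) -> Dict[str, Dict[str, float]]:
    """Per-group mean of each feature, reading one row at a time.

    Memory use grows with the number of groups and features, not with
    the number of cells. Empty and NaN cells are skipped; a feature
    with no usable values in a group is left out of that group's
    result, and a group with no usable values at all is omitted.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            group = row[group_col]
            for col in feature_cols:
                value = row[col]
                if value == "" or value.lower() == "nan":
                    continue
                sums[group][col] += float(value)
                counts[group][col] += 1
    return {
        group: {
            col: sums[group][col] / counts[group][col]
            for col in feature_cols
            if counts[group][col] > 0
        }
        for group in sums
    }
```

The same accumulate-then-divide pattern does not extend to a median, which needs all values for a group at once, so this is a trade-off rather than a drop-in replacement.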
