chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.

Update Migration and Processing Jobs to Provision Resources based on Dataset Requirements #543

Closed nayib-jose-gloria closed 11 months ago

nayib-jose-gloria commented 1 year ago

Currently, we provision the same resources to all jobs of the same type. However, some datasets are smaller and can get by equally well with smaller machines. Allocate resources dynamically based on dataset size, either fully parameterized if feasible or into several representative 'buckets' (e.g. jobs are assigned to one of Machines 1-5 based on dataset size). This will make migrations faster (room for more concurrent jobs in the compute environment) and cheaper (smaller machines provisioned for many jobs).

Bento007 commented 1 year ago

AWS SFN does not let you modify vCPU or memory using a variable in the step function. However, the JobDefinition in the SFN can be parameterized, and each job definition can be tailored to process a dataset of a specific size. Whichever step determines the resources that should be used should return the JobDefinition ARN to the next step, which will then use that job definition for processing.

To parameterize the JobDefinition from the previous job's result in the step function definition:

{
  "States": {
    "StepName": {
      "Parameters": {
        "JobDefinition.$": "$.result.job_definition"
      }
    }
  }
}
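
For example, a minimal sketch of the sizing step (all names, thresholds, and ARNs below are hypothetical, and how its output lands at $.result.job_definition depends on the state's ResultPath):

GiB = 2 ** 30

# Hypothetical buckets: (upper bound on estimated memory in bytes, job definition ARN).
JOB_DEFINITIONS = [
    (8 * GiB, "arn:aws:batch:us-west-2:000000000000:job-definition/processing-small"),
    (32 * GiB, "arn:aws:batch:us-west-2:000000000000:job-definition/processing-medium"),
]
DEFAULT_JOB_DEFINITION = "arn:aws:batch:us-west-2:000000000000:job-definition/processing-large"


def select_job_definition(estimated_memory_bytes: int) -> dict:
    """Return a payload the next state can read as $.result.job_definition."""
    for upper_bound, arn in JOB_DEFINITIONS:
        if estimated_memory_bytes <= upper_bound:
            return {"result": {"job_definition": arn}}
    return {"result": {"job_definition": DEFAULT_JOB_DEFINITION}}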
Bento007 commented 1 year ago

We write the h5ad files using gzip with the default compression level of 4. There is no surefire way of knowing the uncompressed size without decompressing the data. I found the command gunzip -c file.gz | wc --bytes in this forum.

This will uncompress the file without storing the result, instead streaming it to wc, which counts the bytes as they pass and then discards them.

Something like this can be used to determine the size of the uncompressed file.
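
As a rough sketch, the same command can be wrapped from Python with subprocess (the helper name is made up here):

import subprocess


def uncompressed_size_bytes(gzip_path: str) -> int:
    """Stream the gzip file through wc --bytes without writing the decompressed data to disk."""
    # Equivalent to: gunzip -c file.gz | wc --bytes
    gunzip = subprocess.Popen(["gunzip", "-c", gzip_path], stdout=subprocess.PIPE)
    wc = subprocess.run(["wc", "--bytes"], stdin=gunzip.stdout, capture_output=True, text=True)
    gunzip.stdout.close()
    gunzip.wait()
    return int(wc.stdout.strip())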

Bento007 commented 1 year ago

Since the h5ad file is an HDF5 file, we can use the nbytes attribute to get the size of X and raw.X, which are the largest chunks of data in the dataset.
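
For illustration, a sketch that reads those sizes directly from the HDF5 file with h5py (the key names assume the standard h5ad layout, and the helper is hypothetical):

import h5py


def x_nbytes(h5ad_path: str) -> int:
    """Size in bytes of the larger of X and raw/X as stored in the HDF5 file."""
    with h5py.File(h5ad_path, "r") as f:
        sizes = []
        for key in ("X", "raw/X"):
            if key not in f:
                continue
            node = f[key]
            if isinstance(node, h5py.Dataset):
                # Dense matrix stored as a single dataset.
                sizes.append(node.nbytes)
            else:
                # Sparse matrix stored as a group of data/indices/indptr datasets.
                sizes.append(sum(d.nbytes for d in node.values()))
        return max(sizes) if sizes else 0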

Bento007 commented 1 year ago

There are two different formats of X that need to be supported: SparseDataset and Dataset. The Dataset format exposes nbytes, but SparseDataset doesn't have that attribute. For a SparseDataset you can add up the three arrays that make up the sparse matrix (data, indices, indptr) to get an idea of the size.

import anndata as ad
import h5py


def estimated_memory_usage(adata: ad.AnnData) -> int:
    """
    Estimate the memory usage of an AnnData object in bytes.
    """
    # Rough fallback: assume roughly one byte per element of the dense matrix.
    size = adata.n_obs * adata.n_vars
    # SparseDataset lives in a private module whose path has moved between anndata versions.
    if isinstance(adata.X, ad._core.sparse_dataset.SparseDataset):
        # Sum the arrays (data, indices, indptr) backing the sparse matrix.
        size = sum(adata.X.group[key].nbytes for key in adata.X.group.keys())
    elif isinstance(adata.X, h5py.Dataset):
        # Dense matrix: take the larger of X and raw.X (when raw is present).
        size = adata.X.nbytes
        if adata.raw is not None:
            size = max(size, adata.raw.X.nbytes)
    return size
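
For example, this could be called on a dataset opened in backed mode so X stays on disk (the file name is hypothetical):

adata = ad.read_h5ad("dataset.h5ad", backed="r")
print(estimated_memory_usage(adata))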

In general this still doesn't give us an accurate idea of how much memory is needed. To verify how much memory is actually used, AnnData.to_memory() was run and I then checked how much memory the process consumed. The estimated and measured memory usage were off by about half.

A closer measurement of the memory needed was achieved by using adata.n_obs * adata.n_vars. This method is not as refined and likely overshoots, but it's better to overestimate than to underestimate.
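
A sketch of that kind of check, assuming psutil is available; it compares the process's resident set size before and after pulling a backed dataset into memory:

import anndata as ad
import psutil


def measure_actual_memory(h5ad_path: str) -> int:
    """Rough measurement of the memory consumed by loading a backed dataset."""
    process = psutil.Process()
    adata = ad.read_h5ad(h5ad_path, backed="r")
    before = process.memory_info().rss
    adata = adata.to_memory()  # pull X (and everything else) into RAM
    after = process.memory_info().rss
    return after - before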

Bento007 commented 1 year ago

adata.__sizeof__() also works, but requires loading the dataset into memory.