Closed sjfleming closed 1 month ago
There is a Filter transform, but as you say it might run too late if you are worried about memory/speed. The other place to add filtering before the data is moved to the GPU would be convert_fn. For that we probably need to figure out how to provide the list of genes to subset to.
@ordabayevy These two commits provide a (working) rough idea. But I don't know if the changes to core data.py are clean enough. It could be improved.
https://github.com/cellarium-ai/cellarium-ml/commit/e21e7d9ddb2abe1055199c43acc0848168b144c7 https://github.com/cellarium-ai/cellarium-ml/commit/a7ed6c78cec520f03ed0bc7a0f600109ea3ecda9
In particular this config file https://github.com/cellarium-ai/cellarium-ml/commit/a7ed6c78cec520f03ed0bc7a0f600109ea3ecda9 is ugly because of the repetition that would (seemingly?) be needed for X and var_names_g. Hopefully we can find a more elegant way to handle it. The referenced csv file just contains a list of gene names delimited by \n with no header.
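To illustrate the repetition problem, here is a rough sketch of one way the gene list could be parsed once and then reused for both X and var_names_g. This is not the code in the commits above: the function names (parse_gene_list, gene_indices, subset_x, subset_var_names) are hypothetical, and it assumes X is a dense NumPy array of shape (cells, genes).

```python
import numpy as np


def parse_gene_list(text):
    """Parse newline-delimited gene names (no header) into a list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def gene_indices(var_names, keep):
    """Column positions of `keep` within `var_names`, in the order of `keep`."""
    position = {name: i for i, name in enumerate(var_names)}
    missing = [g for g in keep if g not in position]
    if missing:
        raise KeyError(f"genes not found in var_names: {missing}")
    return np.array([position[g] for g in keep], dtype=np.int64)


def subset_x(x, var_names, keep):
    """Hypothetical convert_fn for X: keep only the requested gene columns."""
    return x[:, gene_indices(var_names, keep)]


def subset_var_names(var_names, keep):
    """Hypothetical convert_fn for var_names_g: the matching gene names."""
    return np.asarray(var_names)[gene_indices(var_names, keep)]
```

Both convert functions go through the same gene_indices helper, so the gene list only has to be provided in one form, although the config would still need to pass it to each function.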
I think adding convert_fn_kwargs is a nice extension! But I agree that additional logic in data.py is not very elegant. It almost feels like there might be a need for additional transforms before the data is moved to device 🤔. Here are multiple ideas:
Filter transform? It should be a constant overhead and if the batch size is not too big it might be unnoticeable.
CellariumModule and use it somehow. For example (just an idea), maybe we can mark some transforms (Filter in this case) to act inside the on_before_batch_transfer hook on CPU.
Filter transform. I haven't tried.
@ImXman todo:
on_before_batch_transfer() method here (https://github.com/cellarium-ai/cellarium-ml/blob/main/cellarium/ml/core/module.py) and somehow we would need to be able to tell (based on the config file, and also for instantiation of CellariumPipeline in general -- https://cellarium-ai.github.io/cellarium-ml/core.html#cellarium.ml.core.CellariumPipeline) which transforms occur before transfer to GPU and which occur after. This will require some careful thought and some additional code.
This should be closed by #223
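As a rough sketch of the "mark some transforms" idea, the pipeline could be split into a CPU stage and a device stage based on a flag on each transform. This does not describe what #223 actually implements: the before_transfer attribute, the GeneFilter and Log1p classes, and the split_by_stage helper are all hypothetical, and plain NumPy stands in for tensors so the example runs without Lightning.

```python
import numpy as np


class GeneFilter:
    """Hypothetical transform marked to run on CPU, before device transfer."""

    before_transfer = True

    def __init__(self, keep):
        self.keep = list(keep)

    def __call__(self, batch):
        position = {name: i for i, name in enumerate(batch["var_names_g"])}
        cols = [position[g] for g in self.keep]
        return {**batch, "X": batch["X"][:, cols], "var_names_g": np.array(self.keep)}


class Log1p:
    """Hypothetical transform left to run after device transfer."""

    before_transfer = False

    def __call__(self, batch):
        return {**batch, "X": np.log1p(batch["X"])}


def split_by_stage(transforms):
    """Partition transforms into (before transfer, after transfer), keeping order."""
    cpu = [t for t in transforms if getattr(t, "before_transfer", False)]
    device = [t for t in transforms if not getattr(t, "before_transfer", False)]
    return cpu, device


def run(transforms, batch):
    """Apply transforms to the batch in order."""
    for transform in transforms:
        batch = transform(batch)
    return batch
```

In a real module, the first list would be applied inside on_before_batch_transfer and the second in the usual forward path, so only the filtered columns of X are ever copied to the GPU.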
But let's wait and actually test it @ImXman
Closed by #223
We want a way to on-the-fly subset to a specific list of genes.
We do not want the dataloader to load the entire set of genes because this would be a big waste of memory / cuda memory.
Should this be a transform? It seems like a transform happens too late... I think we'd really like this to be part of the dataloader itself. Any thoughts @ordabayevy ?
What if we just wrote a new convert_fn for X to be used here in the config file?