chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
88 stars 22 forks source link

Experiment Data Pipe doesn't return soma_joinids #1289

Closed yanwu2014 closed 1 month ago

yanwu2014 commented 2 months ago

The ExperimentDataPipe no longer returns soma_joinids by default if nothing is specified in the obs_column_names parameter

Specifically:

datapipe = ml.pytorch.ExperimentDataPipe(
            experiment=my_soma_experiment,
            measurement_name='RNA',
            obs_query=obs_query,
            obs_column_names=[], ## No obs columns specified
)
tensor, soma_ids = next(iter(datapipe))

I expected soma_ids to be a tensor of obs indexes but instead it's an empty tensor. It seems like it's due to a potential bug in the _ObsAndXIterator class on line 340 in pytorch.py where: obs["soma_joinid"] = obs.index should instead be obs_encoded["soma_joinid"] = obs.index since obs_encoded is what gets turned into obs_tensor and returned

Happy to put together a PR if this seems like the issue

ivirshup commented 1 month ago

Thanks for the bug report @yanwu2014!

This change was actually intentional. We figured that it would be easier to work with the loader if there was less implicit behavior, especially when that implicit behavior ends up complicating the offsets you need to access the encoded variables. Now if you want the soma join ids you can access them by explicitly requesting them.

yanwu2014 commented 1 month ago

Ah I see, thanks for clarifying! Just a heads up I think the docstrings still say that the soma join ids will always be returned but I'll close this issue for now