chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
84 stars 22 forks source link

Add unit tests for PyTorch classes #477

Open atolopko-czi opened 1 year ago

atolopko-czi commented 1 year ago

Census PyTorch ExperimentDataPipe unit test coverage is already substantial. However, there are a few stubbed tests that could implemented.

In addition add tests for:

atolopko-czi commented 1 year ago

Added unit tests for various batching cases, sparse/dense output, and encoders

prathapsridharan commented 4 months ago

@pablo-gar (cc: @atolopko-czi @ebezzi ) - Is this intentionally left open to serve as a placeholder for unit test tasks for pytorch? I ask because the ticket is was created in May 2023.

If so, should we create tickets for specific test cases for pytorch? For example, we don't have any testing for the scatter-gather-shuffle algorithm - It is a non-deterministic algorithm and testing it needs some thought. Should we make a specific ticket for this case and link it to this ticket?

atolopko-czi commented 4 months ago

This is definitely an old ticket!

There are two stubbed/ignored tests that are intended to test coverage for https://github.com/chanzuckerberg/cellxgene-census/blob/96c0499cd83521d169220f37555724bc9c67f114/api/python/cellxgene_census/src/cellxgene_census/experimental/ml/pytorch.py#L547-L555. But honestly, we need to eliminate the multi-worker support entirely (it's redundant w/TileDB threaded I/O), so IMO those test stubs can be deleted.

Regarding the other two test suggestions, above:

 The obs / X data is joined correctly (X rows correspond to their respective obs soma_joinids)

This appears to be implicitly tested by various methods such as test_batching__all_batches_full_size, test_batching__exactly_one_batch, and test_experiment_dataloader__batched, where both the obs and X contents of a batch are asserted in tandem. Could still be worth adding a test that is named appropriately to explicitly capture the requirement, e.g. test_obs_and_X_are aligned.

 The var / X data joined correctly (gene ordering matches var query filter)

Similarly, this is implicitly tested by tests that assert the ordered values within a row of X, but I would add an explicit test for this. And particularly for sparse tensors, I would test that the expected data at a given var coordinate is correct.