chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License
76 stars 20 forks source link

Add support for custom encoders #1069

Open ridvan-eksi opened 4 months ago

ridvan-eksi commented 4 months ago

Current version creates a label encoder by default for each string column queried. This PR extends this capability to provide custom encoders for columns including non-string types.

atolopko-czi commented 4 months ago

@ridvan-eksi Thanks for the contribution! Can you tell us a little more about how you're using the custom encoders for non-string columns? In case these non-numeric encoders might be useful to others, by default.

ridvan-eksi commented 4 months ago

@atolopko-czi Main use case for users providing encoders would be to use fixed mappings across different datapipe instances. In the current version, a new encoder is generated if the data release or query changes, but we would like to use consistent mappings even if the data or query expands.

A second use case would be using encoders other than one hot encoding for string columns, like ordinal encoder. Ordinal encoding is required for using learnable embeddings.

A third, and maybe a more hacky use case for these custom encoders would be allowing more complex functions rather than encoders to process non-string/string columns. This would be similar to transform concept in torchvision datasets and datapipes. For example, we might have a function to parse/extract some information from development_stage column. We can do that by providing the parse function as a custom encoder.

I hope it's clear. As far as I can see, these encoder class should have a transform and inverse_transform methods which should be implemented in all custom encoders. (even though in some cases inverse_transform may not be possible)

github-actions[bot] commented 3 months ago

This PR has not seen any activity in the past 4 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.

github-actions[bot] commented 3 months ago

This PR was closed because it has been inactive for 31 days, 3 days since being marked as stale. Please re-open if you still need this to be addressed.

atolopko-czi commented 3 months ago

Reviving PR; we still intend to review and incorporate this change.

pablo-gar commented 3 months ago

Thanks for PR @ridvan-eksi. All of those use cases are valuable. We'd like to incorporate the changes, and it looks like some error checking may be needed and along some unit tests.

For the unit test, we'd welcome any suggestions you may have, otherwise we'll discuss this internally.

ivirshup commented 3 months ago

@ridvan-eksi, if you're still interested in seeing this merged, would you mind merging main into this branch? Then we'll be able to see all the CI checks run.

It'd also be good to see a test for this, since at the moment it looks to me like any custom encoders passed in here would be overwritten during initialization.

ridvan-eksi commented 2 months ago

Hi @ivirshup. I will merge main into this branch, and I will add a test.

If I am not mistaken: If user provides a custom encoder for a specific column, that encoder is used for that column. For the remaining columns, the default encoder is used.

github-actions[bot] commented 3 hours ago

This PR has not seen any activity in the past 4 weeks; if no one comments or reviews it in the next 3 days, this PR will be closed.