chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

[builder] CxG schema 5 / Census schema 2 #1024

Closed bkmartinjr closed 5 months ago

bkmartinjr commented 6 months ago

Builder support for Census schema 2.0.0 / CELLxGENE schema 5.0.0

Fixes: #993 Fixes: #1022 Fixes: #796

Primary changes:

codecov[bot] commented 6 months ago

Codecov Report

Attention: Patch coverage is 72.34043%, with 26 lines in your changes missing coverage. Please review.

Project coverage is 81.33%. Comparing base (06ee454) to head (bcb0814). Report is 3 commits behind head on main.

| Files | Patch % | Lines |
|---|---|---|
| ...llxgene_census_builder/build_soma/validate_soma.py | 0.00% | 8 Missing :warning: |
| ...ne_census_builder/build_soma/experiment_builder.py | 30.00% | 7 Missing :warning: |
| ...src/cellxgene_census_builder/build_soma/anndata.py | 75.00% | 4 Missing :warning: |
| ...lder/src/cellxgene_census_builder/build_soma/mp.py | 50.00% | 4 Missing :warning: |
| .../cellxgene_census_builder/build_soma/build_soma.py | 89.28% | 3 Missing :warning: |
Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #1024   +/-   ##
=======================================
  Coverage   81.32%   81.33%
=======================================
  Files          73       73
  Lines        5553     5566   +13
=======================================
+ Hits         4516     4527   +11
- Misses       1037     1039    +2
```

| [Flag](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1024/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/chanzuckerberg/cellxgene-census/pull/1024/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg) | `81.33% <72.34%> (+<0.01%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=chanzuckerberg#carryforward-flags-in-the-pull-request-comment) to find out more.


bkmartinjr commented 6 months ago

@pablo-gar - thanks for the assay term changes.

Question/open issue: the Census schema also defines special handling of the normalized layer for any "Smart-Seq" assays. Can we add the definitive list of EFO terms that are considered Smart-Seq assays? Ideally this would be included in the Census schema (or referenced, as you did with the assay filter terms).

prathapsridharan commented 5 months ago

@bkmartinjr - Regarding issue #993, there is this point:

Confirm no use of CL:0000003, aka native cell. These cells will now be marked as unknown in both obs.cell_type and obs.cell_type_ontology_term_id columns. See related task: https://github.com/chanzuckerberg/cellxgene-census/issues/1019

Once a test build is produced, should we check this manually in the Python interpreter, by counting occurrences of CL:0000003 in obs.cell_type_ontology_term_id and of native cell in obs.cell_type, and confirming that both counts are zero?

Or should this be captured in some type of post-build acceptance test that runs sanity checks on the data?
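
For illustration, the manual count described above might look roughly like the sketch below. It is not part of the builder or of any existing acceptance test. The helper name `count_forbidden` and the toy data are hypothetical. The function only assumes that the obs table can be indexed by column name and that each column is iterable, which holds for a dict of lists and should also hold for a pandas DataFrame.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_forbidden(
    obs: Mapping[str, Iterable[str]], forbidden: Mapping[str, str]
) -> dict[str, int]:
    """Return, per column, how many rows hold that column's forbidden value."""
    return {
        column: Counter(obs[column])[value]
        for column, value in forbidden.items()
    }


# Column/value pairs named in the point quoted from issue #993.
FORBIDDEN = {
    "cell_type_ontology_term_id": "CL:0000003",
    "cell_type": "native cell",
}

# Toy stand-in for the obs table (hypothetical rows, not from a real build).
# The last row deliberately violates the rule so the check has something to find.
toy_obs = {
    "cell_type_ontology_term_id": ["CL:0000540", "unknown", "CL:0000003"],
    "cell_type": ["neuron", "unknown", "native cell"],
}

counts = count_forbidden(toy_obs, FORBIDDEN)
print(counts)
```

On a compliant build, every count would be expected to be zero; a post-build sanity check could simply assert that.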

bkmartinjr commented 5 months ago

@prathapsridharan re:

are we to check

This is entirely an upstream issue in the DP process, and the builder does not enforce (or check) this level of metadata compliance with the CxG schema. Such compliance checks are the province of the schema validation toolkit used by Lattice, et al.

We could do these kinds of checks, but they are redundant and add linkages across layers in the system that don't add much value, IMHO.

bkmartinjr commented 5 months ago

@atarashansky - we have included your requested organisms info (#796) in Census schema 2.0.0. Current content will be:

In [4]: census['census_info']['organisms'].read().concat().to_pandas().set_index('soma_joinid')
Out[4]: 
            organism_ontology_term_id organism_label      organism
soma_joinid                                                       
0                      NCBITaxon:9606   Homo sapiens  homo_sapiens
1                     NCBITaxon:10090   Mus musculus  mus_musculus

Please feel free to leave comments on both the Schema MD file changes and the code.

pablo-gar commented 5 months ago

Light QC on test build shows no issues

Checks

See notebook https://colab.research.google.com/drive/1kb2ZR0MPxVsWBgJlIgrtJrkGPx5wqxIk