chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Drop invalid cube values in memento builder #940

Open atolopko-czi opened 9 months ago

atolopko-czi commented 9 months ago

As of 8f3ed5fa0fc05dad381adda79e2cf502fe9e43bc, the diff expr method drops cube values where `(estimators_df["sem"] <= 0) | (estimators_df["sem"] >= estimators_df["mean"])`. This pruning prevents mathematical errors when transforming the estimators into log space: it avoids taking `log(mean)` when `mean <= 0` and `log(mean - sem)` when `mean - sem <= 0`.
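
A minimal sketch of that pruning step, assuming the estimators live in a pandas DataFrame with `mean` and `sem` columns as in the condition quoted above. The helper name `drop_invalid_estimators` and the toy rows are hypothetical and are not taken from the memento builder code.

```python
import numpy as np
import pandas as pd


def drop_invalid_estimators(estimators_df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that cannot be moved into log space (hypothetical helper)."""
    # Same condition as quoted in the issue text.
    invalid = (estimators_df["sem"] <= 0) | (
        estimators_df["sem"] >= estimators_df["mean"]
    )
    return estimators_df[~invalid]


# Toy cube rows, invented for illustration only.
toy = pd.DataFrame(
    {
        "mean": [2.0, 1.5, 0.4, 3.0],
        "sem": [0.5, 0.0, 0.9, 3.0],
    }
)

kept = drop_invalid_estimators(toy)

# Every kept row satisfies mean > sem > 0, so both logarithms are finite.
log_mean = np.log(kept["mean"])
log_lower = np.log(kept["mean"] - kept["sem"])
```

Only the first toy row survives: the second has `sem == 0`, and the last two have `sem >= mean`.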

This drops ~1-3% of cube data, depending upon the query filter specified by the user.
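
One way to measure the dropped fraction for a given query filter is sketched below. The `cell_type` column stands in for whatever the user filters on, and the data are invented, so the resulting fractions say nothing about the real cube.

```python
import pandas as pd

# Toy cube rows; `cell_type` is a hypothetical filter column.
toy = pd.DataFrame(
    {
        "cell_type": ["A", "A", "B", "B", "B"],
        "mean": [2.0, 1.0, 0.4, 3.0, 5.0],
        "sem": [0.5, 0.0, 0.9, 1.0, 2.0],
    }
)


def dropped_fraction(df: pd.DataFrame) -> float:
    """Fraction of rows matching the drop condition quoted in the issue."""
    invalid = (df["sem"] <= 0) | (df["sem"] >= df["mean"])
    # The mean of a boolean mask is the share of True values.
    return float(invalid.mean())


overall = dropped_fraction(toy)
only_a = dropped_fraction(toy[toy["cell_type"] == "A"])
only_b = dropped_fraction(toy[toy["cell_type"] == "B"])
```

Because the invalid rows are not spread evenly across filter values, the fraction differs between `only_a` and `only_b`, which is the effect described above.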

The zero-valued sem cube elements constitute 1.7% of the cube and are computed from n_obs counts ranging from 1 to 14:

count    1.875726e+07
mean     1.107691e+00
std      3.761035e-01
min      1.000000e+00
25%      1.000000e+00
50%      1.000000e+00
75%      1.000000e+00
max      1.400000e+01

The `sem > mean` cube elements constitute 0.1% of the cube and are computed from n_obs counts ranging primarily from 2 to 5:

count    1.061575e+06
mean     6.328915e+00
std      1.474041e+01
min      2.000000e+00
25%      2.000000e+00
50%      3.000000e+00
75%      5.000000e+00
max      2.726000e+03
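
Summaries like the two above can be produced with `Series.describe()` on the `n_obs` values of each dropped group. This is a sketch on invented rows. It assumes the cube exposes an `n_obs` column alongside `mean` and `sem`, and it splits the drop condition into two disjoint masks, which may not match how the original numbers were computed.

```python
import pandas as pd

# Toy cube rows, invented for illustration only.
toy = pd.DataFrame(
    {
        "n_obs": [1, 1, 2, 3, 50, 120],
        "mean": [1.0, 2.0, 0.5, 0.6, 4.0, 7.0],
        "sem": [0.0, 0.0, 0.7, 0.9, 0.5, 0.3],
    }
)

# Split the drop condition into two non-overlapping groups.
zero_sem = toy["sem"] <= 0
sem_ge_mean = (toy["sem"] > 0) & (toy["sem"] >= toy["mean"])

# describe() reports count, mean, std, min, quartiles, and max.
zero_sem_summary = toy.loc[zero_sem, "n_obs"].describe()
sem_ge_mean_summary = toy.loc[sem_ge_mean, "n_obs"].describe()

print(zero_sem_summary)
print(sem_ge_mean_summary)
```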

Since these estimator values are computed from low counts of raw expression values, it has been deemed acceptable to drop these values entirely: