Mismatch between cell metadata and expression matrix

rwollman commented 1 year ago

In the 20230830 release, there is a mismatch in the number of cells between the expression matrix and metadata for the Allen MERFISH data. Metadata has 3938808 cells, and the expression matrix has 4334174 cells.

Metadata was loaded with:

    rpath = metadata['cell_metadata']['files']['csv']['relative_path']
    file = os.path.join(download_base, rpath)
    cell = pd.read_csv(file, dtype={"cell_label": str})
    cell.shape

Expression was loaded with:

    download_base = '/orangedata/ExternalData/Allen_WMB_2023Sep05'
    filename = expression_matrices['C57BL6J-638850']['raw']['files']['h5ad']['relative_path']
    adata = anndata.read_h5ad(os.path.join(download_base, filename))
    adata.shape

Both of these numbers also differ from the 20230630 release, where the two datasets had the same number of cells (4330907).

If the cell numbers are not the same, the spatial data becomes useless, as you can't match cells to their xy positions. For example, I suspect that the notebooks merfish_tutorial_1, 2a, and 2b show inaccurate maps of gene expression due to this issue (depending on how the filtered cells are distributed across sections).

tmchartrand commented 1 year ago

I can't explain the number mismatch, but expect it's due to changes in some QC criteria - maybe @mkunst23 can? Just to note though, this is not an issue for using the remaining data as long as you join the anndata and metadata properly using the cell IDs.
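A minimal sketch of such a join, using small pandas tables as stand-ins for the real objects. The toy data, the `x`/`y` columns, and the helper name `align_by_cell_label` are all hypothetical; the sketch also assumes the expression matrix is indexed by the same IDs that appear in the metadata's `cell_label` column.

```python
import numpy as np
import pandas as pd


def align_by_cell_label(expr: pd.DataFrame, cell: pd.DataFrame):
    """Keep only cells present in both tables, in the expression table's order.

    `expr` is indexed by cell ID; `cell` has a `cell_label` column.
    (Hypothetical helper, not part of abc_atlas_access.)
    """
    cell = cell.set_index("cell_label")
    # IDs that appear in both tables, ordered as in the expression table
    common = expr.index[expr.index.isin(cell.index)]
    return expr.loc[common], cell.loc[common]


# Toy stand-ins: 5 cells in the expression matrix, 3 of them in the metadata,
# listed in a different order.
expr = pd.DataFrame(
    np.arange(10).reshape(5, 2),
    index=["c1", "c2", "c3", "c4", "c5"],
    columns=["geneA", "geneB"],
)
cell = pd.DataFrame(
    {
        "cell_label": ["c5", "c1", "c3"],
        "x": [0.5, 0.1, 0.3],
        "y": [5.0, 1.0, 3.0],
    }
)

expr_aligned, cell_aligned = align_by_cell_label(expr, cell)
# Rows now correspond one-to-one, so positions can be paired with expression.
assert expr_aligned.index.equals(cell_aligned.index)
```

With the real data the same idea would presumably apply to `adata.obs_names` and the `cell` DataFrame loaded above: subset both to the shared IDs before pairing expression with coordinates, rather than relying on row order.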

rwollman commented 1 year ago

Thanks, you are correct that I can avoid this with a proper merge. My bad and thanks for pointing this out.

mkunst23 commented 1 year ago

Yes, the 4334174 cells are before filtering out cells with low average correlation scores (<0.5) when mapped against the reference taxonomy.
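As a rough illustration of that filter, here is a sketch on toy data. The column name `average_correlation_score` and the toy values are assumptions for illustration only; the 0.5 cutoff and the two cell counts are the ones quoted in this thread.

```python
import pandas as pd

THRESHOLD = 0.5  # cells with a score below this are filtered out

# Toy table; the column name is assumed, not confirmed by this thread.
cells = pd.DataFrame(
    {
        "cell_label": ["c1", "c2", "c3", "c4"],
        "average_correlation_score": [0.90, 0.42, 0.50, 0.71],
    }
)

# Keep cells at or above the threshold (only scores < 0.5 are dropped).
kept = cells[cells["average_correlation_score"] >= THRESHOLD]

# Number of cells the filter accounts for in the 20230830 release,
# from the counts reported above.
n_removed_in_release = 4334174 - 3938808
print(len(kept), n_removed_in_release)
```

By the counts above, the filter removes 395366 cells, which is why the metadata is a strict subset of the expression matrix rather than a reordering of it.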

AllenInstitute / abc_atlas_access

Mismatch between cell metadata and expression matrix #28