Sharing compressed data

adamgayoso commented 2 years ago

Hello, this is such a cool project!

I was wondering if the compressed anndata objects could be shared on the website. For example, for the full dataset, saving like write_h5ad(path, compression="gzip") reduces the file size to ~5gb from 15gb. While it takes a bit longer to save with compression, reading is still pretty fast. I also noticed an issue with adata.obs["donor"] where it's mixed string and float types, so also saving it with adata.obs["donor"] = adata.obs["donor"].astype(str) would be appreciated.

We are working on faster implementations of scvi-tools using jax. In this notebook we can process 150k cells in <5 minutes on Colab. I was hoping to create a new tutorial with your dataset to show that we can process 900k cells in < 1 hr (integration + visualization, all for free!).

emdann commented 2 years ago

Hi @adamgayoso, thanks a lot for these suggestions! I will regenerate the datasets for the website and notify here as soon as they are up. The jax speed up sounds really cool, and we'd be super happy for our data to be featured in the tutorials of course! I'd be very keen to peak at the first results, and of course I'd be very happy to help with interpretation or writing up vignettes if needed.

adamgayoso commented 2 years ago

Thank you @emdann!! It would also be beneficial if the genes used to run the scvi model were included (in adata.var) so I can better reproduce the results. Also, what batch_key was used? I saw "bbk" I think in the notebooks on this repo.

emdann commented 2 years ago

The batch key used is a concatenation of method (10X protocol, 3' or 5') and donor (see under "Add batch key " here)

adamgayoso commented 2 years ago

The new version seems to give reasonably similar results to the old version which is good. I also noticed that Scanpy umap plotting of the celltype is not working for some reason.

emdann commented 2 years ago

The .h5ad objects for download should now be updated. Could I ask you to try again to check if the plotting problem persists?

adamgayoso commented 2 years ago

Yes looks great! I still have an issue plotting

bdata.obs["celltype"] = np.array(list(bdata.obs.celltype_annotation))
sc.pl.embedding(bdata, basis="X_mde", color=['celltype'], frameon=False)

Plotting celltype_annotation alone gives an error

Could it be there are too many categories for scanpy?

emdann commented 2 years ago

Notes from troubleshooting attempts:

Part of the problem could be the NaNs (https://github.com/theislab/scanpy/issues/2133): I found the maternal contaminants were not flagged correctly in this object, these are cells with adata.obs['celltype_annotation'] set to NaN. I will modify that in the file ASAP, but for now you can try filtering those out before plotting.

After filtering out nans I still get all gray, so it might indeed be a problem with scanpy trying to handle too many categories (and pandas update possibly?). Also setting groups throws a pandas error.

I usually plot annotations by lineage, using the assignment saved here. The best workaround I can suggest for now is trying something like:

import json
with open('Pan_fetal_immune/metadata/anno_groups.json', 'r') as json_file:
    anno_groups_dict = json.load(json_file)

adata.obs['annotation_plot'] = np.nan
lineage = 'B CELLS'
lineage_cells = adata.obs['celltype_annotation'].isin(anno_groups_dict[lineage])
adata.obs.loc[lineage_cells, 'annotation_plot'] = adata.obs.loc[lineage_cells, 'celltype_annotation'].copy()

sc.pl.umap(adata, color=['annotation_plot'], title=lineage)

Screenshot 2022-03-03 at 09 43 01

This tells me I should probably save the annotation groups in adata.obs...

Teichlab / Pan_fetal_immune

Sharing compressed data #48