Raw Data and H5AD Files download?

dtranWoix1 commented 1 year ago

First off, kudos on maintaining one of the most organized GitHub repositories I've come across!

I'm currently trying to recreate a workflow from your repository, and I'm working with the notebook 01a_data_import_and_preprocessing.ipynb. However, I've run into a bit of a roadblock as I am unable to locate some of the required data files:

1. raw_data_path.txt: Used to set dir_data in the notebook.

with open("../../../raw_data_path.txt", "r") as file:
    dir_data = file.readline().strip()

Is this the data from the cellxgene website? If so, is it the core or the full version?

2. Various H5AD files such as:

LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW.h5ad
LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW_subjfilt_ann.h5ad
LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW_filt_ann.h5ad

3. I tried downloading from multiple sources listed in the repository and the publication, but none of them had these files:

a. wget --user iGJBCYX8sMPL4n8 --password HLCA_results https://hmgubox2.helmholtz-muenchen.de/public.php/webdav/results.zip -O results.zip. This doesn't give me the data.
b. https://zenodo.org/records/7599104. This also doesn't give me the data.

The README advises submitting an issue for data-related queries, hence this message. I did attempt to download the data from cellxgene, but it doesn’t seem to match what’s required for the notebook.

Could you point me towards where I might be able to download these specific datasets? I'd also appreciate any guidance on how to locate data for subsequent steps without having to trouble you each time.

Thanks in advance for your help!

LisaSikkema commented 1 year ago

Thanks @dtranWoix1 !

The file on cellxgene is the very final version of the atlas, after cleaning of variable names, removal of non-important metadata columns, harmonising of gene names etc. It is not the file that I was working with in the notebooks that you're looking at. Those notebooks use different intermediate files that have gone through specific preprocessing steps (the preprocessing you can also find in the notebooks), and they are not as clean as the final file. As there are so many intermediate files (and they're quite large), I haven't uploaded them all but am happy to share any that you want to work with.

Would you just want the three h5ads you listed, or only a subset of them, or more?

As to the results.zip file: this should contain outputs of the analyses, but no single-cell data. Did you manage to download the file, or did that not work at all?
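To check what the downloaded archive actually holds, a minimal sketch along these lines could help. It only uses the Python standard library; the helper names `summarize_zip` and `has_h5ad` are made up for illustration and are not part of the repository.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_zip(zip_path):
    """Count archive members by file extension (directory entries are skipped)."""
    with zipfile.ZipFile(zip_path) as archive:
        names = [n for n in archive.namelist() if not n.endswith("/")]
    # Members without an extension are grouped under "<none>".
    return Counter(PurePosixPath(n).suffix.lower() or "<none>" for n in names)


def has_h5ad(zip_path):
    """Return True if any member of the archive is an .h5ad file."""
    return summarize_zip(zip_path)[".h5ad"] > 0
```

Calling `summarize_zip("results.zip")` gives a per-extension count, and `has_h5ad("results.zip")` tells you quickly whether any single-cell data files are in there.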

The zenodo link leads to the scANVI data integration model and the scArches model extensions, and related files, but does not have any large data files.

dtranWoix1 commented 1 year ago

Hi @LisaSikkema!

I hope this message finds you well.

Thank you very much for your quick and comprehensive response!

I am currently working through the 01a_data_import_and_preprocessing.ipynb notebook, and I have identified a few specific H5AD files that I need to proceed:

LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW.h5ad
LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW_subjfilt_ann.h5ad
LCA_Bano_Barb_Jain_Kras_Lafy_Meye_Mish_MishBud_Nawi_Seib_Teic_RAW_filt_ann.h5ad

Could you please provide access to these files? For now, I don't think I will need a subset of them or anything more.

Additionally, I was unable to find the raw_data_path.txt file, which is referenced in the notebook for setting up the data directory. If you have it available, could you also share it?

On another note, I wanted to confirm that the download and extraction of the results.zip file were successful, and all contents are in perfect order.

Additionally, to streamline our communication, could you please provide an email address where I might contact you directly, should that be more convenient for you?

LisaSikkema commented 1 year ago

The raw_data_path.txt file is just a txt file with a single line containing a path to a directory on our cluster, so I don't think it is of any use to you! It leads to a folder with all the raw data of the individual datasets.
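If you want the notebook cell to run unchanged on your own machine, one option is to create your own one-line file. This is a rough sketch, assuming you have the raw data in some local directory; the helper names and any path you pass in are hypothetical, not something shipped with the repository.

```python
from pathlib import Path


def write_raw_data_path(txt_file, data_dir):
    """Write a one-line text file holding the path to the raw data directory."""
    txt_file = Path(txt_file)
    txt_file.write_text(str(data_dir) + "\n")
    return txt_file


def read_raw_data_path(txt_file):
    """Read the path back the same way the notebook cell does."""
    with open(txt_file, "r") as file:
        return file.readline().strip()
```

For example, `write_raw_data_path("raw_data_path.txt", "/path/to/your/raw_data")` creates the file. The notebook reads it from "../../../raw_data_path.txt", so it has to sit three directories above the notebook.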

As for the three h5ads you asked for: the 2nd and 3rd are produced from the first one in the notebook you mentioned, so I don't know if it's really useful for you to get all three of them. They're about 9 GB each.
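Given the sizes, it may be worth checking free disk space before downloading. A small sketch, assuming three files of roughly 9 GB each plus some headroom; the `enough_space` helper and the 1.2 safety margin are illustrative choices only.

```python
import shutil


def enough_space(directory, n_files=3, gb_per_file=9, margin=1.2):
    """Check whether `directory` has room for the files, with some headroom.

    Returns (ok, needed_bytes, free_bytes). Sizes use 1 GB = 1e9 bytes.
    """
    needed_bytes = int(n_files * gb_per_file * margin * 1e9)
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= needed_bytes, needed_bytes, free_bytes
```

Running `enough_space(".")` in the target download directory returns whether the space is sufficient, along with the required and available byte counts.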

You can just email me (Lisa.sikkema@helmholtz-munich.de) and let me know if you really want all three of them, then I'll upload them to our file sharing drive.

LungCellAtlas / HLCA_reproducibility

Raw Data and H5AD Files download? #12