ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Flat metadata sheets for Kidney bionetwork #1201

Open idazucchi opened 11 months ago

idazucchi commented 11 months ago

Description of the task: We agreed to provide flattened metadata sheets for the included project list of the Kidney bionetwork - see here

The eligible projects ultimately are:

- 1 Lake et al. https://www.biorxiv.org/content/10.1101/2021.07.28.454201v1.full (Lattice)
- 14 Arazi et al. https://www.nature.com/articles/s41590-019-0398-x
- 16 Der et al. https://www.nature.com/articles/s41590-019-0386-1
- 36 Der et al. https://insight.jci.org/articles/view/93009/pdf
- 43 Yu et al. https://www.frontiersin.org/articles/10.3389/fmed.2022.869284/full#h7
- 47 McEvoy et al. https://www.nature.com/articles/s41467-022-35297-z
- 17 Zheng et al. https://www.sciencedirect.com/science/article/pii/S221112472031514X?via%3Dihub
- 45 Abedini et al. https://doi.org/10.1101/2022.10.24.513598
- 23 Young et al. https://www.science.org/doi/10.1126/science.aat1699
- 35 Chu et al. https://www.frontiersin.org/articles/10.3389/fonc.2021.719564/full#h3
- 30 Obradovich et al. https://www.cell.com/action/showPdf?pii=S0092-8674%2821%2900573-0
- 2 Stewart et al. https://www.science.org/doi/10.1126/science.aat5031
- 20 Tabula Sapiens https://www.science.org/doi/10.1126/science.abl4896
- 40 Borcherding et al. https://www.nature.com/articles/s42003-020-01625-6
- 18 Tang et al. https://www.frontiersin.org/articles/10.3389/fimmu.2021.645988/full
- 21 Han et al. https://www.nature.com/articles/s41586-020-2157-4

A few more projects might be included based on conversation with Peng. To add:

Acceptance criteria for the task:

idazucchi commented 11 months ago

possible additional projects:

- 9: stalled due to missing linking information
- 32: No data, no metadata, yes publication, no reply
- 33: No data, no metadata, yes publication, no reply
- 37: stalled due to missing metadata information
- 39: No data, no metadata, yes publication, no reply
- 42: No data, no metadata, yes publication, no reply

idazucchi commented 11 months ago

Priority: only after lung is done and Peng's ArrayExpress submission

Action: communicate with Peng to confirm list and prototype of the metadata file @arschat

idazucchi commented 11 months ago

Arsenios to prepare a list of projects that can easily be flattened and communicate that to Peng

arschat commented 11 months ago

Based on the list of analysis files each project has, here is the availability for extracting the cell barcodes of each of the kidney datasets.

| Availability | Number of projects |
| --- | --- |
| yes | 22 |
| no | 1 |
| no metadata available | 8 |
| lattice | 2 |

For each project, I have downloaded the analysis file that contains the barcode information & sample ID, and verified that the provided sample_ID and the cell_suspension are either identical or can be mapped to each other.

Specifically, across all kidney projects:

| Number | Project | In Core Kidney List | Ingest Status | Short Name | UUID | Extract |
| --- | --- | --- | --- | --- | --- | --- |
| 37 | Kuppe et al. | Included | key metadata missing | Kramann-Human-Smartseq2 | 128952c1-1906-4746-b4dd-6a10d1ff52d0 | no metadata available |
| 1 | Lake et al. | Included | Published in DCP | NichesHumanKidney | | lattice |
| 19 | Krebs et al. | Core data | Published in DCP | PathogenInducedResidentMemory | dc0b65b0-7713-46f0-a339-0b03ea786046 | yes |
| 9 | He et al. | Included | key metadata missing | Patrakka-Human-Smartseq2 | 662157e4-ba53-4766-975a-ac11920f153e | no metadata available |
| 14 | Arazi et al. | Included | Published in DCP | Hacohen-Human-CELseq2 | 2d559a6e-7cd9-432f-9f6e-0e4df03b0888 | yes |
| 16 | Der et al. | Included | In progress | TubularCellLupusNephritis | 97fca723-d9e9-4263-9f67-335416086f47 | no metadata available |
| 36 | Der et al. | Included | Published in DCP | Der-Human-LupusNephritis-Nextera-C1 | 4627f43e-a43f-44dd-8c4b-7efddb3f296d | yes |
| 43 | Yu et al. | Included | Published in DCP | Xiao-Human-RNAscope | 5f44a860-d96e-4a99-b67e-24e1b8ccfd26 | yes |
| 38 | *Menon et al. | Core data | Published in DCP | Menon-Human-FSG-10x3 | 29b54165-34ee-4da5-b257-b4c1f7343656 | yes |
| 47 | McEvoy et al. | Included | Published in DCP | KidneySexBasedTranscriptome | 77c13c40-a598-4036-807f-be09209ec2dd | yes |
| 33 | Cowman et al. | Included | No data/ no metadata | MacrophagePrognosticIndicator | 0aeaaab8-3e48-4877-a244-70d0dedc66cd | no metadata available |
| 17 | Zheng et al. | Included | Published in DCP | IgANephropathySTRT | 2caedc30-c816-4b99-a237-b9f3b458c8e5 | yes |
| 32 | Chen et al. | Included | No data/ metadata | SurveyHumanGlomerulonephritis | 0057c36c-06ce-4cdf-bff4-533ad13f090c | no metadata available |
| 39 | Meng et al. | Included | No data/ metadata | Ma-Human-10xtechnology | 1bef1065-6e7d-4235-8a8d-535717d8d1e1 | no metadata available |
| 42 | Huang et al. | Included | No data/ metadata | ? | eeff6c81-f29f-4e54-b33f-3c825b605d42 | no metadata available |
| 44 | Zhao et al. | Included | No data/ metadata | Wu-Human-10x3pv2 | fa9f9bf1-62d6-4db1-9d36-8cef8806d6bf | no metadata available |
| 45 | Abedini et al. | Included | Published in DCP | KidneyFibroticMicroenvironment | e925633f-abd9-486a-81c6-1a6a66891d23 | yes |
| 23 | Young et al. | Included | Published in DCP | Haniffa-Human-10x3pv2 | d8ae869c-39c2-4cdd-b3fc-2d0d8f60e7b8 | yes |
| 41 | Suryawanshi et al. | Core data | Published in DCP | SuryawanshiKidneyAllografts | 6e522b93-9b70-4f0c-9990-b9cff721251b | yes |
| 34 | *Malone et al. | Core data | Published in DCP | ChimerismKidneyTransplantReject | 4ef86852-aca0-4a91-8522-9968e0e54dbe | yes |
| 35 | Chu et al. | Included | Published in DCP | Cheng-Human-10x3pv3 | ee166275-f63a-4864-8155-4df86c9de679 | yes |
| 30 | Obradovich et al. | Included | Published in DCP | Califano-Human-10x3pv2 | 95d058bc-9cec-4c88-8d2c-05b4a45bf24f | yes |
| 31 | Krishna et al. | Core data | Published in DCP | ImmuneLandscapeccRCC | 12f32054-8f18-4dae-8959-bfce7e3108e7 | yes |
| 2 | Stewart et al. | Included | Published in DCP | KidneySingleCellAtlas | abe1a013-af7a-45ed-8c26-f3793c24a1f4 | yes |
| 3 | Liao et al. | Core data | Published in DCP | HumanAdultKidneyLiaoMo | 2ef3655a-973d-4d69-9b41-21fa4041eed7 | no |
| 6 | Wilson et al. | Core data | Published in DCP | Diabetic Nephropathy snRNA-seq | 577c946d-6de5-4b55-a854-cd3fde40bff2 | yes |
| 20 | Tabula Sapiens | Included | Published in DCP | tabulaSapiens | 10201832-7c73-4033-9b65-3ef13d81656a | yes |
| 22 | Muto et al. | Core data | Published in DCP | | | lattice |
| 24 | Wu et al. | Core data | Published in DCP | GSE118184KidneyOrganoid | 16ed4ad8-7319-46b2-8859-6fe1c1d73a82 | yes |
| 40 | Borcherding et al. | Included | Published in DCP | ImmuneRenalCarcinoma | 955dfc2c-a8c6-4d04-aa4d-907610545d11 | yes |
| 18 | Tang et al. | Included | Published in DCP | Tang-Human-FluidigmC1basedlibrarypreparation | c5b475f2-76b3-4a8e-8465-f3b69828fec3 | yes |
| 21 | Han et al. | Included | Published in DCP | HumanCellLandscape | 1fac187b-1c3f-41c4-b6b6-6a9a8c0489d1 | yes |
| 25 | Zhang et al. | Core data | Published in DCP | RenalTumorMicroenvironment | 7c599029-7a3c-4b5c-8e79-e72c9a9a65fe | yes |

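
For anyone repeating the sample_ID to cell_suspension check described in this comment, it could look roughly like the sketch below. This is not the script that was actually used: the function name, the optional manual mapping and the example IDs are all assumptions made for illustration.

```python
def check_sample_mapping(analysis_sample_ids, cell_suspension_ids, manual_map=None):
    """Return the analysis-file sample IDs that cannot be matched to a cell suspension.

    manual_map is a hypothetical dict for IDs that differ between the
    contributor's analysis file and the DCP cell_suspension IDs.
    """
    manual_map = manual_map or {}
    known = set(cell_suspension_ids)
    unmapped = []
    for sample_id in analysis_sample_ids:
        # Use the manual mapping when one exists, otherwise expect identical IDs
        resolved = manual_map.get(sample_id, sample_id)
        if resolved not in known:
            unmapped.append(sample_id)
    return unmapped


# Hypothetical IDs, for illustration only
analysis_ids = ["donor1_kidney", "donor2_kidney", "S3"]
suspensions = ["donor1_kidney", "donor2_kidney", "donor3_kidney"]

print(check_sample_mapping(analysis_ids, suspensions))
print(check_sample_mapping(analysis_ids, suspensions, {"S3": "donor3_kidney"}))
```

An empty result would correspond to "yes" in the extract column; leftover IDs would need a manual mapping or more information from the contributor.
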
arschat commented 11 months ago

Next action: draft an email to Peng that

- shows the list of projects available for extraction
- asks whether a merged csv file, with one cell per row, the project name in one column and all desired metadata in the other columns, works for them
- asks how they would like to name the cell_names where we only have the barcode (one analysis file per CS)

If they have generated all count matrices from fastq files and did not extract information from the contributor matrices, this would be very important.
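
As a rough illustration of the merged csv proposed above (one cell per row, project name in one column, metadata in the rest), here is a minimal sketch using only the standard library. The column names, the example project name and the metadata fields are assumptions, not the agreed layout.

```python
import csv
import io


def flatten_project(project_name, cs_metadata, cells):
    """Yield one dict per cell: project, cell suspension, barcode, then CS-level metadata.

    cs_metadata: dict of cell suspension ID -> dict of metadata fields
    cells: iterable of (cell_suspension_id, barcode) pairs
    """
    for cs_id, barcode in cells:
        row = {"project": project_name, "cell_suspension": cs_id, "barcode": barcode}
        row.update(cs_metadata[cs_id])
        yield row


def write_flat_csv(rows, handle):
    """Write rows to handle, using the union of all columns in first-seen order."""
    rows = list(rows)
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    # restval fills columns that a given project does not provide
    writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical project, metadata and barcodes
cs_metadata = {"CS1": {"organ": "kidney", "library_method": "10x"}}
cells = [("CS1", "AAACCTGAGAAACCAT"), ("CS1", "AAACCTGAGAAACCGC")]

buffer = io.StringIO()
write_flat_csv(flatten_project("ExampleProject", cs_metadata, cells), buffer)
print(buffer.getvalue())
```

Taking the union of columns means several projects with different metadata fields can be concatenated into one file, with missing values left empty.
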

arschat commented 10 months ago

Different templates for the meeting here

arschat commented 10 months ago

Here is a folder with some examples of exporting the metadata.

arschat commented 9 months ago

At the meeting on 21 Dec 23, Peng asked us if we could provide h5ads with the raw counts and all DCP metadata in the obs. The flat csv file works for them too, but Peng prefers the ready-to-integrate h5ad with the metadata in the obs. Peng also updated us on the stage of the integration efforts: the rich metadata will only be needed in later stages, so we can treat this as low priority.

The action items decided were:

arschat commented 9 months ago

After investigating how many datasets have merged anndata/Seurat analysis files, the following stats came up (spreadsheet).

| Analysis files | Count |
| --- | --- |
| Unmerged | 9 |
| Semi-merged | 5 |
| Merged | 4 |
| No Analysis Files | 5 |

- Unmerged -> 1 CS per file
- Semi-merged -> multiple CS per file, but not all CS in one file
- Merged -> all CS in 1 file

Some datasets did not have analysis files, although we could provide the metadata at the CS level for all datasets. This includes HumanAdultKidneyLiaoMo, which was previously tagged as unavailable: it has been wrangled with pooled analysis files, although GEO and the paper mention a direct sample-to-file mapping.
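
The three categories above can be expressed as a small classifier. This is only a sketch of the definitions, not the method used to fill the spreadsheet: the input structure is an assumption, and so is the choice to treat any layout that is neither merged nor strictly one CS per file as semi-merged.

```python
def classify_analysis_files(file_to_suspensions, all_suspensions):
    """Classify a project by how its analysis files group cell suspensions (CS).

    file_to_suspensions: dict of analysis file name -> set of CS IDs found in it
    all_suspensions: every CS ID in the project
    """
    if not file_to_suspensions:
        return "No Analysis Files"
    everything = set(all_suspensions)
    groups = [set(cs_ids) for cs_ids in file_to_suspensions.values()]
    # Checked first, so a project with a single CS in a single file counts as merged
    if any(group == everything for group in groups):
        return "Merged"
    if all(len(group) == 1 for group in groups):
        return "Unmerged"
    # Assumption: every remaining layout has at least one multi-CS file
    return "Semi-merged"


# Hypothetical file names and CS IDs
print(classify_analysis_files({"a.h5ad": {"CS1"}, "b.h5ad": {"CS2"}}, ["CS1", "CS2"]))
print(classify_analysis_files({"a.h5ad": {"CS1", "CS2"}, "b.h5ad": {"CS3"}}, ["CS1", "CS2", "CS3"]))
print(classify_analysis_files({"all.h5ad": {"CS1", "CS2"}}, ["CS1", "CS2"]))
```

Checking the merged case first matches how a single-CS project such as Xiao-Human-RNAscope ends up in the merged group.
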

All 4 merged datasets now have a csv file that combines the contributor metadata, all DCP metadata and the cell barcode. I have uploaded all of them to the drive folder mentioned before (some files are very big, so Google Sheets might take a while to open them). Xiao-Human-RNAscope has only 1 CS in the entire project, so we will not share it as an example with the kidney integration team.

- Haniffa-Human-10x3pv2
- ImmuneLandscapeccRCC
- Xiao-Human-RNAscope
- IgANephropathySTRT

idazucchi commented 9 months ago

- first 3 flat files sent to Peng; he asked if we can merge metadata + raw cell counts and merge multiple analysis files
- clarify if Peng is interested in unmerged flat files

arschat commented 8 months ago

Peng replied that they are interested in the flat csv per analysis file, and that they would like to add a bare barcode column too. Peng said that he is leaving by the end of March, so a deadline in the middle of March might be reasonable. Need to discuss with Gabs.
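
One possible way to derive the bare barcode column requested above is sketched here. The cell name formats are an assumption (a sample prefix followed by a nucleotide barcode and an optional "-1" style suffix, as in typical 10x output); contributor files may use other conventions that need their own handling.

```python
import re

# Assumption: a barcode is a run of at least 8 A/C/G/T characters at the end of
# the cell name, optionally followed by a "-<digits>" suffix
BARCODE_PATTERN = re.compile(r"([ACGT]{8,})(?:-\d+)?$")


def bare_barcode(cell_name):
    """Return the nucleotide barcode without prefix or suffix, or None if not found."""
    match = BARCODE_PATTERN.search(cell_name)
    return match.group(1) if match else None


# Hypothetical cell names
for name in ["sampleA_AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCAT", "cell_42"]:
    print(name, bare_barcode(name))
```

Names that return None would have to be checked by hand rather than silently dropped.
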

arschat commented 7 months ago

Flat metadata at the sample level, for all datasets that have analysis files and a spreadsheet in DCP, has been deposited in this folder.

Next steps:

arschat commented 7 months ago

Ticket deprioritized in favour of #1256

arschat commented 6 months ago

What is done:

However, Peng later asked in #1256 for Tier 1 metadata at the CS level instead of flat files at the cell barcode level. This experiment is complete. The difference between the two tasks is the merge of the flat_CS metadata with the barcodes; although we discussed internally having that as an option, we currently have no request for it.

This ticket can now be closed.