ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE138669 - SkinSystemicSclerosis #859

Open idazucchi opened 2 years ago

idazucchi commented 2 years ago

Project short name:

SkinSystemicSclerosis

Primary Wrangler:

Ida

Secondary Wrangler:

Ami

Associated files

Published study links

Key Events

idazucchi commented 2 years ago

I wrote to the authors (thread) to get an accession for the published data.

GSE138669:

- Contains the 6 healthy samples from this publication
- Contains 4 additional healthy samples and 12 diffuse cutaneous systemic sclerosis

--> all samples are part of a newer publication

It makes more sense to wrangle just the newer publication with the accessioned data since they come from the same lab --> I'm working on that

I will let Maria know about the additional 4 healthy skin samples, since they are most interested in healthy samples.

idazucchi commented 2 years ago

Bam-to-fastq

The files are in BAM format, so I've converted them using the bamtofastq tool on the EC2. Some things don't look right:

@E00440:237:HM3CLCCXY:7:1112:6258:17008 2:N:0:0

+

@E00440:237:HM3CLCCXY:7:1208:1763:16147 2:N:0:0



I'm not sure the issues with the fastq files can be solved in time for the release
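If the problem is reads with an empty sequence line (the excerpt above may be showing this, but the thread does not say so explicitly), a quick scan of the converted files can quantify it. This is only a sketch: it assumes plain 4-line FASTQ records, and `find_empty_reads` is a hypothetical helper, not part of bamtofastq.

```python
from typing import Iterable, List


def find_empty_reads(lines: Iterable[str]) -> List[str]:
    """Return the header of every FASTQ record whose sequence line is empty.

    Assumes strict 4-line records: header, sequence, '+' separator, quality.
    """
    empty = []
    record = []
    for raw in lines:
        record.append(raw.rstrip("\n"))
        if len(record) == 4:
            header, sequence, separator, _quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record near {header!r}")
            if sequence == "":
                empty.append(header)
            record = []
    if record:
        raise ValueError("input ends in the middle of a record")
    return empty
```

For a gzipped file, the function can be fed an open text handle, e.g. `gzip.open(path, "rt")` (the path is whatever file is being checked).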

MightyAx commented 2 years ago

@ami-day has some prior experience with this problem.

idazucchi commented 2 years ago

Analysis files

There is one h5 file per donor, but the contents of those 22 h5 files are identical, so I think it might be one integrated analysis file for all donors. There is not enough time to confirm this with the authors, and if I submit the files now it could prove difficult to delete the wrong files from the DCP.

I'll contact the authors again and hopefully get h5ad files with more metadata as well
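One way to check whether the h5 files really are byte-for-byte identical is to group them by checksum. The sketch below is a minimal illustration using only the standard library; the function names are hypothetical. Identical checksums would confirm duplicate files, while different checksums would not rule out files that merely share the same dimensions.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large h5 files are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths: Iterable[Path]) -> List[List[Path]]:
    """Return groups of two or more files that share the same SHA-256 digest."""
    by_digest: Dict[str, List[Path]] = defaultdict(list)
    for path in paths:
        by_digest[sha256_of(path)].append(path)
    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

For example, `group_identical(Path("downloads").glob("*.h5"))` would list the duplicate groups, where `downloads` stands for wherever the GEO files were unpacked.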

idazucchi commented 2 years ago

Graph valid!

I've omitted the 3 empty I1 files and the analysis files/protocol

ami-day commented 2 years ago

This looks really good to me, nice work!

I can't see any issues. The only thing I'm not sure about is why the Analysis Protocol and File tabs were not included, given that the files are available in GEO (e.g. project level: GSE138669_RAW.tar and sample level: GSM4115868_SC1raw_feature_bc_matrix.h5).

idazucchi commented 2 years ago

I'm exporting the dataset

Analysis files

This will need to be updated to add the analysis protocol and files. I'll do it as soon as I can confirm that the h5 files are different for each donor. Both the size (737’280 cells and 33’538 genes) and the fact that all files have the same dimensions are suspicious. The total number of cells reported in the paper is 65'199; even taking quality checks into account, it still looks like too many cells.
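A possible explanation, not confirmed by the authors: 737,280 equals the number of barcodes in the 10x Genomics "737K" cell-barcode whitelist, and the GEO file names mentioned earlier contain `raw_feature_bc_matrix`. An unfiltered matrix would therefore have the same dimensions for every sample even when the counts differ. The sketch below only does the arithmetic on a set of reported shapes. The helper name, the `(genes, barcodes)` ordering and the whitelist constant are assumptions made for illustration.

```python
from typing import Dict, Tuple

# Assumption: size of the 10x Genomics "737K" cell-barcode whitelist.
WHITELIST_737K = 737_280


def summarize_shapes(
    shapes: Dict[str, Tuple[int, int]], expected_cells: int
) -> Dict[str, object]:
    """Summarize (genes, barcodes) shapes against a published cell count."""
    barcode_counts = {n_barcodes for _genes, n_barcodes in shapes.values()}
    total = sum(n_barcodes for _genes, n_barcodes in shapes.values())
    return {
        "all_same_shape": len(set(shapes.values())) == 1,
        # True when every matrix has exactly one column per whitelist barcode.
        "looks_unfiltered": barcode_counts == {WHITELIST_737K},
        "total_barcodes": total,
        "ratio_to_expected": total / expected_cells,
    }


shapes = {f"donor_{i}": (33_538, 737_280) for i in range(1, 23)}
summary = summarize_shapes(shapes, expected_cells=65_199)
```

If `looks_unfiltered` is true, comparing the number of barcodes with non-zero counts per file would be a more telling check than comparing shapes.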

idazucchi commented 2 years ago

Verified in the data browser!

gabsie commented 2 years ago

There's a problem with spreadsheet generation for this dataset (it gets stuck indefinitely). @idazucchi will also create a bug ticket for Dev.

idazucchi commented 2 years ago

I've added a new submission (spreadsheet here) with the analysis files. The linking to the existing entities was done using the uuids and checked manually because the spreadsheet cannot be generated.

I'm now exporting

arschat commented 3 months ago

There are 3 publications referencing the same data.

  1. https://doi.org/10.1016/j.jid.2017.09.045
  2. https://doi.org/10.1002/art.41813
  3. https://doi.org/10.1038/s41467-021-24607-6

We started from 1, then authors pointed us to 2, and now bionetwork references 3 in their list (through the data portal tracker).

Publications 2 and 3 reference the same GEO accession GSE138669 and the same donor_ids in the figures. It's a different analysis of the same data.

For consistency with other components, the project title in the tracker is "Myofibroblast transcriptome indicates SFRP2hi fibroblast progenitors in systemic sclerosis skin". Publication 3 has been added to the project, but the title has not been changed.

PS: I've already added the data_use_restriction field and bumped the project version for this project.

idazucchi commented 4 days ago

@arschat did you also export the changes?

arschat commented 4 days ago

No, I've not exported the changes. I asked Dave about this but did not get a reply on that question specifically.

10.1038/s41467-021-24607-6 7f351a4c-d24c-4fcd-9040-f79071b097d0

Both publications point to the same GEO accession (GSE138669) and reference the same donor IDs. GEO also points back to the second publication (10.1002/art.41813), and that's why we decided to mention this publication in the project. The difference between the two studies is the analysis that was done, along with non-sequencing experiments, but the authors did not share any integrated objects that might differ between the two analyses (only fastq & raw count matrices). We could add the first publication (10.1038/s41467-021-24607-6) as well, but let me know if you would like us to change the title as well.

arschat commented 3 days ago

@arschat change title to be consistent with tracker. export project only

arschat commented 3 days ago

Ready to export metadata only @idazucchi

idazucchi commented 2 days ago

project metadata exported!

from the GCP bucket

4500  2023-07-27T05:56:35Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/7f351a4c-d24c-4fcd-9040-f79071b097d0/metadata/project/7f351a4c-d24c-4fcd-9040-f79071b097d0_2022-08-25T14:22:53.860000Z.json
4589  2024-09-27T16:04:11Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/7f351a4c-d24c-4fcd-9040-f79071b097d0/metadata/project/7f351a4c-d24c-4fcd-9040-f79071b097d0_2024-09-26T14:30:26.573000Z.json
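The listing above can be checked programmatically to confirm that the newest project metadata version is the one just exported. This is a minimal sketch that assumes each line has the form "size, upload time, URL" and that file names follow `<uuid>_<version timestamp>.json`, as they do in the two lines shown. `parse_listing` is a hypothetical helper.

```python
from datetime import datetime
from typing import List, NamedTuple


class Listing(NamedTuple):
    size: int
    uploaded: datetime
    version: datetime
    url: str


def parse_listing(text: str) -> List[Listing]:
    """Parse 'size  upload-time  gs://...' lines into structured entries."""
    entries = []
    for line in text.strip().splitlines():
        size, uploaded, url = line.split()
        # File names look like <uuid>_<version timestamp>.json
        stem = url.rsplit("/", 1)[-1][: -len(".json")]
        version = stem.split("_", 1)[1]
        entries.append(
            Listing(
                int(size),
                datetime.strptime(uploaded, "%Y-%m-%dT%H:%M:%SZ"),
                datetime.strptime(version, "%Y-%m-%dT%H:%M:%S.%fZ"),
                url,
            )
        )
    return entries


listing = """
4500  2023-07-27T05:56:35Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/7f351a4c-d24c-4fcd-9040-f79071b097d0/metadata/project/7f351a4c-d24c-4fcd-9040-f79071b097d0_2022-08-25T14:22:53.860000Z.json
4589  2024-09-27T16:04:11Z  gs://broad-dsp-monster-hca-prod-ebi-storage/prod/7f351a4c-d24c-4fcd-9040-f79071b097d0/metadata/project/7f351a4c-d24c-4fcd-9040-f79071b097d0_2024-09-26T14:30:26.573000Z.json
"""
latest = max(parse_listing(listing), key=lambda entry: entry.version)
```

Here `latest` is the entry with the most recent version timestamp in the file name, which may differ from the most recent upload time.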

import form filled out