ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE152543_OleicAcidMultipleSclerosis #247

Closed ami-day closed 3 years ago

ami-day commented 3 years ago

Dataset/group this task is for:

GSE152543_OleicAcidMultipleSclerosis https://docs.google.com/spreadsheets/d/1_nM-tL6JyLTzhUKY9IxKWVRhPSofuBjl/edit#gid=1600387747

Wrangler responsible for this dataset/lab:

Primary: Ami Secondary: Wei

Acceptance criteria for the task:

ami-day commented 3 years ago

Need to add bulk samples metadata.

ami-day commented 3 years ago

Stalled as ingest can't currently accept spreadsheets with this many sequence file names.

lauraclarke commented 3 years ago

@clairerye are you aware of this? Is this still a problem?

clairerye commented 3 years ago

I am not aware of this. Is it the file names or the number of files? @ami-day @aaclan-ebi do you know if we limit this somewhere? Is it possible to do it as two submissions or is that a terrible idea?

aaclan-ebi commented 3 years ago

Ah, this might be the limit on the number of rows set in the spreadsheet importer. We should be able to increase that. @ami-day, how many sequencing files were there?

ami-day commented 3 years ago

@clairerye and @lauraclarke I remember making a ticket for this in the painpoints column. It is the number of files (or rows) in the spreadsheet. It will fail to import if there are too many rows in the sequence file tab (where each row is one fastq file name).

ami-day commented 3 years ago

@aaclan-ebi there are approx. 23,500 rows! Could we increase it to 50,000 in case this happens again with a larger dataset?

ami-day commented 3 years ago

This is now unblocked as Jacob has made the necessary changes to ingest. It is showing as a valid spreadsheet in ingest staging (https://staging.contribute.data.humancellatlas.org/submissions/detail?id=60702e7bed34714563004d7a) and is ready for 2ndary review.

ami-day commented 3 years ago

I think this won't make the April release because 1. it needs to be 2ndary reviewed and 2. there are more than 23,500 fastq files that need to be uploaded, which I am guessing is going to take a while.

lauraclarke commented 3 years ago

I didn't think this was targeting release 5; it isn't associated with a milestone.

ami-day commented 3 years ago

Requesting the data files via NCBI cloud delivery. This actually significantly reduces the number of separate fastq files, as they have grouped all the run accessions for each experiment into 1 experiment accession (in ENA there is a fastq file for each run and many runs per experiment; there are 2931 experiments in total). If someone has the capacity to review this dataset by early next week, it might be good to add it to release 5 if we can.

Wkt8 commented 3 years ago

Have performed the secondary review, but was unable to complete the 'sequence files present in the s3 bucket' part of the secondary review. The sequence files tab is also still waiting for file names, as Ami has mentioned above.

Project Tab:
Project Title should be the published paper title, not the preprint title? 'Oleic acid restores suppressive defects in tissue-resident FOXP3 Tregs from patients with multiple sclerosis'

Collection Protocol Tab:
Collection Method Ontology ID: EFO:0009121, 'blood draw'
I couldn't find an appropriate ontology ID in HCAO for 'aspirating adipose tissue', so general collection probably works, unless we want to request that ontology term.

Specimen from organism tab:
Genus Species Ontology ID
Organ Ontology ID - I'm not sure if 'adipose tissue' is the right term for the 'organ'. Looking at the hierarchy of 'adipose tissue' in the HCAO I would put 'Connective Tissue' or some other term, and 'adipose tissue' for the organ part - but this shouldn't block the release.

Enrichment Protocol Tab:
Enrichment Method Ontology ID:
EFO:0009112, 'density gradient centrifugation'
EFO:0009109, 'magnetic affinity cell sorting'
EFO:0009108, 'fluorescence-activated cell sorting'

Library Preparation Protocol Tab:
Maybe move the 'RNeasy Micro Kit (QIAGEN)' from the library_protocol_bulk description to the nucleic acid conversion kit column?

Sequence File Tab:
Content Description Ontology ID: data:3494, 'DNA sequence'

Apart from this, it looks great! Nice job separating the enrichment protocols, it's a large dataset that looks very interesting re: MS.

ami-day commented 3 years ago

Thanks @Wkt8. NCBI say they have completed my request to transfer around 30,000 fastq files!! So will work on this now.

ami-day commented 3 years ago

uploading the fastq to an hca-util upload area

ami-day commented 3 years ago

fastqs are validating in ingest

ami-day commented 3 years ago

Error syncing 2 files: https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/316

ami-day commented 3 years ago

syncing files to ingest

ami-day commented 3 years ago

submitted the project in ingest

ami-day commented 3 years ago

This has been exported on 24052021.

clairerye commented 3 years ago

Thanks, could you please delete the duplicate of this project in ingest? There are currently 3 versions, which is very confusing.

ami-day commented 3 years ago

This has been approved by Anja.