ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Contributor dataset: DopaminergicNeuronDifferentiation #265

Open ami-day opened 3 years ago

ami-day commented 3 years ago

Dataset/group this task is for:

Oliver Stegle's group, DopaminergicNeuronDifferentiation

Latest sheet: https://docs.google.com/spreadsheets/d/1Eblt1hHBAwk84BFiuTzbANGcJUYdtj5k/edit#gid=56358489

Folder: https://drive.google.com/drive/folders/1kRNGIsBsIviLHEPv1ipkLG0lrM8TLLbL

https://contribute.data.humancellatlas.org/submissions/detail?id=60a253c5901b6d17e5f3a4f0

Wrangler responsible for this dataset/lab:

Since this is a contributor dataset, some contributor and/or publication metadata might be missing. I have yet to send it back to the contributors to check that it all looks OK to them; I want to get secondary review first.

Description of the task:

lauraclarke commented 3 years ago

@ami-day this didn't have the dataset tag so I don't think anyone had seen it needed secondary review

ami-day commented 3 years ago

> @ami-day this didn't have the dataset tag so I don't think anyone had seen it needed secondary review

Ohhh, I see. I posted it on the Slack channel, but I should have labelled it too. I think I'll get back to them before review, as it's been a while now and I don't think the review will be particularly speedy, with bits of information in different places.

lauraclarke commented 3 years ago

@Wkt8 @ESapenaVentura can either of you take on this for secondary review?

Actually, ignore me, I forgot this was being left till contributor feedback!

ami-day commented 3 years ago

Stalled. Waiting for Anna/Jeongbin to respond. I emailed them to remind them today.

rays22 commented 3 years ago

Secondary review

It looks fine other than the issues below.

ami-day commented 3 years ago

Thanks @rays22 , I have made the changes. The organoid data protocols were orphaned because I had asked the authors whether they should be included (I couldn't see any organoid samples in the info they sent me). They have been terrible at getting back so I have removed those protocols.

ami-day commented 3 years ago

I have asked Tony about the GDPR question.

ami-day commented 3 years ago

Tony says it is fine to upload the raw data, so I will start on that.

ami-day commented 3 years ago

The data is transferring to an hca-util upload area.

ami-day commented 3 years ago

I am having problems with the data download from ENA. It looks to be an issue on the ENA server side. I have messaged Eugene to ask about this.

ami-day commented 3 years ago

Requested NCBI cloud delivery of data. It can only be delivered in SRA object format so I will need to convert it to fastq.
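The conversion step mentioned above could be sketched roughly as follows. This is only an illustration, assuming the SRA Toolkit's `fasterq-dump` is installed and on the `PATH`; the helper names `fasterq_dump_command` and `convert_run` and the output folder are hypothetical and not taken from this thread.

```python
import subprocess
from pathlib import Path
from typing import List


def fasterq_dump_command(run: str, outdir: Path) -> List[str]:
    """Build the SRA Toolkit command that converts one run to FASTQ.

    `run` is a run accession or a path to a local SRA object.
    """
    return [
        "fasterq-dump",
        "--split-files",          # paired reads go to separate _1/_2 files
        "--outdir", str(outdir),  # where the .fastq files are written
        run,
    ]


def convert_run(run: str, outdir: Path) -> None:
    """Run the conversion; raises CalledProcessError if fasterq-dump fails."""
    outdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(fasterq_dump_command(run, outdir), check=True)


# Example (not executed here): convert_run("ERR4699951", Path("fastq"))
```

Note that `fasterq-dump` writes uncompressed `.fastq` files, so a separate compression step would be needed to get `.fastq.gz` files like those named in the spreadsheet.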

ami-day commented 3 years ago

Uploading the fastq to an hca-util upload area from local folder on EC2.

ami-day commented 3 years ago

Alegria is downloading the fastq files from ENA (about 800-900 files) and then uploading them to an hca-util upload area. The project submission is here: https://contribute.data.humancellatlas.org/submissions/detail?uuid=15142b86-b5b3-49cb-bad0-cb3eb8ba0a79&project=72c636f3-d51f-4e5d-9cf8-9b91427a9e0c

The metadata is valid, except that the fastq files still need to be uploaded to ingest.

Wkt8 commented 3 years ago

Alegria is still downloading the fastqs from ENA, but I also noticed some discrepancies with the accession numbers and the fastq files.

Discrepancy 1: Cell Suspension ID: SAME6833352 with BioSamples ID: SAME6833352

In the spreadsheet, this entity is linked to: ERR4699951_1.fastq.gz ERR4699952_1.fastq.gz ERR4699953_1.fastq.gz ERR4699954_1.fastq.gz

However, according to the ENA Browser (https://www.ebi.ac.uk/ena/browser/view/PRJEB38269), those four fastq files should be linked to SAMEA6833353 instead.

Discrepancy 2: Additionally, the project on the ENA Browser contains 532 sample accessions, for a total of 1064 fastq files, but there are only 968 metadata sequence file entities in the spreadsheet.

As such, we weren't sure if we should:

Due to this and the time constraints in downloading the files we have decided to wait for release 8 for export.
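Both discrepancies above (a file linked to a different sample, and file counts that differ by 96) are the kind of thing a small script can list exhaustively. The sketch below is only an illustration: it assumes a tab-separated ENA file report with the columns `run_accession`, `sample_accession` and `fastq_ftp`, and a spreadsheet already reduced to a file-name-to-sample mapping. The function names and the example FTP path are hypothetical.

```python
import csv
import io


def parse_filereport(tsv_text):
    """Map each FASTQ file name to its sample accession from a TSV report."""
    file_to_sample = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # fastq_ftp holds one or more URLs separated by ';'
        for url in row["fastq_ftp"].split(";"):
            if url:
                file_to_sample[url.rsplit("/", 1)[-1]] = row["sample_accession"]
    return file_to_sample


def compare(ena, sheet):
    """Return (files only in ENA, files only in the sheet, sample mismatches)."""
    missing_from_sheet = sorted(set(ena) - set(sheet))
    missing_from_ena = sorted(set(sheet) - set(ena))
    mismatched = {
        name: (sheet[name], ena[name])  # (sample in sheet, sample in ENA)
        for name in set(ena) & set(sheet)
        if sheet[name] != ena[name]
    }
    return missing_from_sheet, missing_from_ena, mismatched


# Illustrative input only; the FTP host and path are placeholders.
report = (
    "run_accession\tsample_accession\tfastq_ftp\n"
    "ERR4699951\tSAMEA6833353\t"
    "ftp.example.org/x/ERR4699951_1.fastq.gz;ftp.example.org/x/ERR4699951_2.fastq.gz\n"
)
sheet = {"ERR4699951_1.fastq.gz": "SAME6833352"}
only_ena, only_sheet, mismatches = compare(parse_filereport(report), sheet)
```

Reporting the differences in both directions matters here, because a net gap in file counts can hide files that are missing on each side.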

Wkt8 commented 3 years ago

Files are now in this hca-util upload area: s3://hca-util-upload-area/1268551e-f2d0-43eb-9511-968e46901e72/ as mentioned in https://app.zenhub.com/workspaces/operations-5fa2d8f2df78bb000f7fb2b5/issues/ebi-ait/hca-ebi-dev-team/432

ami-day commented 3 years ago

@amnonkhen unfortunately I need to review this dataset and potentially make edits before it gets exported, so can we please move the milestone from July to August as I won't be back in time to make the updates?

ami-day commented 3 years ago

Working on getting the missing fastq files. Removing the July milestone as it won't be done by then, changing it to the August milestone.

ami-day commented 3 years ago

Emailed Oli about the missing samples on 02/08/2021. Some samples in the ENA study are missing from the dataset files they provided us with and some samples in the files are missing from ENA.

ami-day commented 3 years ago

Decided to go ahead and upload this dataset to the September release milestone using the fastqs available in ENA given no response from the authors. Currently re-uploading ~80 fastq files which were displayed as invalid in ingest prod.

aaclan-ebi commented 3 years ago

@ami-day I've removed the extra files for this submission: https://contribute.data.humancellatlas.org/submissions/detail?uuid=d3dc95e5-7154-4d2a-b684-81a852cfb9d9

ami-day commented 3 years ago

Thanks @aaclan-ebi , it's now exporting :)

Wkt8 commented 3 years ago

Successfully exported. It just needs the import form, which Ami is completing now.

ami-day commented 3 years ago

Curating to SCEA format. Assigning it E-HCAD-50

ami-day commented 3 years ago

Moving this to stalled as I'm not sure if this experimental design is suitable for SCEA. I have messaged them and am waiting for a reply.

ami-day commented 2 years ago

Pre-converted the files and uploaded them here: https://drive.google.com/drive/folders/14m-j87nBFQUi3yCrIDhTrQ54KHoCLCe2

ami-day commented 2 years ago

Re-assigned the HCAD-id to E-HCAD-51.

Uploaded the idf and sdrf files: https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/296

ami-day commented 2 years ago

E-HCAD-58: generated MAGE-TAB files, currently validating.

ami-day commented 2 years ago

Handed over to the SCEA team (GitLab) and now in review by them.