Carlos Talavera-Lopez Heart single cell & single nuclei 10x

mshadbolt commented 4 years ago

Primary Wrangler: Marion Shadbolt

Secondary Wrangler: Enrique

Associated files:

Google Drive: https://drive.google.com/open?id=1gnB0anWGADLlwhGHLBQb3AowIpB8145p&authuser=mshadbolt@ebi.ac.uk&usp=drive_fs

Key Events

[x] Send initial instructions about contributing, including questionnaire and T&Cs
~~[ ] Receive project questionnaire and move project from potential to in_progress projects~~
[x] Assign the Primary Wrangler
[x] Receive T&C form filled out by at least 1 representative of group/project
[x] Send custom metadata template spreadsheet, how-to guide and directions for submitting data files
[x] Receive spreadsheet (to a sufficient standard to submit dataset)
[x] Receive data (to a sufficient standard to submit dataset)
[x] Curate metadata spreadsheet with ontologies
[x] Upload sheet to validate metadata (most likely in staging)
[x] Check linking - validate with graph validator
[x] Validate data files
[x] Ask the Secondary Wrangler for an end-to-end review of the project. Ask the Expertise Wrangler to review specific tabs if needed
[x] Get final approval of submission including metadata spreadsheet and data files
[x] Submit dataset to Production and hand off tracking to data dashboard
[ ] Email UX to inform that they're ready to send survey
[ ] Broker to SCEA

mshadbolt commented 4 years ago

Previous ticket is here: https://github.com/HumanCellAtlas/hca-data-wrangling/issues/411

Current status is that I sent largely filled out spreadsheet for review today as well as instructions for data upload.

I am hoping to proceed with archiving in EBI archives as soon as data is reviewed.

mshadbolt commented 4 years ago

sent nudge email

mshadbolt commented 4 years ago

emailed Paul to see if he has any advice on how to progress this.

mshadbolt commented 4 years ago

received updated spreadsheet and should be able to now finish the submission, hoping to get accessions by the early next week

ESapenaVentura commented 4 years ago

Secondary review done - A couple of comments: Donor organism

Medications: For a couple of donors, do they mean "cannabis"? It's written "cannibis"
Smoking history: More of a comment than anything, all donors are dead so it feels kinda wrong to state "current smoker". Maybe "Former smoker" would be better? (Not entirely sure)

Other than that LGTM! I have checked linking from specimen to donor and from donor to cell suspension, from cell_suspension to file it's kinda hard because the IDs for most files don't match but I trust they are ok. I have also checked the library preps and it looks fine!

mshadbolt commented 4 years ago

I changed the cannibis to cannabis. I didn't change the smoker status.

I have uploaded all the files that I have to the submission here: https://ui.ingest.archive.data.humancellatlas.org/submissions/detail?id=5f19a1fcfe9c934c8b83515f&project=ad98d3cd-26fb-4ee3-99c9-8a2ab085e737

There are 75 remaining files to be uploaded to the upload area and then transferred to the submission upload area.

I have provided @ESapenaVentura with the relevant information to be able to progress the submission through archiving tomorrow while I am away, presuming that the rest of the files are uploaded.

mshadbolt commented 4 years ago

Just 3 files left that we are waiting to get uploaded. 765/768 files valid against the submission.

I emailed Carlos to let them know in case they didnt realise and let Enrique know the last remaining files that will need to be transferred.

ESapenaVentura commented 4 years ago

Had to re-submit because there was a problem with the lane_indexes for the following files:

HCAHeart7664652_S1_L001_I1_001.fastq.gz
HCAHeart7698015_S1_L001_I1_001.fastq.gz
HCAHeart7664652_S1_L001_R1_001.fastq.gz
HCAHeart7698015_S1_L001_R1_001.fastq.gz
HCAHeart7664652_S1_L001_R2_001.fastq.gz
HCAHeart7698015_S1_L001_R2_001.fastq.gz
HCAHeart7664653_S1_L001_I1_001.fastq.gz
HCAHeart7702873_S1_L001_I1_001.fastq.gz
HCAHeart7664653_S1_L001_R1_001.fastq.gz
HCAHeart7702873_S1_L001_R1_001.fastq.gz
HCAHeart7664653_S1_L001_R2_001.fastq.gz
HCAHeart7702873_S1_L001_R2_001.fastq.gz
HCAHeart7757637_S1_L001_I1_001.fastq.gz
HCAHeart7985087_S1_L001_I1_001.fastq.gz
HCAHeart7757637_S1_L001_R1_001.fastq.gz
HCAHeart7985087_S1_L001_R1_001.fastq.gz
HCAHeart7757637_S1_L001_R2_001.fastq.gz
HCAHeart7985087_S1_L001_R2_001.fastq.gz
HCAHeart7702876_S1_L001_I1_001.fastq.gz
HCAHeart7702877_S1_L001_I1_001.fastq.gz
HCAHeart7702876_S1_L001_R1_001.fastq.gz
HCAHeart7702877_S1_L001_R1_001.fastq.gz
HCAHeart7702876_S1_L001_R2_001.fastq.gz
HCAHeart7702877_S1_L001_R2_001.fastq.gz
HCAHeart7757638_S1_L001_I1_001.fastq.gz
HCAHeart7985088_S1_L001_I1_001.fastq.gz
HCAHeart7757638_S1_L001_R1_001.fastq.gz
HCAHeart7985088_S1_L001_R1_001.fastq.gz
HCAHeart7757638_S1_L001_R2_001.fastq.gz
HCAHeart7985088_S1_L001_R2_001.fastq.gz
HCAHeart7829976_S1_L001_I1_001.fastq.gz
HCAHeart7985089_S1_L001_I1_001.fastq.gz
HCAHeart7829976_S1_L001_R1_001.fastq.gz
HCAHeart7985089_S1_L001_R1_001.fastq.gz
HCAHeart7829976_S1_L001_R2_001.fastq.gz
HCAHeart7985089_S1_L001_R2_001.fastq.gz
HCAHeart7664654_S1_L001_I1_001.fastq.gz
HCAHeart7757636_S1_L001_I1_001.fastq.gz
HCAHeart7985086_S1_L001_I1_001.fastq.gz
HCAHeart7664654_S1_L001_R1_001.fastq.gz
HCAHeart7757636_S1_L001_R1_001.fastq.gz
HCAHeart7985086_S1_L001_R1_001.fastq.gz
HCAHeart7664654_S1_L001_R2_001.fastq.gz
HCAHeart7757636_S1_L001_R2_001.fastq.gz
HCAHeart7985086_S1_L001_R2_001.fastq.gz
HCAHeart7702874_S1_L001_I1_001.fastq.gz
HCAHeart7702875_S1_L001_I1_001.fastq.gz
HCAHeart7702874_S1_L001_R1_001.fastq.gz
HCAHeart7702875_S1_L001_R1_001.fastq.gz
HCAHeart7702874_S1_L001_R2_001.fastq.gz
HCAHeart7702875_S1_L001_R2_001.fastq.gz
HCAHeart7702878_S1_L001_I1_001.fastq.gz
HCAHeart7702879_S1_L001_I1_001.fastq.gz
HCAHeart7702878_S1_L001_R1_001.fastq.gz
HCAHeart7702879_S1_L001_R1_001.fastq.gz
HCAHeart7702878_S1_L001_R2_001.fastq.gz
HCAHeart7702879_S1_L001_R2_001.fastq.gz
HCAHeart7702881_S1_L001_I1_001.fastq.gz
HCAHeart7702882_S1_L001_I1_001.fastq.gz
HCAHeart7702881_S1_L001_R1_001.fastq.gz
HCAHeart7702882_S1_L001_R1_001.fastq.gz
HCAHeart7702881_S1_L001_R2_001.fastq.gz
HCAHeart7702882_S1_L001_R2_001.fastq.gz
HCAHeart7656539_S1_L001_I1_001.fastq.gz
HCAHeart7702880_S1_L001_I1_001.fastq.gz
HCAHeart7656539_S1_L001_R1_001.fastq.gz
HCAHeart7702880_S1_L001_R1_001.fastq.gz
HCAHeart7656539_S1_L001_R2_001.fastq.gz
HCAHeart7702880_S1_L001_R2_001.fastq.gz

There were more than 1 set per library prep with the same lane_index. I have to investigate why did the ingest-graph-validator not pick this up.

Also project - contributor names was incorrectly formatted, causing a problem with the archiver. Corrrected that as well.

New submission: https://ui.ingest.archive.data.humancellatlas.org/submissions/detail?id=5f1b0e98fe9c934c8b835c80&project=ad98d3cd-26fb-4ee3-99c9-8a2ab085e737 New spreadsheet: Same folder, same name with an added suffix.

Currently files are being uploaded. Once uploaded I will set them all to "valid" to avoid waiting time. They have been already validated once in the previous submission and will be validated again in the bam conversion jobs and when uploaded to ENA, so it's fair to say this is pretty safe.

ESapenaVentura commented 4 years ago

New DSP submission: 67c34d20-eaf5-4aa8-bfc8-31dd4e97829f

Currently getting a "500 internal server error" when trying to retrieve entities. Will try again later

ESapenaVentura commented 4 years ago

Scratch that, new dsp submission: 6d20bce5-86fb-4a52-bd54-63838cab18a9

Just ran the file archiver on the EBI cluster. I needed to create a new folder under /hca/ because the one @mshadbolt created doesn't have write permissions for other users (Same folder name + "_enrique", happy to rename it afterwards)

256 jobs were sent (256 * 3 = 768 files), and so far some are already running and not finishing immediately, so it's looking good!

All the scripts used to parallelise the file upload are in the same folder.

ESapenaVentura commented 4 years ago

Little update:

The DSP submission contains the following submittables:

BioStudies: 1
BioSamples: 232
EnaStudies: 1
SequencingExperiments: 136
SequencingRuns: 256

mshadbolt commented 4 years ago

Project has been archived under ENA accessions ERP123138 - PRJEB39602

Hasn't yet been exported.

mshadbolt commented 3 years ago

@ESapenaVentura I have made a new spreadsheet in the brokering folder here: https://drive.google.com/open?id=1rIxeMhRg7jUFrf4a3dVqj-JTux0ZUuQP If you can do a review that would be amazing. thanks!

Main changes are:

Replaced publication with the nature publication, it is still early release so doesn't have a PMID as yet (also added against the project in ingest since this will be ignored anyways)
When there are multiple cell suspensions with the same donor and specimen, added _1 _2 etc
rejigged the sequence file input suspensions, library prep grouping, lane indices
copied the biosamples and ENA run accessions into spreadsheet against donor, specimen, cell suspension and sequence files, the additional cell suspensions don't yet have accessions until we add/update them with DSP

ESapenaVentura commented 3 years ago

@mshadbolt Done! Everything looks alright, I have just added the PMID for the publication (32971526)

Everything else looks fine!

mshadbolt commented 3 years ago

oh thanks! I literally looked yesterday and couldnt find it, oops! thanks for the review

mshadbolt commented 3 years ago

I believe I have fixed the experiment/run linking in the ENA submission but need to give the database some time to reindex so want to check in tomorrow. the basic process was:

create new biosamples entries for the new cell suspensions
create new empty experiments that link to the biosamples above
use the webin user interface to edit the xml for each run that should link to the new experiments

mshadbolt commented 3 years ago

I had made a start on converting this project to SCEA under accession E-HCAD-28 https://drive.google.com/open?id=1tNTsxbeepQh7F4ZF3B13vvghbR09D3Gp

I think there were multiple issues with the conversion that I raised with @ami-day because of some of the assumptions in the code around where various accessions get put from the geo-to-hca script, whereas this was manually curated from a contributor. I am not 100% sure if those things have been changed in the converter so it would be worth trying to run the conversion again, or if it will just require a lot of manual curation to ensure the correct accessions are put in the columns that the SCEA curators like to see them.

ami-day commented 3 years ago

Taking this on

ofanobilbao commented 3 years ago

@ami-day I can't tell by scrolling on the ticket where this dataset is on the DCP journey. I believe it was submitted to DCP. But I don't know if it should be on the Finished column or where. Could you, please, move where appropriate? Thanks! Apart from needing to be brokered to SCEA, does it need any updates or archiving? Thanks!

ofanobilbao commented 3 years ago

It really looks like finished so moving it there.

Wkt8 commented 2 years ago

From the comments it looks like this should be in 'broker to SCEA' as it's already been archived with fastq files in ENA (ERP123138) and is live.

ami-day commented 2 years ago

Assigned E-HCAD id: E-HCAD-47

ESapenaVentura commented 2 years ago

E-HCAD43 already exists please use the next one! https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/230 it's stuck due to de-prioritisation and a problem with file upload

ami-day commented 2 years ago

E-HCAD43

Ok I have assigned it E-HCAD-47,E-HCAD-48,E-HCAD-49 (split by library prep. method type). It definitely needs checking and potentially merging/correcting. The files can be found here: https://drive.google.com/drive/folders/1tNTsxbeepQh7F4ZF3B13vvghbR09D3Gp

ami-day commented 1 year ago

Google sheet: https://docs.google.com/spreadsheets/d/1rIxeMhRg7jUFrf4a3dVqj-JTux0ZUuQP/edit#gid=792951810

ami-day commented 1 year ago

In review by SCEA team and handed over to SCEA team (Gitlab).

ebi-ait / hca-ebi-wrangler-central

Carlos Talavera-Lopez Heart single cell & single nuclei 10x #3