GSE119212 - Decoding the development of human hippocampus

ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.

https://ebi-ait.github.io/hca-ebi-wrangler-central/

Apache License 2.0

GSE119212 - Decoding the development of human hippocampus #193

Open ESapenaVentura opened 3 years ago

ESapenaVentura commented 3 years ago

Primary Wrangler: Enrique Secondary Wrangler: Ami

Associated files:

Google Drive: https://drive.google.com/drive/folders/1N523DIJEXYvQwvCzqUEcGxcURot955h2

Published study links

Paper: https://www.nature.com/articles/s41586-019-1917-5

Accessioned data: GSE119212

Key Events

[x] convert published metadata to HCA spreadsheet
[x] manually curate dataset to meet HCA metadata standard
[x] Upload sheet to validate metadata
[x] Check linking using ingest graph validator
[x] Transfer files to ingest to validate data files
[x] Ask the Secondary Wrangler for an end-to-end review of the project. Ask the Expertise Wrangler to review specific tabs if needed
[x] Submit dataset to Production

ESapenaVentura commented 3 years ago

The files are in BAM format; I am converting them to FASTQ using 10x's tool "BamToFastqConverter" on the EC2 instance.

ESapenaVentura commented 3 years ago

The test run for the conversion went well, although it generated too many FASTQ files. I am using a higher number of reads per file and re-launching for all the BAM files!
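A rough sketch of how the re-launch over all BAM files could be scripted. This is not the command that was actually run: the executable name `bamtofastq`, the `--reads-per-fastq` option (as I understand it from 10x's documentation, worth confirming with the tool's `--help`), and the helper `build_bamtofastq_commands` with its directory layout are all assumptions for illustration.

```python
from pathlib import Path


def build_bamtofastq_commands(bam_dir, out_root, reads_per_fastq, executable="bamtofastq"):
    """Return one command (as an argument list) per BAM file found in bam_dir.

    Hypothetical helper: executable name and option spelling are assumptions.
    """
    commands = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        # Give each BAM its own output folder, named after the BAM file.
        out_dir = Path(out_root) / bam.stem
        commands.append([
            executable,
            # A larger value means fewer, bigger FASTQ files per BAM.
            f"--reads-per-fastq={reads_per_fastq}",
            str(bam),
            str(out_dir),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; building the commands first makes it easy to print and inspect them before launching anything on the EC2 instance.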

ESapenaVentura commented 3 years ago

The conversion yielded valid R1 and R2 files and, for some reason, empty I1 files. I am re-formatting the spreadsheet to leave out the I1 files, as they are not needed (the files are already demultiplexed).
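For anyone wanting to confirm which index files came out empty before dropping them from the spreadsheet, a minimal sketch follows. It assumes the converted files sit under one directory and carry `_I1_` in their names (an assumed naming convention, not confirmed in this thread); the function names are hypothetical.

```python
import gzip
from pathlib import Path


def is_empty_fastq(path):
    """Return True if the FASTQ file (gzipped or plain) contains no data."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rb") as handle:
        # Reading a single byte is enough to tell empty from non-empty.
        return handle.read(1) == b""


def find_empty_index_files(fastq_dir, index_tag="_I1_"):
    """List FASTQ files whose name contains index_tag and that hold no reads.

    index_tag is an assumed naming convention; adjust it to the real file names.
    """
    candidates = sorted(Path(fastq_dir).rglob("*.fastq*"))
    return [p for p in candidates if index_tag in p.name and is_empty_fastq(p)]
```

Checking the decompressed content rather than the file size matters here, because an empty gzipped FASTQ still occupies a few bytes on disk.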

ESapenaVentura commented 3 years ago

The files have been uploaded and they are valid

Submission here https://contribute.data.humancellatlas.org/submissions/detail?uuid=7d5c9287-cf4a-4948-b4d5-5c6123ad78ad&project=6c040a93-8cf8-4fd5-98de-2297eb07e9f6

I ran the ingest-graph-validator tool and it fails the check for the 5 degree minimum graph, which is expected because this dataset contains bulk data (directly from tissue to files)

ami-day commented 3 years ago

@ESapenaVentura I have reviewed this, great to see lots of cells and primary brain tissue in this project! I made some minor changes to the sheet linked above.

I added the fastq creation method to the sequencing protocol tab.
The name and description of one donor organism were inaccurate/duplicated; I changed them from GW25 to GW27.
Some of the addresses in the Contributors tab are in Chinese, and some in English. Is this intended?
I changed both the project title and description slightly because the title did not mention scRNA-Seq (only ATAC-Seq) and the description started in what seemed like mid-sentence within an abstract. I think this is likely because the script automatically extracted the appropriate XML field, but the field itself was inaccurate/incomplete.
I switched the SRA study accession and Project accessions because I think they weren't in the correct columns (the column name is misleading so this happens to me often). I'm sorry if I'm incorrect and it causes an import error!

Apart from those changes, I would probably add more information about the collection protocol, to include the short paragraph about consent and that samples were derived from elective termination.

ESapenaVentura commented 3 years ago

Hi @ami-day , did you overwrite the spreadsheet? I can only see a "v3_noIndex", but I can't find the "v2_noIndex". It should be alright, I'm just curious!

Other than that, I think the changes are perfectly fine. I may have swapped the accessions around because the pattern recognition in our schema is wrong.

About the addresses, I couldn't find the address in English. If you were able to, please let me know and I'll change it immediately!

About the elective termination, I can add that to the cause of death, but since we are treating the embryos as donors, the collection refers to the hippocampus collection, but not how it was "retrieved" from the mother.

I will make that last addition and, if you give me the green light, we can say this is ready for exporting!

ami-day commented 3 years ago

Hi @ESapenaVentura,

Yes, I made changes in the original spreadsheet because they were very minor changes to resolve errors.
About the addresses: I usually google search the university/institute names separately to find the addresses if they are not available. If you have done this and were still unable to find them in English, I think it's ok to keep the Chinese.
The other additions you made sound good! I think this is ready for upload to ingest prod. now!

ESapenaVentura commented 3 years ago

I added the consent to the collection protocol and reviewed the changes.

Ready to go!

ESapenaVentura commented 3 years ago

New submission: https://contribute.data.humancellatlas.org/submissions/detail?id=601970b2ac1792031eee0b88&project=6c040a93-8cf8-4fd5-98de-2297eb07e9f6

Files are validating but I think there is some trouble with validation

rays22 commented 3 years ago

This dataset has been exported to the Terra staging area.

ofanobilbao commented 3 years ago

@ESapenaVentura I moved this to Finished on the wrangling board, as I believe this dataset is done from the DCP perspective, right? Thanks!

ofanobilbao commented 3 years ago

@ESapenaVentura I just removed the dataset label. If/when we need updates on this we will need to add the label again so that it shows on the wrangling board. Does that sound ok?

ESapenaVentura commented 3 years ago

@ESapenaVentura To pick this one for the SCEA testing

ami-day commented 2 years ago

@ESapenaVentura did you start the SCEA conversion for this dataset?

ami-day commented 2 years ago

Is this the correct latest version? https://docs.google.com/spreadsheets/d/1ASSe6qt_sXWOdnCS6_1UJ2Hn-l2mMDUo/edit#gid=202266375 I will pre-convert it using the latest version of the hca2scea tool. I think the multi protocols might have caused an issue if you used a previous version of the tool.

ESapenaVentura commented 2 years ago

https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/230 E-HCAD43. Currently stalled due to a problem with file upload and de-prioritisation

ami-day commented 2 years ago

> https://gitlab.ebi.ac.uk/ebi-gene-expression/scxa-metadata/-/merge_requests/230 E-HCAD43. Currently stalled due to a problem with file upload and de-prioritisation

Ok, I have just updated the SDRF file based on Anja's comments and re-uploaded the updated version to the GitLab branch. All that is left now is to transfer the FASTQ files to them via the EBI cluster (not ideal, but it is the only way to do it). I have asked Anja where to send them on the cluster.