ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA and associated tasks, and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Add additional human data to organoid dataset #121

Closed mshadbolt closed 2 years ago

mshadbolt commented 4 years ago

Dataset/group this task is for: ingest

New 2nd submission: https://docs.google.com/spreadsheets/d/1iF0JnTjk89UxZ6HJAu42CdzaXvyfleKPYsLaYjSZNBA/edit#gid=208686926

https://contribute.data.humancellatlas.org/submissions/detail?uuid=6baac52c-93e0-4278-96cf-cd869d501178&project=005d611a-14d5-4fbf-846e-571a1f874f70

This relates to the organoid dataset: https://data.humancellatlas.org/explore/projects/005d611a-14d5-4fbf-846e-571a1f874f70

Description of the task:

As I was collecting contributor matrices I realised that the data we have has been published in the paper here:

https://www.nature.com/articles/s41586-019-1654-9?draft=collection

The Nature paper has multiple ArrayExpress accessions; I crossed out the ones that we probably don't want.

The expression data are also available for exploration in scApeX via the link https://bioinf.eva.mpg.de/shiny/sample-apps/scApeX/

There is also the issue that the data we have has been accessioned twice: once by us and once as part of the ArrayExpress submission.

Acceptance criteria for the task:

ami-day commented 2 years ago

Updating with new donors and samples (2nd submission)

ami-day commented 2 years ago

New submission metadata complete but need to ask @aaclan-ebi about which submission 1 uuids to add.

ami-day commented 2 years ago

The new submission is almost ready. The metadata is uploaded but @aaclan-ebi is going to first delete some errors from an old submission. The raw fastq files are downloading from NCBI.

Wkt8 commented 2 years ago

Put into secondary review when ready!

idazucchi commented 2 years ago

Picking this up for secondary review!

ami-day commented 2 years ago

@idazucchi and I need to discuss this dataset, setting up a meeting for Monday

idazucchi commented 2 years ago

Hello! I think you did a good job on this very complex dataset. As we discussed, I have some suggestions for this project: the most important ones are those on the modelling of the organoids and cell suspensions; the rest are suggestions of additional fields you could fill in.

Donor

Collection protocol

Cell line

You could add some information about:

Organoid

Cell suspension

Library prep

Sequence file

Analysis file

You could add the analysis files available from E-MTAB-7552

Expected number of entities

This is an attempt to check how many files/organoids/cell suspensions to expect in the finished spreadsheet.

| AE project accession | # organoids | # cell suspensions | # fastqs | Notes |
| --- | --- | --- | --- | --- |
| E-MTAB-7552 | 13 | 12 | 42 | without mixed samples from early organoid development; 2 libraries sequenced twice |
| E-MTAB-8234 | 3 | 125 | 250 | I think it's 3 organoids based on the info in Supplementary Table 1 (Fluidigm) |
| E-MTAB-8089 | 1008 | 696 | 1392 | under the assumption that each set of files corresponds to a different organoid |
| E-MTAB-8230 | n/a | n/a | n/a | not eligible: post-mortem donors; human samples were processed together with ape samples |
| E-MTAB-8231 | n/a | 30 | 60 | post-mortem donors; bulk data |
| E-MTAB-8228 | 12 | 12 | 24 | |
| Total | 1036 | 875 | 1768 | |
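
A quick way to double-check the Total row, summing the eligible accessions above (a throwaway sketch; the numbers are copied from the table, with E-MTAB-8231 counted as 0 organoids since it's bulk data):

```bash
# Sanity-check the Total row against the per-accession counts in the table above.
printf '%s\n' \
  "E-MTAB-7552 13 12 42" \
  "E-MTAB-8234 3 125 250" \
  "E-MTAB-8089 1008 696 1392" \
  "E-MTAB-8231 0 30 60" \
  "E-MTAB-8228 12 12 24" |
awk '{o += $2; c += $3; f += $4} END {print "organoids=" o, "cell_suspensions=" c, "fastqs=" f}'
# prints: organoids=1036 cell_suspensions=875 fastqs=1768
```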

Duplicated Biosample accessions

This is still an open question; I'm not sure how we want to address it.

gabsie commented 2 years ago

@ami-day will finish this for next release.

ami-day commented 2 years ago

Thanks @idazucchi, I have made almost all of these changes. About the potentially duplicated biosamples: I decided to remove those biosamples from this submission update. I am now downloading the fastq files so I can fill in the file names and upload to ingest.

ami-day commented 2 years ago

upload area: 04dfcbed-8a97-4788-9de2-eacb1f64c7ef/

ami-day commented 2 years ago

SRA file conversion is incredibly slow, and there are no other options for getting the raw data. Removing this dataset from release 14 and will tag it with release 15.
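
For context, a minimal sketch of the SRA route being used here, via sra-tools' prefetch + fasterq-dump (the accession is a placeholder, not one from this dataset):

```bash
# Hypothetical example of pulling raw reads from SRA when ENA fastqs are unavailable.
prefetch SRRXXXXXXX                        # fetch the .sra archive
fasterq-dump --split-files --threads 8 \
  --outdir fastq/ SRRXXXXXXX               # convert to per-read fastq files (the slow step)
gzip fastq/SRRXXXXXXX_*.fastq              # compress for upload
```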

prabh-t commented 2 years ago

Ami is having issues getting some data (from ENA?). FASTQs not available. Ami and Enrique to have a meeting to look into this.

jacobwindsor commented 2 years ago

This will probably be done today

ami-day commented 2 years ago

Bug in prod. Needs to be fixed (currently the fixes are in staging). Then I need to re-upload this submission.

prabh-t commented 2 years ago

Fix is now in prod. Submission cannot be deleted. Alegria to try to delete again.

aaclan-ebi commented 2 years ago

The submission with uuid 8e1cee01-4bd4-4c97-a201-a80fa8ec4b5f is now deleted. I'm not sure if my attempts the other day went through, but when I checked today the submission was already gone and there's already a new submission.

ami-day commented 2 years ago

Alegria and I are looking into an issue in ingest.

ami-day commented 2 years ago

Syncing the fastq files to ingest prod.
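
A sketch of what the sync might look like, assuming an S3-backed upload area (the bucket name is a placeholder; only the upload-area UUID quoted earlier is real):

```bash
# Hypothetical sync of local fastqs into the ingest upload area quoted above.
aws s3 sync ./fastq/ \
  "s3://<ingest-upload-bucket>/04dfcbed-8a97-4788-9de2-eacb1f64c7ef/" \
  --exclude "*" --include "*.fastq.gz"
```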

ami-day commented 2 years ago

submitted.

ami-day commented 2 years ago

exported.

MightyAx commented 2 years ago

In attempting to re-export this as part of the r16 missing file descriptor issue, we had an exporter failure: the project's supplementary files failed to export to Terra because they are "DCP1" files.

These supplementary files are exported with every experiment, so all the experiments "failed" to export: the file metadata followed a different schema in DCP1, and DCP2 metadata (file type, size and so on) has not been stored against these files. The error is somewhat moot, though, because the original HCA S3 bucket no longer operates, so we could not have retrieved the files to send to Terra in any case.

Either these supplementary files are important to add to this new dataset, in which case we should:

  1. remove the old supplementary file references
  2. get a copy of the files (either from dcp1 or our wrangling files)
  3. upload the files as supplementary files to the project
  4. re-export

Or the files are not relevant to the new dataset, in which case we can:

  1. remove the files from the project
  2. re-export

@ami and @yusra-haider, let's talk about this tomorrow.

s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/hipsci-ipsc-pipeline.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/Dissociation_protocol_130-092-628.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/CG00052_SingleCell3_ReagentKitv2UserGuide_RevE.pdf
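
As noted above, the original HCA S3 bucket no longer operates, so these objects cannot be fetched; for example, a listing attempt would fail:

```bash
# Expected to fail: the DCP1 upload bucket has been decommissioned.
aws s3 ls "s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/"
```
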
ipediez commented 2 years ago

Do we need the supplementary files or not? A discussion needs to happen between @ami-day and @idazucchi, maybe for release 18 rather than 17, so no priority here.

ESapenaVentura commented 2 years ago

I have the 3 supplementary files at hand, let me know if you need them (also sent them to @idazucchi).

idazucchi commented 2 years ago

Enrique retrieved the missing files from the DCP. We need to discuss how we want to proceed

ofanobilbao commented 2 years ago

@idazucchi @ami-day @yusra-haider @ESapenaVentura to have a discussion on this today

ami-day commented 2 years ago

Hi @MightyAx, if these files remain associated with the initial dataset as it was submitted, but are not present in the 2nd (update) submission, that is OK. Can they be deleted and the submission re-exported?

idazucchi commented 2 years ago

At the moment the plan is to:

This should allow ingest to find the files and export correctly, and should allow future updates without running into this issue again.

please @yusra-haider feel free to edit my explanation to make it more accurate/detailed

@ami-day we can't really delete these files because they are linked to the protocols of the 1st submission, which are reused in the 2nd submission

yusra-haider commented 2 years ago

I've been looking at the code, and it seems it won't be possible to update the cloudURL using the REST API calls, as we have guards against that.

I'm planning on:

This is a one-time solution and we need to come up with a more permanent fix for the future.

I'm planning on testing this out in another environment before doing these steps for this project on prod.

ESapenaVentura commented 2 years ago

Tried:

It's been exporting for a while but the status did not change.

We suspect the dataset is exported but need to investigate.

yusra-haider commented 2 years ago

Steps taken to fix this issue:

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2b"

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2c"

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2d"
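
(The same three calls in loop form, for readability; identical endpoint and IDs.)

```bash
# Loop form of the three PATCH requests above: re-attach each supplementary file to the project.
BASE="https://api.ingest.archive.data.humancellatlas.org"
for id in 5cdbdd82d96dad0008592f2b 5cdbdd82d96dad0008592f2c 5cdbdd82d96dad0008592f2d; do
  curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
    "$BASE/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "$BASE/files/$id"
done
```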

This is a one-off fix, and this issue will arise again when we want to update and re-export something in this dataset.

yusra-haider commented 2 years ago

The submission shows that it's still stuck in exporting, but we (@ESapenaVentura and I) checked the data on the GCP bucket, and the project has been successfully exported to the staging area.

An import request has been sent for this dataset for Release 17.

There are no error logs in the exporter or core for this submission, and I'm trying to figure out why the submission is stuck in the exporting state despite the successful export to the staging area.

yusra-haider commented 2 years ago

This check might be the culprit in this case. For this dataset the number of processes in the submission being exported is 1818, whereas the number of processes in the GCP bucket is 1820, because of linkages with the previous submission.
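
A sketch of how the bucket-side count could be checked (the staging-area bucket and path layout below are placeholders, not the real values):

```bash
# Hypothetical count of process metadata files in the staging area, to compare
# against the 1818 processes ingest counts for this submission.
gsutil ls "gs://<staging-bucket>/<staging-prefix>/metadata/process/*.json" | wc -l
# expected here: 1820 (includes 2 processes linked from the previous submission)
```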

Wkt8 commented 2 years ago

This still needs the state of the dataset to be set to 'exported'

Wkt8 commented 2 years ago

Has been set to 'exported'.

ami-day commented 2 years ago

Was the import form submitted?

Wkt8 commented 2 years ago

Yes. An import form was submitted for Release 17 (on 31st May) specifying New raw data, New contributor-generated matrix, New metadata, Updated metadata

idazucchi commented 2 years ago

Needs an update