ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA and associated tasks, and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Add additional human data to organoid dataset #121

Closed mshadbolt closed 2 years ago

mshadbolt commented 4 years ago

Dataset/group this task is for: ingest

New 2nd submission: https://docs.google.com/spreadsheets/d/1iF0JnTjk89UxZ6HJAu42CdzaXvyfleKPYsLaYjSZNBA/edit#gid=208686926

https://contribute.data.humancellatlas.org/submissions/detail?uuid=6baac52c-93e0-4278-96cf-cd869d501178&project=005d611a-14d5-4fbf-846e-571a1f874f70

This relates to the organoid dataset: https://data.humancellatlas.org/explore/projects/005d611a-14d5-4fbf-846e-571a1f874f70

Description of the task:

As I was collecting contributor matrices I realised that the data we have has been published in the paper here:

https://www.nature.com/articles/s41586-019-1654-9?draft=collection

The Nature paper has multiple ArrayExpress accessions; I crossed out the ones that we probably don't want.

The expression data are also available for exploration in scApeX via the link https://bioinf.eva.mpg.de/shiny/sample-apps/scApeX/

There is also the issue that the data we have has been accessioned twice: once by us and once as part of the ArrayExpress submission.

Acceptance criteria for the task:

ami-day commented 2 years ago

Updating with new donors and samples (2nd submission)

ami-day commented 2 years ago

New submission metadata complete but need to ask @aaclan-ebi about which submission 1 uuids to add.

ami-day commented 2 years ago

The new submission is almost ready. The metadata is uploaded but @aaclan-ebi is going to first delete some errors from an old submission. The raw fastq files are downloading from NCBI.

Wkt8 commented 2 years ago

Put into secondary review when ready!

idazucchi commented 2 years ago

Picking this up for secondary review!

ami-day commented 2 years ago

@idazucchi and I need to discuss this dataset, setting up a meeting for Monday

idazucchi commented 2 years ago

Hello! I think you did a good job on this very complex dataset. As we discussed, I have some suggestions for this project: the most important ones are those on the modelling of the organoids and cell suspensions; the rest are suggestions of additional fields you could fill in.

Donor

Collection protocol

Cell line

You could add some information about:

Organoid

Cell suspension

Library prep

Sequence file

Analysis file

You could add the analysis files available from E-MTAB-7552

Expected number of entities

This is an attempt to check how many files/organoids/cell suspensions to expect in the finished spreadsheet.

| AE project accession | # organoids | # cell suspensions | # fastqs | Notes |
| --- | --- | --- | --- | --- |
| E-MTAB-7552 | 13 | 12 | 42 | without mixed samples from early organoid development; 2 libraries sequenced twice |
| E-MTAB-8234 | 3 | 125 | 250 | I think it's 3 organoids based on the info in Supplementary Table 1 (Fluidigm) |
| E-MTAB-8089 | 1008 | 696 | 1392 | under the assumption that each set of files corresponds to a different organoid |
| E-MTAB-8230 | n/a | n/a | n/a | not eligible: post-mortem donors; human samples were processed together with ape samples |
| E-MTAB-8231 | n/a | 30 | 60 | post-mortem donors; bulk data |
| E-MTAB-8228 | 12 | 12 | 24 | |
| Total | 1036 | 875 | 1768 | |
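
A quick way to double-check the Total row, summing the eligible accessions above (a throwaway sketch; the numbers are copied from the table, with E-MTAB-8231 counted as 0 organoids since it's bulk data):

```bash
# Sanity-check the Total row against the per-accession counts in the table above.
printf '%s\n' \
  "E-MTAB-7552 13 12 42" \
  "E-MTAB-8234 3 125 250" \
  "E-MTAB-8089 1008 696 1392" \
  "E-MTAB-8231 0 30 60" \
  "E-MTAB-8228 12 12 24" |
awk '{o += $2; c += $3; f += $4} END {print "organoids=" o, "cell_suspensions=" c, "fastqs=" f}'
# prints: organoids=1036 cell_suspensions=875 fastqs=1768
```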

Duplicated Biosample accessions

This is still an open question; I'm not sure how we want to address it.

gabsie commented 2 years ago

@ami-day will finish this for next release.

ami-day commented 2 years ago

Thanks @idazucchi, I have made almost all of these changes. About the potentially duplicated biosamples: I decided to remove those biosamples from this submission update. I am now downloading the fastq files so I can fill in the file names and upload to ingest.

ami-day commented 2 years ago

upload area: 04dfcbed-8a97-4788-9de2-eacb1f64c7ef/

ami-day commented 2 years ago

SRA file conversion is incredibly slow, and there are no other options for getting the raw data. Removing this dataset from release 14 and will tag it with release 15.
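
For context, a minimal sketch of the SRA route being used here, via sra-tools' prefetch + fasterq-dump (the accession is a placeholder, not one from this dataset):

```bash
# Hypothetical example of pulling raw reads from SRA when ENA fastqs are unavailable.
prefetch SRRXXXXXXX                        # fetch the .sra archive
fasterq-dump --split-files --threads 8 \
  --outdir fastq/ SRRXXXXXXX               # convert to per-read fastq files (the slow step)
gzip fastq/SRRXXXXXXX_*.fastq              # compress for upload
```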

prabh-t commented 2 years ago

Ami is having issues getting some data (from ENA?). FASTQs not available. Ami and Enrique to have a meeting to look into this.

jacobwindsor commented 2 years ago

This will probably be done today

ami-day commented 2 years ago

Bug in prod. Needs to be fixed (currently the fixes are in staging). Then I need to re-upload this submission.

prabh-t commented 2 years ago

Fix is now in prod. Submission cannot be deleted. Alegria to try to delete again.

aaclan-ebi commented 2 years ago

The submission with uuid 8e1cee01-4bd4-4c97-a201-a80fa8ec4b5f is now deleted. I'm not sure if my attempts the other day went through, but when I checked today the submission was already gone and there's already a new submission.

ami-day commented 2 years ago

Alegria and I are looking into an issue in ingest.

ami-day commented 2 years ago

Syncing the fastq files to ingest prod.
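
A sketch of what the sync might look like, assuming an S3-backed upload area (the bucket name is a placeholder; only the upload-area UUID quoted earlier is real):

```bash
# Hypothetical sync of local fastqs into the ingest upload area quoted above.
aws s3 sync ./fastq/ \
  "s3://<ingest-upload-bucket>/04dfcbed-8a97-4788-9de2-eacb1f64c7ef/" \
  --exclude "*" --include "*.fastq.gz"
```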

ami-day commented 2 years ago

submitted.

ami-day commented 2 years ago

exported.

MightyAx commented 2 years ago

In attempting to re-export this as part of the r16 missing file descriptor issue, we had an exporter failure: the project's supplementary files failed to export to Terra because they are "DCP1" files.

These supplementary files are exported with every experiment, so all the experiments "failed" to export: the file metadata followed a different schema in DCP1, and DCP2 metadata (file type, size and so on) has not been stored against these files. The error is somewhat moot, though, because the original HCA S3 bucket no longer operates, so we could not have retrieved the files to send to Terra in any case.

Either these supplementary files are important to add to this new dataset, in which case we should:

  1. remove the old supplementary file references
  2. get a copy of the files (either from dcp1 or our wrangling files)
  3. upload the files as supplementary files to the project
  4. re-export

Or the files are not relevant to the new dataset, in which case we can:

  1. remove the files from the project
  2. re-export

@ami and @yusra-haider, let's talk about this tomorrow.

s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/hipsci-ipsc-pipeline.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/Dissociation_protocol_130-092-628.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/CG00052_SingleCell3_ReagentKitv2UserGuide_RevE.pdf
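
As noted above, the original HCA S3 bucket no longer operates, so these objects cannot be fetched; for example, a listing attempt would fail:

```bash
# Expected to fail: the DCP1 upload bucket has been decommissioned.
aws s3 ls "s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/"
```
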
ipediez commented 2 years ago

Do we need the supplementary files or not? A discussion needs to happen between @ami-day and @idazucchi, maybe for release 18 rather than 17, so no priority here.

ESapenaVentura commented 2 years ago

I have the 3 supplementary files at hand, let me know if you need them (also sent them to @idazucchi).

idazucchi commented 2 years ago

Enrique retrieved the missing files from the DCP. We need to discuss how we want to proceed

ofanobilbao commented 2 years ago

@idazucchi @ami-day @yusra-haider @ESapenaVentura to have a discussion on this today

ami-day commented 2 years ago

Hi @MightyAx, if these files remain associated with the initial dataset as it was submitted, but are not present in the 2nd (update) submission, that is OK. Can they be deleted and the submission re-exported?

idazucchi commented 2 years ago

At the moment the plan is to:

This should allow ingest to find the files and export correctly, and should allow future updates without running into this issue again.

please @yusra-haider feel free to edit my explanation to make it more accurate/detailed

@ami-day we can't really delete these files because they are linked to the protocols of the 1st submission, which are reused in the 2nd submission

yusra-haider commented 2 years ago

I've been looking at the code, and it seems it won't be possible to update the cloudURL using the REST API calls, as we have guards against that.

I'm planning on:

This is a one-time solution and we need to come up with a more permanent fix for the future.

I'm planning on testing this out in another environment before doing these steps for this project on prod.

ESapenaVentura commented 2 years ago

Tried:

It's been exporting for a while but the status did not change.

We suspect the dataset is exported but need to investigate.

yusra-haider commented 2 years ago

Steps taken to fix this issue:

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2b"

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2c"

curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2d"
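
(The same three calls in loop form, for readability; identical endpoint and IDs.)

```bash
# Loop form of the three PATCH requests above: re-attach each supplementary file to the project.
BASE="https://api.ingest.archive.data.humancellatlas.org"
for id in 5cdbdd82d96dad0008592f2b 5cdbdd82d96dad0008592f2c 5cdbdd82d96dad0008592f2d; do
  curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
    "$BASE/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" -d "$BASE/files/$id"
done
```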

This is a one-off fix, and this issue will arise again when we want to update and re-export something in this dataset.

yusra-haider commented 2 years ago

The submission shows that it's still stuck in exporting, but we (@ESapenaVentura and I) checked the data on the GCP bucket, and the project has been successfully exported to the staging area.

An import request has been sent for this dataset for Release 17.

There are no error logs in the exporter or core for this submission, and I'm trying to figure out why the submission is stuck in the exporting state despite the successful export to the staging area.

yusra-haider commented 2 years ago

This check might be the culprit in this case. For this dataset the number of processes in the submission being exported is 1818, whereas the number of processes in the GCP bucket is 1820, because of linkages with the previous submission.
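
A sketch of how the bucket-side count could be checked (the staging-area bucket and path layout below are placeholders, not the real values):

```bash
# Hypothetical count of process metadata files in the staging area, to compare
# against the 1818 processes ingest counts for this submission.
gsutil ls "gs://<staging-bucket>/<staging-prefix>/metadata/process/*.json" | wc -l
# expected here: 1820 (includes 2 processes linked from the previous submission)
```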

Wkt8 commented 2 years ago

This still needs the state of the dataset to be set to 'exported'

Wkt8 commented 2 years ago

Has been set to 'exported'.

ami-day commented 2 years ago

Was the import form submitted?

Wkt8 commented 2 years ago

Yes. An import form was submitted for Release 17 (on 31st May) specifying New raw data, New contributor-generated matrix, New metadata, Updated metadata

idazucchi commented 2 years ago

Needs an update