Updating with new donors and samples (2nd submission)
New submission metadata is complete, but I need to ask @aaclan-ebi which submission 1 UUIDs to add.
The new submission is almost ready: the metadata is uploaded, but @aaclan-ebi is going to first delete some erroneous entries from an old submission. The raw fastq files are downloading from NCBI.
Put into secondary review when ready!
Picking this up for secondary review!
@idazucchi and I need to discuss this dataset, setting up a meeting for Monday
Hello! I think you did a good job on this very complex dataset. As we discussed, I have some suggestions for this project: the most important ones are those on the modelling of the organoids and cell suspensions; the rest are suggestions for additional fields you could fill in.
Suggestions:
- ipsc_induction_protocol_1 and collection_protocol_skin: do these apply to submission 1's specimens as well?
- bulk_RNA_library_preparation: you could fill in the preparation kit field and the read length field (the information is available from ArrayExpress)
- You could add the analysis files available from E-MTAB-7552
This is an attempt to check how many files/organoids/cell suspensions to expect in the finished spreadsheet
| AE project accession | # organoids | # cell suspensions | # fastqs | Notes |
| --- | --- | --- | --- | --- |
| E-MTAB-7552 | 13 | 12 | 42 | without mixed samples from early organoid development; 2 libraries sequenced twice |
| E-MTAB-8234 | 3 | 125 | 250 | I think it's 3 organoids based on the info in Supplementary table 1 (Fluidigm) |
| E-MTAB-8089 | 1008 | 696 | 1392 | under the assumption that each set of files corresponds to a different organoid |
| E-MTAB-8230 | | | | not eligible: post mortem donors; human samples were processed together with ape samples |
| E-MTAB-8231 | | 30 | 60 | post mortem donors; bulk data |
| E-MTAB-8228 | 12 | 12 | 24 | |
| Total | 1036 | 875 | 1768 | |
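As a sanity check, the totals add up: 13 + 3 + 1008 + 12 = 1036 organoids, 12 + 125 + 696 + 30 + 12 = 875 cell suspensions, and 42 + 250 + 1392 + 60 + 24 = 1768 fastqs.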
This is still an open question; I'm not sure how we want to address it.
@ami-day will finish this for next release.
Thanks @idazucchi I have made almost all of these changes. About the potentially duplicate biosamples: I decided to remove those biosamples from this submission update. I am now downloading the fastq files so I can fill the file names and upload to ingest.
upload area: 04dfcbed-8a97-4788-9de2-eacb1f64c7ef/
SRA file conversion is incredibly slow. No other options to get the raw data. Removing this dataset from release 14 and will tag it with release 15.
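For reference, the conversion presumably runs through sra-tools along these lines (the SRR accession below is a placeholder, not one of this dataset's runs):

```sh
# prefetch the .sra object first, then convert to per-read FASTQ files;
# fasterq-dump is the slow step described above
prefetch SRR0000000
fasterq-dump SRR0000000 --split-files --threads 8 -O fastqs/
```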
Ami having issues getting some data (from ENA?). FASTQs not available. Ami and Enrique to have a meeting to look into this.
This will probably be done today
Bug in prod. Needs to be fixed (the fixes are currently in staging). Then I need to re-upload this submission.
Fix is now in prod. Submission cannot be deleted. Alegria to try to delete again.
The submission with uuid 8e1cee01-4bd4-4c97-a201-a80fa8ec4b5f is now deleted. I'm not sure if my attempts the other day went through, but when I checked today the submission was already gone and there's already a new submission.
Alegria and I are looking into an issue in ingest.
Syncing the fastq files to ingest prod.
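For the record, a sync to the prod upload area noted above would look something like this with the AWS CLI (an assumption; the hca-util tool may be used instead):

```sh
# copy the downloaded fastqs into the submission's upload area
# (bucket name and area UUID taken from earlier in this thread)
aws s3 sync ./fastqs/ s3://org-humancellatlas-upload-prod/04dfcbed-8a97-4788-9de2-eacb1f64c7ef/
```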
Submitted.
Exported.
In attempting to re-export this as part of the r16 missing file descriptor issue, we had an exporter failure: the project's supplementary files failed to export to Terra because they are "DCP1" files:
These supplementary files are exported with every experiment, so all the experiments "failed" to export: the file metadata was in a different schema in DCP1, and the DCP2 metadata (file type, size and so on) has never been stored against these files. But the error is somewhat beside the point, because the original HCA S3 bucket no longer operates, so we would not have been able to retrieve the files to send to Terra anyway.
Either these supplementary files are important to add to this new dataset, and we should:
Or the files are not relevant to the new dataset, in which case we can:
@ami and @yusra-haider, let's talk about this tomorrow.
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/hipsci-ipsc-pipeline.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/Dissociation_protocol_130-092-628.pdf
s3://org-humancellatlas-upload-prod/fce97270-fce0-4744-8a4e-a93d95521852/CG00052_SingleCell3_ReagentKitv2UserGuide_RevE.pdf
Do we need the supplementary files or not? A decision needs to be made between @ami-day and @idazucchi, maybe for release 18 rather than 17, so no priority here.
I have the 3 supplementary files at hand, let me know if you need them (also sent them to @idazucchi).
Enrique retrieved the missing files from the DCP. We need to discuss how we want to proceed
@idazucchi @ami-day @yusra-haider @ESapenaVentura to have a discussion on this today
Hi @MightyAx. If these files remain associated with the initial dataset as it was originally submitted, but are not present in the 2nd (update) submission, that is OK. Can they be deleted and the submission re-exported?
At the moment the plan is to:
This should allow ingest to find the files and export correctly. This solution should allow future updates without running into this issue again.
Please, @yusra-haider, feel free to edit my explanation to make it more accurate/detailed.
@ami-day we can't really delete these files because they are linked to the protocols of the 1st submission, which are reused in the 2nd submission
I've been looking at the code, and it seems it won't be possible to update the cloudURL using REST API calls, as we have guards against that.
I'm planning on:
This is a one-time solution; we need to come up with a more permanent solution for the future.
I'm planning on testing this out on some other environment before doing these steps for this project on prod.
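For context, the one-off route would be something like a direct database edit that bypasses the REST guards. This is only a sketch of the idea; the collection and field names below are assumptions, not ingest-core's confirmed schema:

```sh
# connect to ingest-core's MongoDB and patch a file document's cloud URL
# directly (hypothetical collection/field names; verify before running)
mongo ingest --eval '
  db.file.updateOne(
    { "uuid.uuid": "<file-uuid>" },
    { $set: { "cloudUrl": "s3://<bucket>/<upload-area>/<filename>" } }
  )'
```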
Tried:
It's been exporting for a while but the status has not changed.
We suspect the dataset is exported; need to investigate.
Steps taken to fix this issue:
[x] remove supplementary files from project
Command used:
```sh
curl -X PUT -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
  "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles"
```
[x] set submission back to graph valid state
[x] redeploy state tracker to sync state
[x] delete all the exported files / metadata for this project on GCP, to give exporting a fresh start
[x] re-export choosing the option:
clicked Submit metadata and data to HCA
[x] link the supplementary files back to the project. Commands used:
```sh
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
  "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" \
  -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2b"
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
  "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" \
  -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2c"
curl -X PATCH -H "Authorization: Bearer $TOKEN" -H "Content-Type: text/uri-list" \
  "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles" \
  -d "https://api.ingest.archive.data.humancellatlas.org/files/5cdbdd82d96dad0008592f2d"
```
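A note on the HTTP semantics (my reading of standard Spring Data REST behaviour, not verified against ingest-core): a PUT to the association endpoint with an empty uri-list body replaces, i.e. clears, the project's supplementaryFiles links, while PATCH with a uri-list appends to them, which is why the removal step used PUT and the re-linking used PATCH. The result can be sanity-checked with a plain GET:

```sh
# list the supplementary files currently linked to the project
curl -H "Authorization: Bearer $TOKEN" \
  "https://api.ingest.archive.data.humancellatlas.org/projects/5cdbdd82d96dad0008592f2a/supplementaryFiles"
```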
This is a one-off fix, and this issue will arise again when we want to update and re-export something in this dataset.
The submission shows that it's still stuck in exporting, but we (@ESapenaVentura and I) checked the data on the GCP bucket, and the project has been successfully exported to the staging area.
An import request has been sent for this dataset for Release 17.
There are no error logs in the exporter or core for this submission, and I'm trying to figure out why the submission is stuck in the exporting state despite the successful export to the staging area.
This check might be the culprit in this case. For this dataset the number of processes in the submission being exported is 1818, whereas the number of processes in the GCP bucket is 1820, because of linkages with the previous submission.
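If so, the stuck state is consistent with a completeness check along these lines (a sketch of the assumed logic with a placeholder bucket path, not the exporter's actual code):

```sh
# compare processes known to the submission against process metadata files
# already present in the staging area; if completion requires the counts to
# match exactly, 1820 files vs 1818 expected will never satisfy it
EXPECTED=1818
ACTUAL=$(gsutil ls "gs://<staging-bucket>/<project-uuid>/metadata/process/" | wc -l)
if [ "$ACTUAL" -eq "$EXPECTED" ]; then
  echo "export complete"
else
  echo "count mismatch: $ACTUAL in bucket vs $EXPECTED in submission"
fi
```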
This still needs the state of the dataset to be set to 'exported'
Has been set to 'exported'.
Was the import form submitted?
Yes. An import form was submitted for Release 17 (on 31st May) specifying New raw data, New contributor-generated matrix, New metadata, Updated metadata
Needs an update
Dataset/group this task is for: ingest
New 2nd submission: https://docs.google.com/spreadsheets/d/1iF0JnTjk89UxZ6HJAu42CdzaXvyfleKPYsLaYjSZNBA/edit#gid=208686926
https://contribute.data.humancellatlas.org/submissions/detail?uuid=6baac52c-93e0-4278-96cf-cd869d501178&project=005d611a-14d5-4fbf-846e-571a1f874f70
This relates to the organoid dataset: https://data.humancellatlas.org/explore/projects/005d611a-14d5-4fbf-846e-571a1f874f70
Description of the task:
As I was collecting contributor matrices I realised that the data that we have has been published in the papers here:
https://www.nature.com/articles/s41586-019-1654-9?draft=collection
The Nature paper has multiple ArrayExpress accessions; I crossed out the ones that we probably don't want:
~~E-MTAB-8043~~ (single-cell ATAC-seq of chimpanzee organoids)
~~E-MTAB-8083~~ (single-cell ATAC-seq of bonobo organoids)
~~E-MTAB-8087~~ (single-cell ATAC-seq of macaque organoids)

The expression data are also available for exploration in scApeX via the link https://bioinf.eva.mpg.de/shiny/sample-apps/scApeX/
There is also the issue that the data we have has been accessioned twice: once by us and once as part of the ArrayExpress submission.
Acceptance criteria for the task: