Closed ESapenaVentura closed 5 months ago
As this is already in AE we won't need to specifically broker it; I will remove the label and move to finished.
Re-opening as there is new data coming! This new data needs to be associated with the previously ingested data.
This has an update associated with #271, but it might not make sense to do it before everything else happens. @ESapenaVentura what do you think?
@lauraclarke You are completely right.
Updates on this dataset:
I will keep posting updates here as this is moving forward
Enrique handed this over to me.
Visium data has been submitted. Mouse data has been submitted.
There are more human biomaterials to submit to ENA (study ERP119958, project PRJEB36736). Once we receive those accessions, we can pass them to ArrayExpress so that the submitted human biomaterials can also be fully submitted there.
We pass the accessions along by emailing the ArrayExpress curator (Silvie Fexova).
Had a chat with Alegria and will be submitting the additional biomaterials as a submission to this project: https://contribute.data.humancellatlas.org/projects/detail?uuid=b176d756-62d8-4933-83a4-8b026380262f
Then will be following the Archiving SOP.
Following Enrique's Instructions here: https://docs.google.com/document/d/1_KDA0f9PBCG5LGtZ2H7u-Hf5fBtdGbC75dDlm2CrJBw/edit
However, there are a number of sequence files in the upload area that are not in the metadata spreadsheet provided:
5478STDY7717493_S1_L001_I1_001.fastq.gz
5478STDY7717493_S1_L001_R1_001.fastq.gz
5478STDY7717493_S1_L001_R2_001.fastq.gz
5478STDY7717494_S1_L001_I1_001.fastq.gz
5478STDY7717494_S1_L001_R1_001.fastq.gz
5478STDY7717494_S1_L001_R2_001.fastq.gz
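As a sanity check, this kind of mismatch can be caught by diffing the upload-area listing against the file names in the spreadsheet. A minimal sketch, assuming both are available as plain lists of names (the file names below are illustrative, reusing two of the six above plus one hypothetical extra):

```python
# Hedged sketch: flag files present in the upload area but absent from the
# metadata spreadsheet. File names are illustrative, not a real listing.
def extra_files(upload_area_files, spreadsheet_files):
    """Return upload-area files that the spreadsheet does not mention."""
    return sorted(set(upload_area_files) - set(spreadsheet_files))

uploaded = [
    "5478STDY7717493_S1_L001_R1_001.fastq.gz",
    "5478STDY7717493_S1_L001_R2_001.fastq.gz",
    "5478STDY7717999_S1_L001_R1_001.fastq.gz",  # hypothetical extra file
]
listed = [
    "5478STDY7717493_S1_L001_R1_001.fastq.gz",
    "5478STDY7717493_S1_L001_R2_001.fastq.gz",
]
print(extra_files(uploaded, listed))
# → ['5478STDY7717999_S1_L001_R1_001.fastq.gz']
```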
I created a new upload area, peng-hindlimb-update-only, and removed those six files. I have now created a new submission to the project and will be uploading files.
I am taking over from today. Many thanks for looking into this, @Wkt8! I have emailed the contributor to ask about those six extra files.
There are 2 blocking issues for this (linked to this ticket in ZenHub):
For no. 1, there was an issue using the fastq_utils bin files, so we decided to just submit the raw 10x fastq files directly to ENA. It would also be valuable to learn how to do this for other 10x datasets whose fastq files need to be uploaded to ENA.
Submitting 10x fastq files to ENA: I've chatted with Haseeb, and it looks like they haven't published the documentation for submitting 10x fastq files yet. He pointed me to an email thread which says there will be changes to the RUN.xml (ENA's entity for files):
READ_TYPE would be optional for 1 ('single' assumed) or 2 ('paired' assumed) fastq files; otherwise it would be mandatory. It would only be supported for fastq files. The value restriction for READ_TYPE would be:
- sample_barcode
- cell_barcode
- umi_barcode
- feature_barcode
- single
- paired

Example XML:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET>
  <RUN alias="alias-001">
    <TITLE>title-001</TITLE>
    <EXPERIMENT_REF refname="exp-ref-001" />
    <DATA_BLOCK>
      <FILES>
        <FILE filename="WSSS_END8738160_S1_L001_I1_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>sample_barcode</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R1_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>paired</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R2_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>cell_barcode</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R3_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>paired</READ_TYPE>
        </FILE>
      </FILES>
    </DATA_BLOCK>
  </RUN>
</RUN_SET>
```
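For a dataset with many runs, a RUN_SET like the example could be generated rather than hand-written. A minimal sketch using the Python standard library; note the suffix-to-READ_TYPE mapping is simply lifted from the example above (I1 → sample_barcode, R1/R3 → paired, R2 → cell_barcode) and is an assumption, not confirmed ENA guidance:

```python
# Hedged sketch: emit a RUN_SET XML like the example from a list of files.
# READ_TYPE_BY_SUFFIX mirrors the example XML only; the correct mapping
# per file is still an open question for the wranglers.
import xml.etree.ElementTree as ET

READ_TYPE_BY_SUFFIX = {
    "I1": "sample_barcode",
    "R1": "paired",
    "R2": "cell_barcode",
    "R3": "paired",
}

def build_run_set(alias, title, experiment_ref, files):
    """files: list of (filename, md5) tuples; read suffix parsed from name."""
    run_set = ET.Element("RUN_SET")
    run = ET.SubElement(run_set, "RUN", alias=alias)
    ET.SubElement(run, "TITLE").text = title
    ET.SubElement(run, "EXPERIMENT_REF", refname=experiment_ref)
    files_el = ET.SubElement(ET.SubElement(run, "DATA_BLOCK"), "FILES")
    for filename, md5 in files:
        suffix = filename.split("_")[-2]  # e.g. "R1" from ..._R1_001.fastq.gz
        file_el = ET.SubElement(files_el, "FILE", filename=filename,
                                filetype="fastq", checksum_method="MD5",
                                checksum=md5)
        ET.SubElement(file_el, "READ_TYPE").text = READ_TYPE_BY_SUFFIX[suffix]
    return ET.tostring(run_set, encoding="unicode")

print(build_run_set("alias-001", "title-001", "exp-ref-001",
                    [("WSSS_END8738160_S1_L001_R1_001.fastq.gz",
                      "d8ca81a13acdaa9dbe62cb10c67b2b8b")]))
```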
A question for a wrangler (@Wei?): what READ_TYPE value should we specify for each file/run?
About submitting the data: we could either submit programmatically (submitting XML and staging the files over FTP, https://ena-docs.readthedocs.io/en/latest/submit/fileprep/upload.html#uploading-files-using-command-line-ftp-client) or via webin-cli (https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html). However, there is some uncertainty around webin-cli at the moment: Haseeb said that 10x fastqs are only supported with the new JSON manifest file format, but he is not sure how we can specify multiple fastq files. He is confirming how to do this and will get back to me as soon as he knows. Here is the example he has given me so far:
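For the programmatic route, the submission itself is a multipart POST of the XMLs to ENA's submission service after the fastq files are staged over FTP. A hedged sketch that only assembles the curl invocation rather than running it; the test-service URL is shown (swap wwwdev for www.ebi.ac.uk in production), and the Webin account and file names are placeholders:

```python
# Hedged sketch: assemble (but do not execute) the curl command for ENA's
# programmatic submission service. Test endpoint shown; credentials and
# XML file names are placeholders.
ENA_TEST_SUBMIT = "https://wwwdev.ebi.ac.uk/ena/submit/drop-box/submit/"

def submit_command(webin_user, submission_xml="submission.xml",
                   run_xml="run.xml"):
    """Return the curl argv for a SUBMISSION + RUN XML submission."""
    return ["curl", "-u", webin_user,
            "-F", f"SUBMISSION=@{submission_xml}",
            "-F", f"RUN=@{run_xml}",
            ENA_TEST_SUBMIT]

print(" ".join(submit_command("Webin-XXXXX")))
```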
```json
{
  "study": "ERP013289",
  "sample": "ERS980556",
  "name": "ena-EXPERIMENT-UNIVERSITY OF SOUTHERN CALIFORNIA-25-11-2015-01:02:26:880-65",
  "platform": "ILLUMINA",
  "instrument": "Illumina MiSeq",
  "insert_size": "390",
  "libraryName": "unspecified",
  "library-source": "GENOMIC",
  "library_selection": "PCR",
  "libraryStrategy": "AMPLICON",
  "fastq": {
    "value": "RIL_34.fastq.bz2",
    "attributes": {
      "read_type": ["single", "paired"]
    }
  }
}
```
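If webin-cli with the JSON manifest does turn out to be usable, the manifest can at least be generated programmatically. A minimal sketch that writes out Haseeb's single-file example as given (the shape for multiple fastq files is exactly the unanswered question, so it is deliberately not guessed at here; the experiment name is a placeholder):

```python
# Hedged sketch: write the example webin-cli JSON manifest to disk.
# How multiple fastq files are expressed is unconfirmed, so this keeps
# the single-file shape from Haseeb's example verbatim.
import json

manifest = {
    "study": "ERP013289",
    "sample": "ERS980556",
    "name": "example-experiment-name",  # placeholder
    "platform": "ILLUMINA",
    "instrument": "Illumina MiSeq",
    "insert_size": "390",
    "libraryName": "unspecified",
    "library-source": "GENOMIC",
    "library_selection": "PCR",
    "libraryStrategy": "AMPLICON",
    "fastq": {
        "value": "RIL_34.fastq.bz2",
        "attributes": {"read_type": ["single", "paired"]},
    },
}

with open("manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)
```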
TLDR:
Remaining Steps to archive Peng’s data:
If webin-cli is not usable, the steps will be:
So, I've got 3 more days to work on this; noting here the targets for the remaining days (if we can finish each target earlier, that would be great).
Mon - Fix the ontology terms issue, confirm read_type in the sequencing run metadata, test the submission of 10x fastq files in the test environment
Tue - Get the 54 files uploaded and validated in ENA; if this is finished within the day, the accessions should be available by then
Wed - (breathing time :))
Pending actions:
New accessions have been written to the project.
Some work remains to be done on the mouse and Visium data submissions.
Amnon suggested scrapping the existing submissions and making a new, comprehensive one. This should be possible since the dataset is not published in the DCP, so we don't need to worry about preserving identifiers. The main advantage we'd get from this is ease of maintenance, since we would no longer be dealing with a DCP1 submission.
As agreed at the Ops Review Meeting yesterday, moving this dataset out of stalled and into the queue, just above the Kidney datasets.
Peng He asked us to make this project public. It was previously published, but Peng asked us to keep it private, and it was therefore deleted. Some additional human donors were later added, as well as Visium and mouse data.
Current state:
9b7857fa-ffcc-408f-ab64-817cfae41efc
ee4af2bf-bb84-489c-8e84-a54fab986844
7915490a-9364-48df-afc5-2f48034720fd
3593127e-2cbd-41f0-8cb0-4cc8c4f0968c
In order to submit to HCA we need to:
Close this ticket and continue wrangling in #1265, so that we use the current dataset wrangling template.
Related to https://github.com/HumanCellAtlas/hca-data-wrangling/issues/388