ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/

Peng He - Human fetal Hindlimb #163

Closed: ESapenaVentura closed this 5 months ago

ESapenaVentura commented 3 years ago

Related to https://github.com/HumanCellAtlas/hca-data-wrangling/issues/388

mshadbolt commented 3 years ago

As this is already in AE we won't need to specifically broker it; I will remove the label and move this to finished.

ESapenaVentura commented 3 years ago

Re-opening as there is new data coming! This new data needs to be associated with the previously ingested data.

lauraclarke commented 3 years ago

This has an update associated with #271, but it might not make sense to do it before everything else happens. @ESapenaVentura, what do you think?

ESapenaVentura commented 3 years ago

@lauraclarke You are completely right.

Updates on this dataset:

I will keep posting updates here as this moves forward.

Wkt8 commented 3 years ago

Enrique handed this over to me.

Both the Visium data and the mouse data have been submitted.

There are more human biomaterials to submit to ENA (ERP119958; project PRJEB36736). Once we receive those accessions we can pass them to ArrayExpress, and the submitted human biomaterials can then also be fully submitted to ArrayExpress.

We pass on the accessions by emailing the ArrayExpress curator (Silvie Fexova).

Had a chat with Alegria; I will be submitting the additional biomaterials as a new submission to this project: https://contribute.data.humancellatlas.org/projects/detail?uuid=b176d756-62d8-4933-83a4-8b026380262f

Then will be following the Archiving SOP.

Wkt8 commented 3 years ago

Following Enrique's instructions here: https://docs.google.com/document/d/1_KDA0f9PBCG5LGtZ2H7u-Hf5fBtdGbC75dDlm2CrJBw/edit

However, there are a number of sequence files in the upload area that are not in the metadata spreadsheet provided:

5478STDY7717493_S1_L001_I1_001.fastq.gz
5478STDY7717493_S1_L001_R1_001.fastq.gz
5478STDY7717493_S1_L001_R2_001.fastq.gz
5478STDY7717494_S1_L001_I1_001.fastq.gz
5478STDY7717494_S1_L001_R1_001.fastq.gz
5478STDY7717494_S1_L001_R2_001.fastq.gz
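
For reference, a minimal sketch of the kind of cross-check that surfaces these orphan files; the paths are hypothetical, and the spreadsheet scan deliberately looks at every cell so it does not depend on the HCA template's header layout.

import os
import pandas as pd

# Hypothetical paths - substitute the real upload area and spreadsheet.
UPLOAD_DIR = "peng-hindlimb-update"
SPREADSHEET = "peng_hindlimb_metadata.xlsx"

# Fastq files actually present in the upload area.
uploaded = {f for f in os.listdir(UPLOAD_DIR) if f.endswith(".fastq.gz")}

# Every .fastq.gz value mentioned anywhere in the spreadsheet; scanning all
# cells avoids assumptions about the template's multi-row headers.
sheets = pd.read_excel(SPREADSHEET, sheet_name=None, header=None)
referenced = {
    value
    for frame in sheets.values()
    for value in frame.to_numpy().ravel()
    if isinstance(value, str) and value.endswith(".fastq.gz")
}

for orphan in sorted(uploaded - referenced):
    print("not in spreadsheet:", orphan)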

I created a new upload area (peng-hindlimb-update-only) and removed those six files. I have now created a new submission to the project and will be uploading files.

ESapenaVentura commented 3 years ago

I am taking over from today. Many thanks for looking into this, @Wkt8! I have emailed the contributor and asked about those 6 extra files.

aaclan-ebi commented 3 years ago

There are 2 blocking issues for this (linked to this ticket in ZH):

  1. Verify bam files can be converted back to fastq files ebi-ait/hca-ebi-dev-team#434
  2. Fix missing ontology terms in the metadata that we're submitting to the EBI archives ebi-ait/hca-ebi-dev-team#435

For no. 1, there was an issue using the fastq_utils binaries, so we decided to just submit the raw 10x fastq files directly to ENA. It would also be valuable to learn how to do this for other 10x datasets whose fastq files need to be uploaded to ENA.

Submitting 10x fastq files to ENA: I've chatted with Haseeb, and it looks like they haven't yet published the documentation for submitting 10x fastq files. He pointed me to an email thread which says there will be changes to the RUN.xml (ENA's entity for files):

READ_TYPE would be optional for runs with 1 ('single' assumed) or 2 ('paired' assumed) fastq files; otherwise it would be mandatory. It would only be supported for fastq files, by the way. The value restriction for READ_TYPE would be:

- sample_barcode
- cell_barcode
- umi_barcode
- feature_barcode
- single
- paired
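
As a quick pre-submission check, a minimal sketch of that draft rule in code (the values are quoted from the email thread and may change before ENA publish this):

# Allowed READ_TYPE values per the draft rule above.
ALLOWED_READ_TYPES = {
    "sample_barcode", "cell_barcode", "umi_barcode",
    "feature_barcode", "single", "paired",
}

def check_run_read_types(files):
    """files: list of (filename, read_type) pairs for one run; read_type may
    be None, which the draft only allows for runs of 1 or 2 fastq files."""
    for filename, read_type in files:
        if read_type is None:
            if len(files) > 2:
                raise ValueError(
                    f"{filename}: READ_TYPE is mandatory when a run "
                    "has more than 2 fastq files")
        elif read_type not in ALLOWED_READ_TYPES:
            raise ValueError(f"{filename}: unsupported READ_TYPE {read_type!r}")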

example XML:

<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET>
  <RUN alias="alias-001">
    <TITLE>title-001</TITLE>
    <EXPERIMENT_REF refname="exp-ref-001" />
    <DATA_BLOCK>
      <FILES>
        <FILE filename="WSSS_END8738160_S1_L001_I1_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>sample_barcode</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R1_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>paired</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R2_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>cell_barcode</READ_TYPE>
        </FILE>
        <FILE filename="WSSS_END8738160_S1_L001_R3_001.fastq.gz" filetype="fastq" checksum_method="MD5" checksum="d8ca81a13acdaa9dbe62cb10c67b2b8b">
          <READ_TYPE>paired</READ_TYPE>
        </FILE>
      </FILES>
    </DATA_BLOCK>
  </RUN>
</RUN_SET>

A question for a wrangler (@Wei?): What would be the READ_TYPE value that we should specify for each file/run?
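
While we wait on an answer, here's a minimal sketch of how a RUN.xml like the example above could be generated programmatically; the READ_TYPE assigned to each file is exactly the open question, so the mapping is left to the caller.

import hashlib
import os
import xml.etree.ElementTree as ET

def md5sum(path):
    """Stream the file so large fastq.gz files don't have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_run_xml(alias, title, experiment_ref, files):
    """files: list of (path, read_type) pairs; read_type per the draft enum."""
    run_set = ET.Element("RUN_SET")
    run = ET.SubElement(run_set, "RUN", alias=alias)
    ET.SubElement(run, "TITLE").text = title
    ET.SubElement(run, "EXPERIMENT_REF", refname=experiment_ref)
    files_el = ET.SubElement(ET.SubElement(run, "DATA_BLOCK"), "FILES")
    for path, read_type in files:
        file_el = ET.SubElement(
            files_el, "FILE",
            filename=os.path.basename(path), filetype="fastq",
            checksum_method="MD5", checksum=md5sum(path),
        )
        ET.SubElement(file_el, "READ_TYPE").text = read_type
    return ET.tostring(run_set, encoding="unicode")

The output string can be written to a file and submitted via the programmatic route described below.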

About submitting the data: we could either submit the programmatic way (submitting XML and staging files via FTP, https://ena-docs.readthedocs.io/en/latest/submit/fileprep/upload.html#uploading-files-using-command-line-ftp-client) or via webin-cli (https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html). But it looks like there's some uncertainty around webin-cli at the moment: Haseeb said that 10x fastqs are only supported with the new JSON manifest file format, but he's not sure how we can specify multiple fastq files. He's confirming how to do this and will get back to me as soon as he knows. Here's the example he's given me so far:

{
 "study": "ERP013289",
 "sample": "ERS980556",
 "name": "ena-EXPERIMENT-UNIVERSITY OF SOUTHERN CALIFORNIA-25-11-2015-01:02:26:880-65",
 "platform": "ILLUMINA",
 "instrument": "Illumina MiSeq",
 "insert_size": "390",
 "libraryName": "unspecified",
 "library-source": "GENOMIC",
 "library_selection": "PCR",
 "libraryStrategy": "AMPLICON",
 "fastq": {
   "value": "RIL_34.fastq.bz2",
   "attributes": {
     "read_type": ["single", "paired"]
   }
 }
}

TLDR:

Remaining steps to archive Peng's data:

  1. Submit the new samples using DSP (we need to resolve the ontology terms issue first: https://github.com/ebi-ait/hca-ebi-dev-team/issues/437 )
  2. Submit the sequencing experiments and sequencing runs directly to ENA using webin-cli, which needs a JSON manifest file. (I believe this will create the sequencing experiment and sequencing run entities automatically.) Haseeb will confirm within the day whether this works.

If webin-cli is not usable, the steps will be:

  1. Submit samples and sequencing experiments through DSP and get accessions (this also needs the ontology term issue to be fixed)
  2. Upload files via FTP to a Webin upload area and submit a "run" XML containing the new READ_TYPE file property (see the sketch below)
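
For step 2, a minimal FTP staging sketch following the linked ENA docs; the host name and credential handling here are assumptions, so verify them against the docs and your Webin account before relying on this.

import ftplib
import os

# Host name as given in the linked ENA file-upload docs (an assumption here).
ENA_FTP_HOST = "webin2.ebi.ac.uk"

def stage_files(paths, user, password):
    """Upload each file to the root of the Webin upload area."""
    with ftplib.FTP(ENA_FTP_HOST) as ftp:
        ftp.login(user=user, passwd=password)  # Webin-NNNNN credentials
        for path in paths:
            with open(path, "rb") as handle:
                ftp.storbinary(f"STOR {os.path.basename(path)}", handle)
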
aaclan-ebi commented 3 years ago

So, I've got 3 more days to work on this; noting here the targets for the remaining days. If we can finish each target earlier, that would be great.

Mon - Fix the ontology terms issue, confirm read_type in the sequencing run metadata, test the submission of 10x fastq files in the test environment
Tue - Get the 54 files uploaded and validated in ENA; if this is finished within the day, the accessions should be available by then
Wed - (breathing time :)) )

ESapenaVentura commented 3 years ago

Pending actions:

ESapenaVentura commented 3 years ago

New accessions have been written to the project.

Wkt8 commented 2 years ago

Some work to be done on the mouse data and Visium data submissions.

idazucchi commented 2 years ago

Amnon suggested scrapping the existing submissions and making a new, comprehensive one. This should be possible since the dataset is not published in the DCP, so we don't need to worry about preserving identifiers. The main advantage we'd get from this is ease of maintenance, since we would no longer be dealing with a DCP1 submission.

ofanobilbao commented 1 year ago

As agreed at the Ops Review Meeting yesterday, moving this dataset out of stalled and into the queue, just above the Kidney datasets.

arschat commented 6 months ago

Peng He asked us to make this project public. It was previously published, but Peng asked us to keep it private and it was therefore deleted. Some additional human donors were later added, as well as Visium and mouse data.

Current state:

In order to submit to HCA we need to:

arschat commented 5 months ago

Closing this ticket; wrangling continues in #1265 so we can use the current dataset wrangling template.