ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA and associated tasks, and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Update S-SUBS-4 with new data #211

Closed rays22 closed 3 years ago

rays22 commented 3 years ago

We need to update an existing project with new data and metadata.

Primary Wrangler: Ray

Secondary Wrangler:

Associated files:

Google Drive: https://drive.google.com/drive/folders/1wPFQvAVYX8_F5rlYlgFIxuI4w-Fmr8J-

Key Events

Please track the following key events:

  1. Track the dates the first spreadsheet was received and the final spreadsheet was sent by editing the ticket to include the date next to the event.
  2. Track spreadsheet iterations by placing asterisks next to the received-spreadsheet event.
  3. Track any metadata issues/tickets made for the dataset with a bulleted list of links under the received-spreadsheet event. Links should point to the ticket in the metadata repo.
mshadbolt commented 3 years ago

Hi @rays22, while reviewing ontology annotations for mappings I found that donor 313C is identified as having a history of asthma but is annotated as brain cancer in the spreadsheet, so perhaps this can be fixed when you make the other updates to this project.

rays22 commented 3 years ago

72 fastq.gz files were uploaded on 2021-01-04 (total size: 94.8 GiB). I am going to query the contributors about the metadata spreadsheet.

rays22 commented 3 years ago

It is an update submission related to https://github.com/HumanCellAtlas/hca-data-wrangling/issues/394 .

rays22 commented 3 years ago

There are 144 uploaded fastq files in the contributor upload area, but no HCA metadata spreadsheet yet.

rays22 commented 3 years ago

I have uploaded the spreadsheet S-SUBS4_update-2021-01-21.xlsx to the gdrive with the update. I have added collection_protocol.uuid and dissociation_protocol.uuid values for the existing protocols that need to be linked to processes.

rays22 commented 3 years ago
  1. I have uploaded the metadata spreadsheet to Ingest Production.
    • I did not choose the Update submission option, because that option would not add the new entities and it would generate missing entity UUID errors.
    • Metadata validation by Ingest: no metadata validation errors (other than missing data files).
  2. I have uploaded the new fastq data files to the appropriate Ingest upload area.
    • All the new data files passed validation. The status of the submission is Valid.
aaclan-ebi commented 3 years ago

Great news! Thanks @rays22 ! I'll start the archiving by eod.

Many thanks!

aaclan-ebi commented 3 years ago

Sorry, I will hold off and do it tomorrow. @rays22 said this is awaiting review from the secondary wrangler.

mshadbolt commented 3 years ago

@rays22 here are my comments from the review:

I wouldn't put a cell type as an organ_part; here we are looking for a part of an organ, such as ventricle for heart. I don't think it makes sense to have a cell as an anatomical part of an organ, so it is fine to leave the organ part out in these cases.

The plate labels shouldn't be there because this isn't plate-based sequencing.

This seems to be specified incorrectly: library_preparation_protocol.cdna_library_amplification_method.ontology_label

I am concerned there is no way of distinguishing between the gex/hashing/antibody libraries for the CITE-seq part of the experiment.

I don't think the sequence files have been linked correctly by the process ID field; shouldn't the process IDs be consistent with the library-prep IDs? If so, the lane indices will need to be changed to 1-4 per library prep.

I am kind of worried about how the browser will deal with having two library prep protocols, but this is something we can check before we export.

@aaclan-ebi, will having two library prep protocols linked to a single process between cell suspension and sequence file cause issues for the archiver?

aaclan-ebi commented 3 years ago

Good catch @mshadbolt, it'll raise an error in the archiver.

aaclan-ebi commented 3 years ago

Do we need to support that scenario?

If we allow 2 library prep protocols, how would the current mapping for library preparation protocol fields change?

Example field mapping for sequencing experiment:


library_selection:

Get the library_selection value from the following mapping, based on the value of library_preparation_protocol.primer:

mapping: { "poly-dT": "Oligo-dT", "random": "RANDOM" }

https://docs.google.com/document/d/1DvF0S9rL0IxnMdsi9P3qUVRt-IOT25KWqjwwIU7C9_s/edit#heading=h.mjzc4vaccum
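
That primer-to-library_selection mapping can be sketched as a small lookup table. This is illustrative only (the function name is hypothetical); the two mapping entries are the ones quoted above:

```python
# Hypothetical sketch of the primer -> library_selection mapping quoted above.
# Only "poly-dT" and "random" come from the comment; any other primer value
# is treated as unmapped.
PRIMER_TO_LIBRARY_SELECTION = {
    "poly-dT": "Oligo-dT",
    "random": "RANDOM",
}

def map_library_selection(primer: str) -> str:
    """Return the library_selection value for an HCA primer value."""
    try:
        return PRIMER_TO_LIBRARY_SELECTION[primer]
    except KeyError:
        raise ValueError(f"no library_selection mapping for primer {primer!r}")
```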

rays22 commented 3 years ago

@rays22 here are my comments from the review:

Thanks @mshadbolt .

  1. cell type as organ part
    • [x] Those were added by the contributor; I have deleted them.

I wouldn't put a cell type as an organ_part; here we are looking for a part of an organ, such as ventricle for heart. I don't think it makes sense to have a cell as an anatomical part of an organ, so it is fine to leave the organ part out in these cases.

  2. plate labels
    • [x] I have deleted them.

The plate labels shouldn't be there because this isn't plate-based sequencing.

  3. correct label for OBI:0000415
    • [x] Good catch. I have corrected it to PCR, which is the correct label for OBI:0000415.

This seems to be specified incorrectly: library_preparation_protocol.cdna_library_amplification_method.ontology_label

  4. gex/hashing/antibody libraries
    • [ ] I will come back to this issue and fix it after we have agreed on a solution. I think we can still go ahead with archiving and make updates later to address any metadata issues.

I am concerned there is no way of distinguishing between the gex/hashing/antibody libraries for the CITE-seq part of the experiment.

  5. linking sequence files
    • [x] Thanks @mshadbolt . I have changed the IDs for the library-prep-sequencing processes so that they are consistent with the library-prep IDs, and also changed the lane indices to 1-4.

I don't think the sequence files have been linked correctly by the process ID field; shouldn't the process IDs be consistent with the library-prep IDs? If so, the lane indices will need to be changed to 1-4 per library prep.

  6. two library prep protocols
    • [x] I have merged the two protocols into one.

I am kind of worried about how the browser will deal with having two library prep protocols, but this is something we can check before we export. @aaclan-ebi, will having two library prep protocols linked to a single process between cell suspension and sequence file cause issues for the archiver?

@aaclan-ebi I have uploaded the corrected update spreadsheet to the gdrive. How should I proceed? Shall I delete the submission from yesterday and upload the new version? What about the data files?

aaclan-ebi commented 3 years ago

@rays22 I'm afraid you'd have to delete the current submission, redo the spreadsheet upload, and sync the files again in a new submission :(

rays22 commented 3 years ago

@rays22 I'm afraid you'd have to delete the current submission, redo the spreadsheet upload, and sync the files again in a new submission :(

That is not a problem. Done. :) No validation errors. @aaclan-ebi : the submission is ready to be archived.

aaclan-ebi commented 3 years ago

I started the archiving last Friday; however, I found out that this dataset was not archived using the ingest archiver, because the existing study and project are using different aliases. Because of this mismatch, the ingest archiver is trying to create:

1 project, 1 study, 8 samples, 6 sequencing experiments and 6 sequencing runs

new dsp submission: https://submission.ebi.ac.uk/api/submissions/84ac0b87-cac9-4bac-8e4b-31fede9c561b

Manual steps are needed to fix this submission:

  1. Remove the project and study from the DSP submission
  2. Update the sequencing experiments to specify the accession instead of the DSP alias, to link them to the existing ENA study

I applied the manual steps yesterday and continued with the file upload; however, the bam conversion is failing with the following error:

An error occured:  Error: Command failed: /app/fastq/bin/fastq2bam -s 10xV3 -b 600b28d1c9762f5f0dea0bae.bam -1 SIGAB12_S53_L001_R1_001.fastq.gz -2 SIGAB12_S53_L002_R1_001.fastq.gz -3 SIGAB12_S54_L001_R1_001.fastq.gz -4 SIGAB12_S54_L002_R1_001.fastq.gz -5 SIGAB12_S55_L001_R1_001.fastq.gz -6 SIGAB12_S55_L002_R1_001.fastq.gz -7 SIGAB12_S56_L001_R1_001.fastq.gz -8 SIGAB12_S56_L002_R1_001.fastq.gz -9 SIGAB12_S53_L001_R2_001.fastq.gz -10 SIGAB12_S53_L002_R2_001.fastq.gz -11 SIGAB12_S54_L001_R2_001.fastq.gz -12 SIGAB12_S54_L002_R2_001.fastq.gz -13 SIGAB12_S55_L001_R2_001.fastq.gz -14 SIGAB12_S55_L002_R2_001.fastq.gz -15 SIGAB12_S56_L001_R2_001.fastq.gz -16 SIGAB12_S56_L002_R2_001.fastq.gz -17 SIGAB12_S53_L001_I1_001.fastq.gz -18 SIGAB12_S53_L002_I1_001.fastq.gz -19 SIGAB12_S54_L001_I1_001.fastq.gz -20 SIGAB12_S54_L002_I1_001.fastq.gz -21 SIGAB12_S55_L001_I1_001.fastq.gz -22 SIGAB12_S55_L002_I1_001.fastq.gz -23 SIGAB12_S56_L001_I1_001.fastq.gz -24 SIGAB12_S56_L002_I1_001.fastq.gz
/app/fastq/bin/fastq2bam: illegal option -- 5

It seems that it's creating 1 sequencing run per sequencing experiment, with 24 files in each run. This does not seem valid for bam conversion.
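
One way to see why a 24-file run looks wrong is to group the filenames by their Illumina lane token; each run should normally carry only one file per read index per lane. A minimal sketch (the helper name is hypothetical, and it assumes the standard <sample>_S<idx>_L<lane>_<read>_001.fastq.gz naming seen in the command above):

```python
# Hypothetical helper: group Illumina-style fastq names by lane token.
# Assumes the standard <sample>_S<idx>_L<lane>_<read>_001.fastq.gz naming.
from collections import defaultdict
import re

def group_files_by_lane(filenames):
    """Map a lane token (e.g. '001') to the files sequenced on that lane."""
    lanes = defaultdict(list)
    for name in filenames:
        match = re.search(r"_L(\d{3})_([RI]\d)_", name)
        if match is None:
            raise ValueError(f"unexpected fastq name: {name}")
        lanes[match.group(1)].append(name)
    return dict(lanes)

# With one R1/R2/I1 triple per lane this yields 3 files per lane; a single
# "run" containing 24 files would collapse many lanes and sample indices.
```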

I also noticed that the lane index fields in the sequencing run entities have null values, and I had to remove them from each sequencing run entity to make the metadata valid.

@rays22, do we need to set the lane indices to group files together into multiple sequencing runs per experiment?

rays22 commented 3 years ago

@rays22, do we need to set the lane indices to group files together into multiple sequencing runs per experiment?

Yes, we need to do that. I think the spreadsheet that I uploaded already had the lane indices.

lauraclarke commented 3 years ago

@rays22 just a quick check from an experimental design perspective: has this happened because one library was run on multiple machines? Why are there so many fastq files for each library?

rays22 commented 3 years ago

@alegria: I have uploaded the spreadsheet S-SUBS4_update-2021-01-26.xlsx to the gdrive with the lane indices.

mshadbolt commented 3 years ago

@rays22 you can't have more than one sequence file with the same read_index on the same lane in the same process ID; I think there are some mistakes here.

@lauraclarke We see this a lot with 10x because there are 4 sample indices per actual library, and these were sequenced across two lanes.

rays22 commented 3 years ago

@rays22 you can't have more than one sequence file with the same read_index on the same lane in the same process ID; I think there are some mistakes here.

I think I misunderstood Marion's secondary review regarding lane indices that should have values 1-4. I have reverted to the pre-secondary-review version, where lane 1 == L001 and lane 2 == L002. I have checked the Illumina documentation for their fastq file naming convention, and my interpretation of only two lane indices agrees with it. I have also reverted the process IDs to what they were before the secondary review, so that each sample index defines a separate library_prep-sequencing process ID. This way there are groups of 6 fastq files linked across two lanes (3 files per lane). @mshadbolt If my linking does not look correct, then please provide an alternative, with a spreadsheet example or a diagram of linking entities.

mshadbolt commented 3 years ago

Hi @rays22, we use the lane index in a slightly wrong way for 10x, because of the constraint I specified above.

There can't be more than one sequence_file with the same read_index on the same lane in the same process ID, so you will need to artificially add more lanes to account for that: if there are 2 lanes with 4 sets of R1, R2, I1, you actually need 8 lanes.
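
The artificial-lane scheme can be sketched as follows. This is a hypothetical helper (not ingest code) that hands out one artificial lane index per R1/R2/I1 file set, so 2 physical lanes x 4 sample indices yields 8 artificial lanes:

```python
# Hypothetical sketch (not ingest code): assign one artificial lane index per
# R1/R2/I1 file set so no two files share a read_index on the same lane
# within one process.
from itertools import count

def assign_artificial_lanes(file_sets):
    """file_sets: one {"R1": ..., "R2": ..., "I1": ...} dict per
    (sample index, physical lane) combination. Returns
    (artificial_lane, read_index, filename) tuples."""
    lane_numbers = count(1)
    rows = []
    for file_set in file_sets:
        lane = next(lane_numbers)  # a fresh artificial lane per file set
        for read_index, filename in file_set.items():
            rows.append((lane, read_index, filename))
    return rows

# 4 sample indices sequenced on 2 physical lanes -> 8 artificial lanes.
file_sets = [
    {"R1": f"S{s}_L00{l}_R1.fastq.gz",
     "R2": f"S{s}_L00{l}_R2.fastq.gz",
     "I1": f"S{s}_L00{l}_I1.fastq.gz"}
    for s in (53, 54, 55, 56) for l in (1, 2)
]
rows = assign_artificial_lanes(file_sets)
```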

rays22 commented 3 years ago

In the current version of the spreadsheet there is no more than one sequence file with the same read index in the same process ID on the same lane, because I used 4 different process IDs and there are 2 lanes (that is, 3 files per lane). I will try to add the artificial lane numbers as I understand your suggestion, with one process ID per library. Could you kindly check that version?

rays22 commented 3 years ago

I will try to add the artificial lane numbers as I understand your suggestion with one process ID per library. Could you kindly check that version?

@mshadbolt : Could you kindly check if the latest spreadsheet looks correct and agrees with what you are suggesting for lane indices and process IDs?

mshadbolt commented 3 years ago

I think I am quite confused by the structure of the experiment, can we have a call about this tomorrow?

rays22 commented 3 years ago

I think I am quite confused by the structure of the experiment, can we have a call about this tomorrow?

Yes.

mshadbolt commented 3 years ago

Hi, I have reviewed the spreadsheet again, as well as getting an understanding of the TotalSeq protocol, and I think the current spreadsheet is ok.

rays22 commented 3 years ago

Hi, I have reviewed the spreadsheet again, as well as getting an understanding of the TotalSeq protocol, and I think the current spreadsheet is ok.

Thanks @mshadbolt . @aaclan-ebi : the latest submission (Jan 26, 2021, 3:51:29 PM) is ready to be archived.

aaclan-ebi commented 3 years ago

Thanks @rays22 and @mshadbolt ! Yup, @yusra-haider and I are on it today.

yusra-haider commented 3 years ago

@rays22 @aaclan-ebi

here's the list of accessions for the dataset (also attached as a file: https://app.zenhub.com/files/261790554/5546f75d-e099-4aa2-bd11-f23b2167c0e8/download):

[{"type": "sample", "accession": "SAMEA8112913"}, {"type": "sample", "accession": "SAMEA8112912"}, {"type": "sample", "accession": "SAMEA8112914"}, {"type": "sample", "accession": "SAMEA8112915"}, {"type": "sample", "accession": "SAMEA8112911"}, {"type": "sample", "accession": "SAMEA8112910"}, {"type": "sample", "accession": "SAMEA8112908"}, {"type": "sample", "accession": "SAMEA8112909"}, {"type": "sequencingExperiment", "accession": "ERX5015140"}, {"type": "sequencingExperiment", "accession": "ERX5015142"}, {"type": "sequencingExperiment", "accession": "ERX5015143"}, {"type": "sequencingExperiment", "accession": "ERX5015141"}, {"type": "sequencingExperiment", "accession": "ERX5015139"}, {"type": "sequencingExperiment", "accession": "ERX5015138"}, {"type": "sequencingRun", "accession": "ERR5211090"}, {"type": "sequencingRun", "accession": "ERR5211078"}, {"type": "sequencingRun", "accession": "ERR5211064"}, {"type": "sequencingRun", "accession": "ERR5211075"}, {"type": "sequencingRun", "accession": "ERR5211082"}, {"type": "sequencingRun", "accession": "ERR5211077"}, {"type": "sequencingRun", "accession": "ERR5211088"}, {"type": "sequencingRun", "accession": "ERR5211080"}, {"type": "sequencingRun", "accession": "ERR5211084"}, {"type": "sequencingRun", "accession": "ERR5211089"}, {"type": "sequencingRun", "accession": "ERR5211086"}, {"type": "sequencingRun", "accession": "ERR5211076"}, {"type": "sequencingRun", "accession": "ERR5211071"}, {"type": "sequencingRun", "accession": "ERR5211069"}, {"type": "sequencingRun", "accession": "ERR5211081"}, {"type": "sequencingRun", "accession": "ERR5211068"}, {"type": "sequencingRun", "accession": "ERR5211063"}, {"type": "sequencingRun", "accession": "ERR5211073"}, {"type": "sequencingRun", "accession": "ERR5211067"}, {"type": "sequencingRun", "accession": "ERR5211083"}, {"type": "sequencingRun", "accession": "ERR5211065"}, {"type": "sequencingRun", "accession": "ERR5211085"}, {"type": "sequencingRun", "accession": 
"ERR5211092"}, {"type": "sequencingRun", "accession": "ERR5211099"}, {"type": "sequencingRun", "accession": "ERR5211093"}, {"type": "sequencingRun", "accession": "ERR5211070"}, {"type": "sequencingRun", "accession": "ERR5211072"}, {"type": "sequencingRun", "accession": "ERR5211094"}, {"type": "sequencingRun", "accession": "ERR5211097"}, {"type": "sequencingRun", "accession": "ERR5211095"}, {"type": "sequencingRun", "accession": "ERR5211096"}, {"type": "sequencingRun", "accession": "ERR5211098"}, {"type": "sequencingRun", "accession": "ERR5211104"}, {"type": "sequencingRun", "accession": "ERR5211074"}, {"type": "sequencingRun", "accession": "ERR5211079"}, {"type": "sequencingRun", "accession": "ERR5211100"}, {"type": "sequencingRun", "accession": "ERR5211101"}, {"type": "sequencingRun", "accession": "ERR5211102"}, {"type": "sequencingRun", "accession": "ERR5211091"}, {"type": "sequencingRun", "accession": "ERR5211105"}, {"type": "sequencingRun", "accession": "ERR5211106"}, {"type": "sequencingRun", "accession": "ERR5211107"}, {"type": "sequencingRun", "accession": "ERR5211066"}, {"type": "sequencingRun", "accession": "ERR5211087"}, {"type": "sequencingRun", "accession": "ERR5211108"}, {"type": "sequencingRun", "accession": "ERR5211103"}, {"type": "sequencingRun", "accession": "ERR5211062"}, {"type": "sequencingRun", "accession": "ERR5211061"}]
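
A quick way to summarise a list like this by entity type (illustrative only; the function name is hypothetical and the example data is a two-item excerpt of the list above):

```python
# Illustrative: summarise an accession list like the one above by type.
import json
from collections import Counter

def summarise_accessions(accessions_json: str) -> Counter:
    """Count entries per 'type' key in a JSON list of accession records."""
    return Counter(item["type"] for item in json.loads(accessions_json))

# Two-item excerpt of the posted list, for illustration only.
excerpt = json.dumps([
    {"type": "sample", "accession": "SAMEA8112913"},
    {"type": "sequencingRun", "accession": "ERR5211061"},
])
```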

rays22 commented 3 years ago

Thank you @yusra-haider and @aaclan-ebi . I have sent an email with the new archive accessions to Hugo (contributor).

rays22 commented 3 years ago

@jacobwindsor :

lauraclarke commented 3 years ago

@rays22 I would flag to Daniel Soithos that these files have been changed, rather than just added to, so he also removes the old files from the browser. The dcp-ops channel is probably the best place for this announcement.

rays22 commented 3 years ago

@rays22 I would flag to Daniel Soithos that these files have been changed, rather than just added to, so he also removes the old files from the browser. The dcp-ops channel is probably the best place for this announcement.

I have sent a slack message about it to Daniel.

rays22 commented 3 years ago

This dataset has been exported to the Terra staging area.

rays22 commented 3 years ago

@aaclan-ebi : We need to update the Ingest status (Archived) in Ingest for the latest submission (Jan 26, 2021, 3:51:29 PM). I think it should be Exported now.

aaclan-ebi commented 3 years ago

This submission was successfully exported last Monday, 15th Feb, but the state tracker had issues, so it wasn't able to set the status to Exported.

I've run the following command to manually set its state to Exported, reflecting the correct state.

TOKEN=''
curl -X PUT -H "Authorization: Bearer $TOKEN" https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/60103a81c9762f5f0dea1624/commitExportedEvent
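
For reference, an equivalent request can be built in Python. The endpoint URL is copied from the curl command above; the helper name and the use of urllib are illustrative:

```python
# Illustrative Python equivalent of the curl call above. The endpoint is taken
# from the comment; the helper name is hypothetical.
import urllib.request

INGEST_API = "https://api.ingest.archive.data.humancellatlas.org"

def build_commit_exported_request(token, envelope_id):
    """Build the PUT request that marks a submission envelope as Exported."""
    url = f"{INGEST_API}/submissionEnvelopes/{envelope_id}/commitExportedEvent"
    return urllib.request.Request(
        url, method="PUT", headers={"Authorization": f"Bearer {token}"})

# Sending it would be: urllib.request.urlopen(build_commit_exported_request(...))
```
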
ofanobilbao commented 3 years ago

@rays22 sorry, I am a bit confused about this one. I don't know if it should be brokered to SCEA or not. From the DCP perspective, I understand that this has already been archived and exported to the DCP. Is that correct? Should it then go to Finished on the wrangling board? Thanks!

rays22 commented 3 years ago

@ofanobilbao Yes, I am afraid this one is confusing, because there are several tickets (in different repos) related to this same project. The purpose of this particular ticket was to track the updating of this project with new data from the contributors. The update is complete, so I am closing this ticket. There is already a separate ticket to track the brokering of this project to the SCEA: https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/165