hi @rays22, while reviewing ontology annotations for mappings I found that donor 313C is identified as having a history of asthma but is annotated as brain cancer in the spreadsheet, so perhaps this can be updated when you make the other updates to this project.
72 fastq.gz files were uploaded on 2021-01-04 (total size: 94.8 GiB). I am going to query contributors about the metadata spreadsheet.
This is an update submission related to https://github.com/HumanCellAtlas/hca-data-wrangling/issues/394.
There are 144 uploaded fastq files in the contributor upload area, but no HCA metadata spreadsheet yet.
I have uploaded the spreadsheet S-SUBS4_update-2021-01-21.xlsx to the gdrive with the update. I have added collection_protocol.uuid and dissociation_protocol.uuid values for the existing protocols that need to be linked to processes.
I did not use the Update submission option, because that option would not add the new entities and it would generate missing entity UUID errors. The submission state is Valid.
Great news! Thanks @rays22! I'll start the archiving by EOD.
Many thanks!
Sorry, I will hold off and do it tomorrow. @rays22 said this is awaiting review from the secondary wrangler.
@rays22 here are my comments from the review:
I wouldn't put a cell type as an organ_part; here we are looking for a part of an organ, such as ventricle for heart. I don't think it makes sense to have a cell as an anatomical part of an organ. I think it is fine to leave the organ part out in these cases.
The plate labels shouldn't be there, because this isn't plate-based sequencing.
This seems to be specified incorrectly: library_preparation_protocol.cdna_library_amplification_method.ontology_label
I am concerned there is no way of distinguishing between the gex/hashing/antibody libraries for the CITE-seq part of the experiment.
I don't think the sequence files have been linked correctly by the process ID field; shouldn't the process IDs be consistent with the library-prep IDs? If so, the lane indices will need to be changed to 1-4 per library prep.
I am kind of worried about how the browser will deal with having two library prep protocols, but this is something we can check before we export.
@aaclan-ebi will having two library prep protocols linked to a single process between cell suspension and sequence file cause issues for the archiver?
Good catch @mshadbolt, it'll raise an error in the archiver.
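For illustration, a hypothetical sketch of the constraint involved (the names and structure here are assumptions, not the actual ingest archiver code): a DSP sequencing experiment is derived from a single library preparation protocol, so a process linking two of them cannot be converted unambiguously.

```python
# Hypothetical sketch, not real ingest archiver code: a process that links
# more than one library preparation protocol cannot be mapped to a single
# DSP sequencing experiment, so it is rejected.
def check_library_prep_protocols(process: dict) -> None:
    lib_preps = [p for p in process.get("protocols", [])
                 if p.get("type") == "library_preparation_protocol"]
    if len(lib_preps) != 1:
        raise ValueError(
            f"process {process.get('id')}: expected exactly 1 "
            f"library_preparation_protocol, found {len(lib_preps)}"
        )
```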
Do we need to support that scenario?
If we allow 2 lib prep protocols, how would the current mapping for library preparation protocol fields change?
Example field mapping for sequencing experiment:
library_selection: get the library_selection value from the following mapping, based on the value of library_preparation_protocol.primer:
{ "poly-dT": "Oligo-dT", "random": "RANDOM" }
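As a rough illustration, a minimal sketch of that lookup (a plain dict, mirroring the mapping quoted above; the fallback value is my assumption, not confirmed archiver behaviour):

```python
# Sketch of the primer -> library_selection mapping quoted above.
# The "unspecified" fallback is an assumption for illustration only.
PRIMER_TO_LIBRARY_SELECTION = {
    "poly-dT": "Oligo-dT",
    "random": "RANDOM",
}

def library_selection(library_preparation_protocol: dict) -> str:
    primer = library_preparation_protocol.get("primer")
    return PRIMER_TO_LIBRARY_SELECTION.get(primer, "unspecified")

print(library_selection({"primer": "poly-dT"}))  # Oligo-dT
```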
Thanks @mshadbolt.
@aaclan-ebi I have uploaded the corrected update spreadsheet to the gdrive. How should I proceed? Shall I delete the submission from yesterday and upload the new version? What about the data files?
@rays22 I'm afraid you'd have to delete the current submission, redo the spreadsheet upload, and sync the files again in a new submission :(
That is not a problem. Done. :) No validation errors. @aaclan-ebi: the submission is ready to be archived.
I started the archiving last Friday; however, I found out that this dataset was not archived using the ingest archiver, because the existing study and project are using different aliases. Because of this mismatch, the ingest archiver is trying to create:
1 project, 1 study, 8 samples, 6 sequencing experiments and 6 sequencing runs
New DSP submission: https://submission.ebi.ac.uk/api/submissions/84ac0b87-cac9-4bac-8e4b-31fede9c561b
Manual steps are needed to fix this submission.
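For context, a minimal illustration of why mismatched aliases lead to re-creation (assumed behaviour inferred from the comment above, not the actual archiver logic):

```python
# Assumed behaviour, for illustration only: existing DSP entities are looked
# up by alias, so an entity archived under a different alias looks new and
# gets re-created instead of updated.
def find_or_create(existing_by_alias: dict, alias: str, payload: dict) -> dict:
    if alias in existing_by_alias:
        return existing_by_alias[alias]    # alias matches: reuse/update
    return {"alias": alias, **payload}     # alias differs: a new entity is created

existing = {"project-old-alias": {"alias": "project-old-alias"}}
print(find_or_create(existing, "project-new-alias", {}))  # treated as new
```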
I applied the manual steps yesterday and continued with the file upload; however, this is failing the BAM conversion with the following error:
An error occured: Error: Command failed: /app/fastq/bin/fastq2bam -s 10xV3 -b 600b28d1c9762f5f0dea0bae.bam -1 SIGAB12_S53_L001_R1_001.fastq.gz -2 SIGAB12_S53_L002_R1_001.fastq.gz -3 SIGAB12_S54_L001_R1_001.fastq.gz -4 SIGAB12_S54_L002_R1_001.fastq.gz -5 SIGAB12_S55_L001_R1_001.fastq.gz -6 SIGAB12_S55_L002_R1_001.fastq.gz -7 SIGAB12_S56_L001_R1_001.fastq.gz -8 SIGAB12_S56_L002_R1_001.fastq.gz -9 SIGAB12_S53_L001_R2_001.fastq.gz -10 SIGAB12_S53_L002_R2_001.fastq.gz -11 SIGAB12_S54_L001_R2_001.fastq.gz -12 SIGAB12_S54_L002_R2_001.fastq.gz -13 SIGAB12_S55_L001_R2_001.fastq.gz -14 SIGAB12_S55_L002_R2_001.fastq.gz -15 SIGAB12_S56_L001_R2_001.fastq.gz -16 SIGAB12_S56_L002_R2_001.fastq.gz -17 SIGAB12_S53_L001_I1_001.fastq.gz -18 SIGAB12_S53_L002_I1_001.fastq.gz -19 SIGAB12_S54_L001_I1_001.fastq.gz -20 SIGAB12_S54_L002_I1_001.fastq.gz -21 SIGAB12_S55_L001_I1_001.fastq.gz -22 SIGAB12_S55_L002_I1_001.fastq.gz -23 SIGAB12_S56_L001_I1_001.fastq.gz -24 SIGAB12_S56_L002_I1_001.fastq.gz
/app/fastq/bin/fastq2bam: illegal option -- 5
It seems that it's creating 1 sequencing run per sequencing experiment, with 24 files each; fastq2bam rejects that many inputs (illegal option -- 5), so this is not valid for BAM conversion.
I also noticed that the lane index fields in the sequencing run entities have null values, and I had to remove them from each sequencing run entity to make the metadata valid.
@rays22, do we need to set the lane indices to group files together into multiple sequencing runs per experiment?
Yes, we need to do that. I think the spreadsheet that I uploaded already has the lane indices.
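To make the intended grouping concrete, a hedged sketch (the file-name pattern comes from the error log above; the grouping key is my reading of this thread, not archiver code): with lane indices set, the 24 files per experiment split into 8 runs of 3 files (R1, R2, I1), keyed by sample index and lane.

```python
# Hedged sketch: split the 24 fastq files from the error above into runs of
# 3 files (R1, R2, I1) keyed by sample index + physical lane, instead of one
# 24-file run per experiment.
from collections import defaultdict
import re

files = [f"SIGAB12_S{s}_L{l:03d}_{r}_001.fastq.gz"
         for s in (53, 54, 55, 56) for l in (1, 2) for r in ("R1", "R2", "I1")]

runs = defaultdict(list)
for name in files:
    sample, lane = re.search(r"_S(\d+)_L(\d{3})_", name).groups()
    runs[(sample, lane)].append(name)

for key in sorted(runs):
    print(key, runs[key])  # 8 runs, 3 files each
```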
@rays22 just a quick check from an experimental design perspective: has this happened because one library was run on multiple machines? Why are there so many fastq files for each library?
@alegria: I have uploaded the spreadsheet S-SUBS4_update-2021-01-26.xlsx to the gdrive with the lane indices.
@rays22 you can't have more than one sequence file with the same read_index on the same lane in the same process ID; I think there are some mistakes here.
@lauraclarke We see this a lot with 10x, because there are 4 sample indices per actual library, and these were sequenced across two lanes.
I think I misunderstood Marion's secondary review regarding the lane indices that should have values 1-4. I have reverted to the pre-secondary-review version, where lane 1 == L001 and lane 2 == L002. I have checked the Illumina documentation for their fastq file naming convention, and my interpretation of only two lane indices agrees with it. I have also reverted the process IDs to what they were before the secondary review, so that each sample index defines a separate library_prep-sequencing process ID. This way there are groups of 6 fastq files linked by the process ID across two lanes (3 files per lane). @mshadbolt: if my linking does not look correct, then please provide an alternative with a spreadsheet example or a diagram of linking entities.
hi @rays22, we use the lane index in a slightly wrong way for 10x, because of the constraint I specified above.
There can't be more than one sequence_file with the same read_index on the same lane in the same process ID, so you will need to artificially add more lanes to account for that: if there are 2 lanes with 4 sets of R1, R2, I1, you actually need 8 lanes.
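A minimal sketch of that "artificial lane" numbering, assuming Illumina-style file names and the S53-S56 sample-index sets from the error log above (both are assumptions for illustration):

```python
# Sketch: map (sample-index set, physical lane) -> artificial lane 1..8, so
# that no two files with the same read_index share a lane in one process.
import re

SAMPLE_INDICES = ["S53", "S54", "S55", "S56"]  # 4 sample-index sets per library
N_PHYSICAL_LANES = 2                           # L001 and L002

def artificial_lane(filename: str) -> int:
    m = re.search(r"_(S\d+)_L(\d{3})_", filename)
    set_pos = SAMPLE_INDICES.index(m.group(1))    # 0..3
    physical = int(m.group(2))                    # 1..2
    return set_pos * N_PHYSICAL_LANES + physical  # 1..8

print(artificial_lane("SIGAB12_S53_L001_R1_001.fastq.gz"))  # 1
print(artificial_lane("SIGAB12_S56_L002_I1_001.fastq.gz"))  # 8
```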
In the current version of the spreadsheet there is no more than one sequence file with the same read index in the same process ID on the same lane, because I used 4 different process IDs and there are 2 lanes (that is, 3 files per lane). I will try to add the artificial lane numbers as I understand your suggestion, with one process ID per library. Could you kindly check that version?
@mshadbolt: Could you kindly check whether the latest spreadsheet looks correct and agrees with what you are suggesting for lane indices and process IDs?
I think I am quite confused by the structure of the experiment, can we have a call about this tomorrow?
Yes.
Hi, I have reviewed the spreadsheet again, as well as getting an understanding of the TotalSeq protocol, and I think the current spreadsheet is OK.
Thanks @mshadbolt. @aaclan-ebi: the latest submission (Jan 26, 2021, 3:51:29 PM) is ready to be archived.
Thanks @rays22 and @mshadbolt ! Yup, @yusra-haider and I are on it today.
@rays22 @aaclan-ebi here's the list of accessions for the dataset (also attached as a file: https://app.zenhub.com/files/261790554/5546f75d-e099-4aa2-bd11-f23b2167c0e8/download):
[{"type": "sample", "accession": "SAMEA8112913"}, {"type": "sample", "accession": "SAMEA8112912"}, {"type": "sample", "accession": "SAMEA8112914"}, {"type": "sample", "accession": "SAMEA8112915"}, {"type": "sample", "accession": "SAMEA8112911"}, {"type": "sample", "accession": "SAMEA8112910"}, {"type": "sample", "accession": "SAMEA8112908"}, {"type": "sample", "accession": "SAMEA8112909"}, {"type": "sequencingExperiment", "accession": "ERX5015140"}, {"type": "sequencingExperiment", "accession": "ERX5015142"}, {"type": "sequencingExperiment", "accession": "ERX5015143"}, {"type": "sequencingExperiment", "accession": "ERX5015141"}, {"type": "sequencingExperiment", "accession": "ERX5015139"}, {"type": "sequencingExperiment", "accession": "ERX5015138"}, {"type": "sequencingRun", "accession": "ERR5211090"}, {"type": "sequencingRun", "accession": "ERR5211078"}, {"type": "sequencingRun", "accession": "ERR5211064"}, {"type": "sequencingRun", "accession": "ERR5211075"}, {"type": "sequencingRun", "accession": "ERR5211082"}, {"type": "sequencingRun", "accession": "ERR5211077"}, {"type": "sequencingRun", "accession": "ERR5211088"}, {"type": "sequencingRun", "accession": "ERR5211080"}, {"type": "sequencingRun", "accession": "ERR5211084"}, {"type": "sequencingRun", "accession": "ERR5211089"}, {"type": "sequencingRun", "accession": "ERR5211086"}, {"type": "sequencingRun", "accession": "ERR5211076"}, {"type": "sequencingRun", "accession": "ERR5211071"}, {"type": "sequencingRun", "accession": "ERR5211069"}, {"type": "sequencingRun", "accession": "ERR5211081"}, {"type": "sequencingRun", "accession": "ERR5211068"}, {"type": "sequencingRun", "accession": "ERR5211063"}, {"type": "sequencingRun", "accession": "ERR5211073"}, {"type": "sequencingRun", "accession": "ERR5211067"}, {"type": "sequencingRun", "accession": "ERR5211083"}, {"type": "sequencingRun", "accession": "ERR5211065"}, {"type": "sequencingRun", "accession": "ERR5211085"}, {"type": "sequencingRun", "accession": "ERR5211092"}, {"type": "sequencingRun", "accession": "ERR5211099"}, {"type": "sequencingRun", "accession": "ERR5211093"}, {"type": "sequencingRun", "accession": "ERR5211070"}, {"type": "sequencingRun", "accession": "ERR5211072"}, {"type": "sequencingRun", "accession": "ERR5211094"}, {"type": "sequencingRun", "accession": "ERR5211097"}, {"type": "sequencingRun", "accession": "ERR5211095"}, {"type": "sequencingRun", "accession": "ERR5211096"}, {"type": "sequencingRun", "accession": "ERR5211098"}, {"type": "sequencingRun", "accession": "ERR5211104"}, {"type": "sequencingRun", "accession": "ERR5211074"}, {"type": "sequencingRun", "accession": "ERR5211079"}, {"type": "sequencingRun", "accession": "ERR5211100"}, {"type": "sequencingRun", "accession": "ERR5211101"}, {"type": "sequencingRun", "accession": "ERR5211102"}, {"type": "sequencingRun", "accession": "ERR5211091"}, {"type": "sequencingRun", "accession": "ERR5211105"}, {"type": "sequencingRun", "accession": "ERR5211106"}, {"type": "sequencingRun", "accession": "ERR5211107"}, {"type": "sequencingRun", "accession": "ERR5211066"}, {"type": "sequencingRun", "accession": "ERR5211087"}, {"type": "sequencingRun", "accession": "ERR5211108"}, {"type": "sequencingRun", "accession": "ERR5211103"}, {"type": "sequencingRun", "accession": "ERR5211062"}, {"type": "sequencingRun", "accession": "ERR5211061"}]
Thank you @yusra-haider and @aaclan-ebi . I have sent an email with the new archive accessions to Hugo (contributor).
Archived
./links/
/prod/metadata/project/
@jacobwindsor: I have uploaded 20210204_S_SUBS4_4x_annotated_donors_UMI_counts_HCA.h5ad to the appropriate google bucket, replacing 455b46e6-d8ea-4611-861e-de720a562ada-AdultHemOrgans_455b46e6-d8ea-4611-861e-de720a562ada.20200821_SUBS4_PairedOrgans_expression_annotated.h5ad.
@rays22 I would flag to Daniel Soithos that these files have been changed rather than just added to, so that he also removes the old file from the browser. The dcp-ops channel is probably the best place for this announcement.
I have sent a Slack message about it to Daniel.
This dataset has been exported to the Terra staging area.
@aaclan-ebi: We need to update the status in Ingest for the latest submission (Jan 26, 2021, 3:51:29 PM); it is currently Archived, but I think it should be Exported now.
This submission was successfully exported last Monday, 15th Feb, but the state tracker had issues, so it wasn't able to set the status to Exported.
I've run the following command to manually set its state to Exported, to reflect the correct state:
TOKEN=''
curl -X PUT -H "Authorization: Bearer $TOKEN" https://api.ingest.archive.data.humancellatlas.org/submissionEnvelopes/60103a81c9762f5f0dea1624/commitExportedEvent
@rays22 sorry, I am a bit confused about this one. I don't know whether it should be brokered to SCEA or not. From the DCP perspective, I understand that this has already been archived and exported to the DCP. Is that correct? Should it then go to Finished on the wrangling board? Thanks!
@ofanobilbao Yes, I am afraid this one is confusing, because there are several tickets (in different repos) related to this same project. The purpose of this particular ticket was to track the updating of this project with new data from the contributors. The update is complete, so I am closing this ticket. There is already a separate ticket to track the brokering of this project to the SCEA: https://github.com/ebi-ait/hca-ebi-wrangler-central/issues/165
We need to update an existing project with new data and metadata.
Primary Wrangler: Ray
Secondary Wrangler:
Associated files:
Google Drive: https://drive.google.com/drive/folders/1wPFQvAVYX8_F5rlYlgFIxuI4w-Fmr8J-
Key Events
Please track the below as well as the key events: