Closed Wkt8 closed 3 years ago
emailed specimen provider to see if there are any further details available about the mouse strain or collection protocol.
I have almost finished this. The trickiest part is the nanopore reads: they contain the 10X cell barcodes and UMIs, but these aren't necessarily at a predetermined offset from the end of the read, so it is not possible to fill in a standard integer for these fields in the library preparation protocol.
This is ready for secondary review. On top of the normal review, since this is a new method, I have a few queries I would like a second opinion on:
Essentially, in this experiment a 10X library preparation step was performed; some of the library was used for standard 10X sequencing on an Illumina machine, and part of the library was subjected to an additional Oxford Nanopore library preparation step. So the expected 10X cell barcodes and UMIs are the same as in standard 10X, but they are all located in the one long nanopore read.
For the libraries that were sequenced on nanopore, I chose to specify two library preparation protocols, the 10X then the nanopore protocol.
Since the nanopore reads incorporate the 10X/Illumina adapters, we know the cell barcode and UMI lengths but we don't know an exact offset. The authors used algorithms to detect the known barcodes/UMIs rather than relying on a known offset length. As these fields are required by the 'barcode' module, if we can't specify a number here, we also can't specify the lengths of the cell and UMI barcodes. I think the simplest option would be to not specify anything about the barcodes but to put the information in the description field. Another option would be to modify the barcode schema to allow something like 'unknown' or 'variable', which I am not sure is possible in an integer field, or to make this field optional in the schema. It would be great to have the secondary wrangler's opinion on this too.
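To illustrate why a single integer offset cannot describe these reads, here is a minimal sketch of offset-free barcode detection. It is not the authors' algorithm: it only does exact matching against a whitelist of known cell barcodes, assumes the UMI immediately follows the cell barcode in the read, and ignores sequencing errors and reverse-complemented reads. The function name and the example sequences are hypothetical.

```python
from typing import Optional, Set, Tuple

CELL_BARCODE_LENGTH = 16  # cell barcode length discussed in this ticket
UMI_LENGTH = 10           # UMI length discussed in this ticket


def find_barcode_and_umi(read: str, whitelist: Set[str]) -> Optional[Tuple[int, str, str]]:
    """Return (offset, cell_barcode, umi) for the first whitelisted barcode, else None.

    Simplification: exact matches only; the UMI is assumed to directly
    follow the cell barcode.
    """
    last_start = len(read) - (CELL_BARCODE_LENGTH + UMI_LENGTH)
    for start in range(last_start + 1):
        candidate = read[start:start + CELL_BARCODE_LENGTH]
        if candidate in whitelist:
            umi_start = start + CELL_BARCODE_LENGTH
            umi = read[umi_start:umi_start + UMI_LENGTH]
            return start, candidate, umi
    return None


# Hypothetical example: the barcode sits 5 bases into the read.
whitelist = {"AACGTGATCCTAGGCA"}
read = "TTTTT" + "AACGTGATCCTAGGCA" + "GGGGGCCCCC" + "AAAA"
print(find_barcode_and_umi(read, whitelist))
```

The offset returned differs from read to read, which is why it cannot be recorded as one fixed value in the protocol metadata.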
The current library_preparation_protocol schema only allows two values for primer, poly-dT or random. Here they used custom primers to specifically amplify strands from the 10X library. I would like to add the term 'custom' to the enum for this field as this is a minor schema change and other components have specified that enums do not really affect them.
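A small sketch of the effect of the proposed enum change, assuming the allowed primer values are exactly the two named above. The constant and function names are hypothetical and this is not the actual schema file.

```python
# Values currently allowed for the primer field, as described in this ticket.
ALLOWED_PRIMERS_CURRENT = ("poly-dT", "random")

# Proposed change: add 'custom' to the enum.
ALLOWED_PRIMERS_PROPOSED = ALLOWED_PRIMERS_CURRENT + ("custom",)


def primer_is_valid(value: str, allowed: tuple) -> bool:
    """Return True if the primer value is one of the allowed enum terms."""
    return value in allowed


print(primer_is_valid("custom", ALLOWED_PRIMERS_CURRENT))   # rejected today
print(primer_is_valid("custom", ALLOWED_PRIMERS_PROPOSED))  # accepted after the change
```

Because the change only adds a value, metadata that validates against the current enum would still validate against the proposed one.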
Also requested a new ontology term for the specific mouse strain https://github.com/EBISPOT/efo/issues/1023
I assume that right now we capture in the metadata whether the platform is Illumina or nanopore, but until this submission everything has always been Illumina.
We should flag this to both the pipelines team and the browser team before we export, as I suspect there is an assumption that the sequencing platform doesn't need to be exposed/queryable right now because it is always the same.
You can currently query by the sequencing instrument in the browser.
It is possible that by choosing to have two library prep protocols, it will confuse pipelines, because the nanopore might seem analysable if it is also tagged with 10X method, if they don't check the sequencing instrument or the other protocol.
Having two library protocols is also a graph shape that the browser/azul have probably not encountered.
Probably two votes for simplifying into one library protocol for illumina 10x and one for nanopore 10x.
This is a project where having a 'library prep' biomaterial would help model the experiment a bit more accurately.
Please double-check these values in the spreadsheet:
library_preparation_protocol.cell_barcode.barcode_length: 12 --> 16
library_preparation_protocol.umi_barcode.barcode_offset: 12 --> 16
library_preparation_protocol.umi_barcode.barcode_length: 8 --> 10
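The suggested values above describe a fixed layout for the Illumina barcode read: a 16-base cell barcode followed by a 10-base UMI starting at offset 16. A minimal sketch of how offset and length pick out each field, assuming a cell barcode offset of 0 (not stated in this ticket) and using hypothetical names and a made-up read:

```python
from typing import NamedTuple


class BarcodeSpec(NamedTuple):
    offset: int  # 0-based position of the first base of the barcode
    length: int  # number of bases in the barcode


CELL_BARCODE = BarcodeSpec(offset=0, length=16)  # offset 0 is an assumption
UMI_BARCODE = BarcodeSpec(offset=16, length=10)  # values suggested in the review


def extract(read: str, spec: BarcodeSpec) -> str:
    """Slice a barcode out of a read using a fixed offset and length."""
    end = spec.offset + spec.length
    if len(read) < end:
        raise ValueError(f"read of length {len(read)} is shorter than {end}")
    return read[spec.offset:end]


# Made-up 26-base barcode read: 16 bases of cell barcode, then 10 of UMI.
read_1 = "AACGTGATCCTAGGCA" + "GGGGGCCCCC"
print(extract(read_1, CELL_BARCODE))
print(extract(read_1, UMI_BARCODE))
```

With the converted defaults (length 12, offset 12, length 8) the same slicing would cut the cell barcode short and start the UMI inside it.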
The error messages in this case are not very helpful.
these errors are not unexpected given the novel library preparation method(s):
[ingest_graph_validator.actions.test_action] - ERROR: test [contains_umi_barcode_info.adoc] failed: non-empty result.
[ingest_graph_validator.actions.test_action] - ERROR: test [contains_cell_barcode_info.adoc] failed: non-empty result.
[ingest_graph_validator.actions.test_action] - ERROR: test [paired_end_2_files.adoc] failed: non-empty result.
this error doesn't make sense to me:
[ingest_graph_validator.actions.test_action] - ERROR: test [ensure_seq_libr_prep_protocols.adoc] failed: non-empty result.
false errors:
[ingest_graph_validator.actions.test_action] - ERROR: test [10x_has_more_than_2_files.adoc] failed: non-empty result.
Does the secondary wrangler agree with this method, or would it be better to have one protocol for nanopore libraries that specifies the 10X then nanopore library prep?
I think both options are valid choices, but downstream users might prefer a single combined protocol.
I would opt for making this field optional in the schema.
Adding the term 'custom' would make sense to me.
This is a project where having a 'library prep' biomaterial would help model the experiment a bit more accurately.
I agree.
Thanks for picking up on the barcode errors, I hadn't changed the default values that came across during conversion.
I merged the nanopore library prep into a single protocol that says, in effect, "as for the 10X one, then this afterwards", to simplify the structure of the experiment.
I deleted the barcode and UMI information to enable the project to be exported, and added text to the library prep protocol that specifies this information.
Since the nanopore library prep now incorporates the 10X step, I think having poly-dT as the primer is semi-accurate, so I used that for the 'PRIMER' field.
Since making the above changes I now have a valid submission in ingest https://contribute.data.humancellatlas.org/submissions/detail?id=606d3c33dd9aab1232d148e0&project=0d4b87ea-6e9e-4569-82e4-1343e0e3259f
I retested with the graph validator and I agree with what you outlined as false errors.
I think that this ensure_seq_libr_prep_protocols.adoc was failing because I had two library_prep protocols before but I have since changed it to one.
I think the paired_end test is actually backwards: it tests that if paired_end is false there should be 2 files...
The UMI test fails by default if there is no UMI information, and I didn't fill it in on the nanopore library prep.
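For the paired_end check discussed above, one plausible reading of the intended rule is that a paired-end library needs at least two read files, while a single-end library (such as a nanopore run) needs only one. The sketch below encodes that reading in Python. It is an assumption about the intent, not the graph validator's actual test, and the function and file names are hypothetical.

```python
from typing import Sequence


def paired_end_file_count_ok(paired_end: bool, read_files: Sequence[str]) -> bool:
    """Hypothetical rule: paired-end libraries need at least two read files."""
    if paired_end:
        return len(read_files) >= 2
    # Single-end libraries only need at least one read file.
    return len(read_files) >= 1


# A single-end library with one file passes under this reading.
print(paired_end_file_count_ok(False, ["reads_1.fastq.gz"]))
# A paired-end library with only one file does not.
print(paired_end_file_count_ok(True, ["reads_1.fastq.gz"]))
```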
If changes I made are all ok I will transfer files to the new valid submission and export when all are valid.
Thanks for the review!
requested cell types from authors Pascal and Kevin
If changes I made are all ok I will transfer files to the new valid submission and export when all are valid.
The changes you made look all right to me.
uploaded matrices to gs google bucket and added to sheet
syncing and validating sequence files.
Nanopore sequencing files failed fastq validation with various errors including:
* ERROR: Unable to determine quality encoding - unknown range [33,123]
* ERROR: Error in file /data/bd73c678-b22e-4a5f-b76c-5fe451161e4e/SRR9008429_1.fastq.gz: line 27828398: header2 wrong. The line should contain only '+' followed by a newline or read name (header1).
I am currently investigating
I downloaded one fastq file that was erroring (SRR9008432_1.fastq.gz) and ran fastq_info (fastq_utils 0.24.1 from conda) locally on my machine, and it passed validation. I am going to assume that the error has something to do with how the validation is run by the upload service, or with it using an older version of the tool that was forked a long time ago. I will proceed to download and manually validate the nanopore sequence files, then request that the files be manually set to valid once complete.
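For reference, the two reported errors relate to basic FASTQ record rules: the third line of each record must be '+' alone or '+' followed by the read name, and quality characters must fall in an accepted ASCII range. The sketch below is a simplified, stand-alone checker, not a reimplementation of fastq_info, and it does not explain why the upload service failed. It assumes strict 4-line records and accepts any printable quality character from ASCII 33 to 126; under a +33 offset, ASCII 123 corresponds to Phred 90.

```python
import io
from typing import TextIO

MIN_QUAL, MAX_QUAL = 33, 126  # printable ASCII range accepted by this sketch


def check_fastq(handle: TextIO) -> int:
    """Validate strict 4-line FASTQ records and return the record count.

    Raises ValueError on the first malformed record.
    """
    count = 0
    while True:
        header = handle.readline()
        if not header:  # end of file
            return count
        header = header.rstrip("\n")
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        first_line = count * 4 + 1  # 1-based line number of this record's header
        if not header.startswith("@"):
            raise ValueError(f"line {first_line}: header must start with '@'")
        if plus != "+" and plus != "+" + header[1:]:
            raise ValueError(f"line {first_line + 2}: expected '+' or '+' plus read name")
        if len(seq) != len(qual):
            raise ValueError(f"line {first_line + 3}: quality length differs from sequence")
        if any(not MIN_QUAL <= ord(c) <= MAX_QUAL for c in qual):
            raise ValueError(f"line {first_line + 3}: quality character out of range")
        count += 1


# A record whose quality string includes '{' (ASCII 123) passes this check.
print(check_fastq(io.StringIO("@r1\nACGT\n+\nII{I\n")))
```

For gzipped files, the same function can be given a handle from `gzip.open(path, "rt")`.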
I have completed manually validating all nanopore sequencing files and they all passed validation with the latest version of fastq_utils.
@MightyAx are you able to please set all fastq files for this submission to valid manually so we can export? https://contribute.data.humancellatlas.org/submissions/detail?uuid=bd73c678-b22e-4a5f-b76c-5fe451161e4e&project=0d4b87ea-6e9e-4569-82e4-1343e0e3259f
thanks!
reminder to myself that I also actually shouldn't submit until ontology is released and I add the more specific mouse strain ontology
I will find out how to do that
We used to use this script to kick validating files to valid -> https://github.com/HumanCellAtlas/hca-data-wrangling/blob/master/src/set-to-valid.py
Maybe I can have a go at editing it to set the invalid ones to valid?
actually maybe I am just being impatient, would it be better to update the validation software and do this properly?
Yeah, that script only works on validating files, not invalid files. With invalid files there's no option to send the validEvent. There is an option to send the draftEvent, but ingest throws a 401 when I try it, even when using the Bearer token from Ingest UI.
Perhaps I'm not authorising correctly, but I think it's just not an allowed transition from invalid to draft.
I'm going to ask other devs for comment.
Alegria set the files to valid and I updated the mouse strain ontology term. I have hit export and am waiting for the export to complete.
This is an HCA Publication which may require metadata schema evolution for capturing Nanopore metadata.
Primary Wrangler: Marion
Secondary Wrangler: Ray
Associated files:
Google Drive: https://drive.google.com/open?id=1uAeGdrbvx644b9QCyBcickw_8h8WKifT&authuser=mshadbolt%40ebi.ac.uk&usp=drive_fs
Project already in ingest here: https://contribute.data.humancellatlas.org/projects/detail?uuid=0d4b87ea-6e9e-4569-82e4-1343e0e3259f
Published study links
Paper: https://www.nature.com/articles/s41467-020-17800-6
Accessioned data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130708
Key Events