ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

GSE130708: High throughput, error corrected Nanopore single cell transcriptome sequencing (HCA Publication) #227

Closed Wkt8 closed 3 years ago

Wkt8 commented 3 years ago

This is an HCA Publication that may require metadata schema evolution to capture Nanopore metadata.

Primary Wrangler: Marion

Secondary Wrangler: Ray

Associated files:

Google Drive: https://drive.google.com/open?id=1uAeGdrbvx644b9QCyBcickw_8h8WKifT&authuser=mshadbolt%40ebi.ac.uk&usp=drive_fs

Project already in ingest here: https://contribute.data.humancellatlas.org/projects/detail?uuid=0d4b87ea-6e9e-4569-82e4-1343e0e3259f

Published study links

Paper: https://www.nature.com/articles/s41467-020-17800-6

Accessioned data: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130708

Key Events

mshadbolt commented 3 years ago

emailed specimen provider to see if there are any further details available about the mouse strain or collection protocol.

mshadbolt commented 3 years ago

I have almost finished this. The trickiest part is the nanopore reads: they contain the 10X cell barcodes and UMIs, but these aren't necessarily at a predetermined offset from the end of the read, so it is not possible to fill in a standard integer for these fields in the library preparation protocol.

mshadbolt commented 3 years ago

This is ready for secondary review. On top of the normal review, since this is a new method, I have a few queries I would like a second opinion on:

1. The linking of the library preparation protocols

Essentially, what happened in this experiment was that a 10X library preparation step was performed, part of the library was used for standard 10X sequencing on an Illumina machine, and part was subjected to an additional Oxford Nanopore library preparation step. So there are the same expected 10X cell barcodes and UMIs as in 10X, but they are all located in the one long nanopore read.

For the libraries that were sequenced on nanopore, I chose to specify two library preparation protocols, the 10X then the nanopore protocol.

2. barcode/umi offset

Since the nanopore reads incorporate the 10X/Illumina adapters, we know the cell barcode and UMI lengths but we don't know an exact offset. The authors used algorithms to detect the known barcodes/UMIs rather than relying on a known offset length. As these fields are required by the 'barcode' module, if we can't specify a number here, we also can't specify the length of the cell and UMI barcodes. I think the simplest option would be to not specify anything about the barcodes but put the information in the description field. Another option would be to modify the barcode schema to allow something like 'unknown' or 'variable', which I am not sure is possible in an integer field, or to make this field optional in the schema. It would be great to have the secondary wrangler's opinion on this too.
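To illustrate why a single integer offset cannot describe these reads, here is a minimal Python sketch that locates a barcode and UMI relative to a known adapter sequence instead of a fixed position. It is not the authors' algorithm: the function name, the adapter string, and the barcode/UMI lengths are hypothetical values chosen for the example, and exact string matching is a simplification (real nanopore reads contain errors, so practical tools would need error-tolerant matching).

```python
def find_barcode_and_umi(read, adapter, barcode_len, umi_len):
    """Locate the barcode and UMI that directly follow a known adapter.

    Hypothetical helper for illustration only. Returns a tuple
    (barcode_offset, barcode, umi), or None if the adapter is absent
    or the read ends before the UMI does.
    """
    anchor = read.find(adapter)  # exact match only; an assumption of this sketch
    if anchor == -1:
        return None
    start = anchor + len(adapter)
    end = start + barcode_len + umi_len
    if end > len(read):
        return None
    barcode = read[start:start + barcode_len]
    umi = read[start + barcode_len:end]
    return start, barcode, umi


# Two toy reads carrying the same barcode and UMI at different offsets.
ADAPTER = "ACGTAC"  # made-up adapter sequence
read_a = "TTTT" + ADAPTER + "GGGG" + "CCC" + "AAAA"
read_b = "T" + ADAPTER + "GGGG" + "CCC" + "AA"

hit_a = find_barcode_and_umi(read_a, ADAPTER, barcode_len=4, umi_len=3)
hit_b = find_barcode_and_umi(read_b, ADAPTER, barcode_len=4, umi_len=3)
```

In this toy case the barcode and UMI are identical in both reads but the offset differs per read, which is the situation a required integer offset field cannot represent.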

3. Primer specification

The current library_preparation_protocol schema only allows two values for primer, poly-dT or random. Here they used custom primers to specifically amplify strands from the 10X library. I would like to add the term 'custom' to the enum for this field as this is a minor schema change and other components have specified that enums do not really affect them.

mshadbolt commented 3 years ago

Also requested a new ontology term for the specific mouse strain https://github.com/EBISPOT/efo/issues/1023

lauraclarke commented 3 years ago

I assume that right now we capture in the metadata whether the platform is Illumina or nanopore, but until this submission everything has always been Illumina.

We should flag this to both the pipelines team and the browser team before we export, as I suspect there is an assumption that the sequencing platform doesn't need to be exposed/queryable right now because it is always the same.

mshadbolt commented 3 years ago

You can currently query by the sequencing instrument in the browser.

Choosing to have two library prep protocols could confuse pipelines: the nanopore data might seem analysable because it is also tagged with the 10X method, if they don't check the sequencing instrument or the other protocol.

Having two library protocols is also a graph shape that the browser/azul have probably not encountered.

Probably two votes for simplifying into one library protocol for illumina 10x and one for nanopore 10x.

This is a project where having a 'library prep' biomaterial would help model the experiment a bit more accurately.

rays22 commented 3 years ago

Secondary review

Please double-check these values in the spreadsheet:

ingest graph validator errors

The error messages in this case are not very helpful.

Q & A

  1. The linking of the library preparation protocols

Does the secondary wrangler agree with this method, or would it be better to have one protocol for nanopore libraries that specifies the 10X then nanopore library prep?

I think both options are valid choices, but downstream users might prefer a single combined protocol.

  2. barcode/umi offset

I would opt for making this field optional in the schema.

  3. Primer specification

Adding the term 'custom' would make sense to me.

Comment

This is a project where having a 'library prep' biomaterial would help model the experiment a bit more accurately.

I agree.

mshadbolt commented 3 years ago

Thanks for picking up on the barcode errors, I hadn't changed the default values that came across during conversion.

I merged the nanopore library prep protocols into one that describes both steps (the 10X prep, then the nanopore prep after it) to simplify the structure of the experiment.

I deleted the barcode and UMI information to enable the project to be exported, and added text to the library prep protocol that specifies this information.

Since the nanopore library prep now incorporates the 10X step, I think having poly-dT as the primer is semi-accurate, so I used that for the 'PRIMER' field.

Since making the above changes I now have a valid submission in ingest https://contribute.data.humancellatlas.org/submissions/detail?id=606d3c33dd9aab1232d148e0&project=0d4b87ea-6e9e-4569-82e4-1343e0e3259f

I retested with the graph validator and I agree with what you outlined as false errors.

I think that this ensure_seq_libr_prep_protocols.adoc was failing because I had two library_prep protocols before but I have since changed it to one.

I think the paired_end test is actually backwards: it tests that if paired_end is false there should be 2 files...

The umi test fails by default if there is no UMI information, and I didn't fill it in on the nanopore library prep.

If changes I made are all ok I will transfer files to the new valid submission and export when all are valid.

Thanks for the review!

mshadbolt commented 3 years ago

requested cell types from authors Pascal and Kevin

rays22 commented 3 years ago

If changes I made are all ok I will transfer files to the new valid submission and export when all are valid.

The changes you made look all right to me.

mshadbolt commented 3 years ago

uploaded matrices to the Google Storage (gs) bucket and added them to the sheet

syncing and validating sequence files.

mshadbolt commented 3 years ago

Nanopore sequencing files failed fastq validation with various errors including:

* ERROR: Unable to determine quality encoding - unknown range [33,123]
* ERROR: Error in file /data/bd73c678-b22e-4a5f-b76c-5fe451161e4e/SRR9008429_1.fastq.gz: line 27828398: header2 wrong. The line should contain only '+' followed by a newline or read name (header1).

I am currently investigating
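For reference, the two rules those messages refer to can be sketched as a per-record check in Python. This is a simplified illustration and not how fastq_utils is implemented: the function name is hypothetical, and it assumes Phred+33 qualities stored as printable ASCII (codes 33 to 126).

```python
def check_fastq_record(header1, sequence, header2, quality):
    """Return a list of problems found in one four-line FASTQ record.

    Hypothetical helper for illustration; assumes Phred+33 qualities
    encoded as printable ASCII characters (codes 33 to 126).
    """
    problems = []
    if not header1.startswith("@"):
        problems.append("header1 must start with '@'")
    name = header1[1:]
    # The separator line is '+' alone, or '+' followed by the same name as header1.
    if header2 != "+" and header2 != "+" + name:
        problems.append("header2 must be '+' alone or '+' followed by the read name")
    if len(sequence) != len(quality):
        problems.append("sequence and quality lengths differ")
    codes = [ord(char) for char in quality]
    if codes and (min(codes) < 33 or max(codes) > 126):
        problems.append("quality characters outside ASCII 33-126")
    return problems


# A record whose quality string spans ASCII 33 ('!') to 123 ('{').
wide_range = check_fastq_record("@read1", "ACGT", "+", "!!{{")
# A record whose separator line names a different read.
bad_separator = check_fastq_record("@read1", "ACGT", "+read2", "!!!!")
```

Under these assumptions a quality range of [33,123] is accepted, so a validator that rejects it is presumably working from a narrower set of expected ranges; that is a guess from the error text, not something confirmed in this thread.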

mshadbolt commented 3 years ago

I downloaded one fastq file that was erroring (SRR9008432_1.fastq.gz), ran fastq_info (fastq_utils 0.24.1 from conda) locally on my machine, and it passed validation. I am going to assume that the error has something to do with how the validation is run by the upload service, or that it is using an older version of the tool that was forked a long time ago. I will proceed to download and manually validate the nanopore sequence files, then request that the files are manually set to valid once complete.
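A minimal sketch of how the manual validation could be batched in Python, assuming fastq_info is on the PATH, is called with a single file path for these single-file runs, and exits with a non-zero status when a file fails. Those are assumptions of this sketch rather than details confirmed in the thread; the function names are hypothetical, and the command is a parameter so another validator can be substituted.

```python
import subprocess
import sys


def validate_files(paths, command=("fastq_info",)):
    """Run the validator once per file.

    Maps each path to True when the command exits with status 0.
    Assumes the validator signals failure with a non-zero exit status.
    """
    results = {}
    for path in paths:
        completed = subprocess.run(
            [*command, path], capture_output=True, text=True
        )
        results[path] = completed.returncode == 0
    return results


def main(argv=None):
    """Validate the files named on the command line and report failures."""
    paths = sys.argv[1:] if argv is None else argv
    results = validate_files(paths)
    for path, passed in results.items():
        print(f"{path}\t{'valid' if passed else 'INVALID'}")
    return 0 if all(results.values()) else 1
```

Calling `main(["SRR9008432_1.fastq.gz"])` would print one line per file and return 1 if any file failed, which makes it easy to record which files passed before asking for them to be set to valid.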

mshadbolt commented 3 years ago

I have completed manually validating all nanopore sequencing files and they all passed validation with the latest version of fastq_utils.

@MightyAx are you able to please set all fastq files for this submission to valid manually so we can export? https://contribute.data.humancellatlas.org/submissions/detail?uuid=bd73c678-b22e-4a5f-b76c-5fe451161e4e&project=0d4b87ea-6e9e-4569-82e4-1343e0e3259f

thanks!

mshadbolt commented 3 years ago

reminder to myself that I also shouldn't submit until the ontology is released and I add the more specific mouse strain ontology term

MightyAx commented 3 years ago

I will find out how to do that

mshadbolt commented 3 years ago

We used to use this script to kick validating files to valid -> https://github.com/HumanCellAtlas/hca-data-wrangling/blob/master/src/set-to-valid.py

Maybe I can have a go at editing it to set the invalid ones to valid?

mshadbolt commented 3 years ago

actually maybe I am just being impatient, would it be better to update the validation software and do this properly?

MightyAx commented 3 years ago

Yeah, that script only works on validating files, not invalid files. With invalid files there's no option to send the validEvent.

There is an option to send the draftEvent, but ingest throws a 401 when I try it, even when using the Bearer token from Ingest UI. Perhaps I'm not authorising correctly but I think it's just not an allowed transition from invalid to draft.

I'm going to ask other devs for comment.

mshadbolt commented 3 years ago

Alegria set the files to valid and I updated the mouse strain ontology term. I have hit export and am waiting for the export to complete.