HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Determine how to capture 10X V(D)J libraries in our metadata #1171

Open mshadbolt opened 5 years ago

mshadbolt commented 5 years ago

Description

As a data wrangler wrangling a project with 10X 5' V(D)J data, I need to determine the correct way of defining the experiment in our spreadsheet and whether any additional fields or ontology terms are needed. First, I need a better understanding of what V(D)J sequencing is and how it differs from 10X 5' sequencing. I heard that @ambrosejcarr and perhaps @TimothyTickle had started thinking about this, so please share any opinions you have. I am not sure where this lies on the analysis pipelines roadmap, or what information would need to be captured beyond what we capture for a standard 10X 5' sequencing library.

Acceptance Criteria

e.g.:

mshadbolt commented 5 years ago

Overall understanding of 10X VDJ sequencing so far

Input: dissociated cells as cell suspensions

Output: >=2 library preparations from a single cell suspension.

After amplification, the cell-barcoded cDNA from the cells in the cell suspension is split between the library preparations. One library is a standard 10X 5' style gene expression (gex) library. The other libraries are the result of a PCR enrichment step using primers. There is an enrichment kit that can be used to enrich for either B cells or T cells, and you can enrich for both from the same cell suspension input. This would result in three different libraries: 5 prime tag-based expression, paired-end T cell VDJ sequences, and paired-end B cell VDJ sequences. The VDJ libraries may either be sequenced with the same read lengths as the gex libraries or as paired-end 150bp reads. The cell barcode and UMI layout is the same across libraries, and both occur at the start of read 1.
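As an illustration of the read 1 layout described above, here is a minimal sketch that splits a read 1 sequence into cell barcode, UMI, and the remaining bases. The lengths (16 bp barcode, 10 bp UMI) are my assumption for this chemistry and are not stated in this ticket, and `split_read1` is a hypothetical helper, not part of any pipeline.

```python
from typing import Tuple

# Assumed lengths for the 10X 5' chemistry; confirm against the kit user guide.
BARCODE_LEN = 16
UMI_LEN = 10


def split_read1(read1: str) -> Tuple[str, str, str]:
    """Return (cell_barcode, umi, remainder) from a read 1 sequence."""
    prefix_len = BARCODE_LEN + UMI_LEN
    if len(read1) < prefix_len:
        raise ValueError("read 1 is shorter than barcode + UMI")
    barcode = read1[:BARCODE_LEN]
    umi = read1[BARCODE_LEN:prefix_len]
    remainder = read1[prefix_len:]  # empty for short reads, cDNA for 150bp reads
    return barcode, umi, remainder
```

Because the layout is the same for the gex and VDJ libraries, the same split would apply to read 1 from any of the three libraries; only the length of the remainder would differ.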

Reads from each VDJ enrichment library are assembled or aligned to known VDJ sequences.

The current dataset that I am wrangling (Kylie James Colon Immune cells) has the three libraries as described above.
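To make the shape of the experiment concrete, here is a rough sketch of one cell suspension yielding a gex library plus optional enriched VDJ libraries. The class and function names are hypothetical and exist only for illustration; they are not fields of the current metadata schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Library:
    library_type: str  # "gex", "vdj_TCR" or "vdj_Ig"
    enriched: bool     # True when a PCR enrichment step was applied


@dataclass
class CellSuspension:
    suspension_id: str
    libraries: List[Library] = field(default_factory=list)


def build_vdj_experiment(suspension_id: str,
                         enrich_tcr: bool = True,
                         enrich_ig: bool = True) -> CellSuspension:
    """Model one suspension whose amplified cDNA is split between libraries."""
    suspension = CellSuspension(suspension_id)
    # The gex library is always present and is not enriched.
    suspension.libraries.append(Library("gex", enriched=False))
    if enrich_tcr:
        suspension.libraries.append(Library("vdj_TCR", enriched=True))
    if enrich_ig:
        suspension.libraries.append(Library("vdj_Ig", enriched=True))
    return suspension
```

With both enrichments switched on, this gives the three-library case from the dataset above; with only one, it gives the minimum of two libraries.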

How to fill relevant metadata fields

Library preparation protocol

Should there be a single library prep protocol for all libraries or a separate one for each library type? The way I have modelled it in my experiment is to have a separate protocol for each library, e.g.

10x_v2_5p_gex_library_prep_protocol
10x_v2_5p_vdj_TCR_library_prep_protocol
10x_v2_5p_vdj_Ig_library_prep_protocol

This would enable the user to see which library each sequence file was derived from. Open to suggestions on whether this is a good idea or not.
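As a sketch of how separate protocol IDs would let a user trace files back to libraries, the snippet below groups sequence files by the library type embedded in the protocol ID. It assumes the naming pattern from my example IDs above; the file names in the usage note are made up, and `group_files_by_library` is a hypothetical helper.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumes IDs shaped like "10x_v2_5p_vdj_TCR_library_prep_protocol".
PROTOCOL_ID_PATTERN = re.compile(
    r"^(?P<chemistry>10x_v\d+_5p)_(?P<library_type>gex|vdj_TCR|vdj_Ig)"
    r"_library_prep_protocol$"
)


def library_type_of(protocol_id: str) -> str:
    """Extract the library type from a protocol ID, or raise ValueError."""
    match = PROTOCOL_ID_PATTERN.match(protocol_id)
    if match is None:
        raise ValueError(f"unrecognised protocol ID: {protocol_id}")
    return match.group("library_type")


def group_files_by_library(file_to_protocol: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sequence file names by the library type of their linked protocol."""
    grouped = defaultdict(list)
    for file_name, protocol_id in file_to_protocol.items():
        grouped[library_type_of(protocol_id)].append(file_name)
    return dict(grouped)
```

For example, passing `{"sampleA_gex_R1.fastq.gz": "10x_v2_5p_gex_library_prep_protocol", "sampleA_tcr_R1.fastq.gz": "10x_v2_5p_vdj_TCR_library_prep_protocol"}` would return one list of files per library type. With a single shared protocol this grouping would not be possible from the protocol link alone.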

Potential new metadata fields

Should we add a field to the library_preparation_protocol module that would capture the 'enrichment' primers? Does the analysis team need any specific fields added that would need to be captured to enable analysis? (Perhaps too early to tell if pipelines for this data are a long way off)

Potential new Ontology terms

To populate library_preparation_protocol.library_construction_method.ontology we will need a new V(D)J-specific term.

Should the term be something like 10X 5' v2 V(D)J sequencing?
Should this sit underneath 10X 5' v2 sequencing?
Should the gex libraries also get this term, or only the VDJ-specific libraries?

It seems that so far there are v1 and v1.1 versions of V(D)J libraries, but I'm not sure what the difference is, so I'm not sure whether we need both ontology terms.

To populate library_preparation_protocol.input_nucleic_acid_molecule.ontology, should we request a more specific term to indicate polyA RNA from TCR (T cells) or Ig (B cells)?
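To show where the gaps are, here is a small sketch that lists which of the two ontology fields named above are still unfilled for each protocol. The values are placeholder labels, not real ontology IDs, and `None` marks a term that would have to be requested; `missing_terms` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple

# The two fields discussed above.
ONTOLOGY_FIELDS = (
    "library_preparation_protocol.library_construction_method.ontology",
    "library_preparation_protocol.input_nucleic_acid_molecule.ontology",
)


def missing_terms(
    protocols: Dict[str, Dict[str, Optional[str]]]
) -> List[Tuple[str, str]]:
    """Return (protocol_id, field_name) pairs that have no term assigned yet."""
    gaps = []
    for protocol_id, fields in protocols.items():
        for field_name in ONTOLOGY_FIELDS:
            if not fields.get(field_name):
                gaps.append((protocol_id, field_name))
    return gaps


# Placeholder labels only; None means no suitable term exists yet.
example_protocols = {
    "10x_v2_5p_gex_library_prep_protocol": {
        ONTOLOGY_FIELDS[0]: "10X 5' v2 sequencing",
        ONTOLOGY_FIELDS[1]: "polyA RNA",
    },
    "10x_v2_5p_vdj_TCR_library_prep_protocol": {
        ONTOLOGY_FIELDS[0]: None,
        ONTOLOGY_FIELDS[1]: "polyA RNA",
    },
}
```

Calling `missing_terms(example_protocols)` would flag only the library construction method of the TCR protocol, which is the term this ticket asks about.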

References:

Chromium Single Cell V(D)J Reagent Kits - User Guide
10x-pert Workshop | Characterization of the Tumor Microenvironment with the Chromium Single Cell Imm
Sequencing Requirements for Single Cell V(D)J
Experimental design for V(D)J libraries
Chromium VDJ presentation with nice diagrams

mshadbolt commented 5 years ago

I also have a question about how to describe the sequencing method for this kind of assay.

In my case, the V(D)J libraries enriched for either T cells or B cells were sequenced paired-end 150bp on the HiSeq 4000. Would this be considered 'tag-based single cell RNA sequencing' or 'full length single cell RNA sequencing'?

mshadbolt commented 5 years ago

This would ideally be resolved during this sprint so that I have a timeline for when I can ingest Kylie's dataset.

lauraclarke commented 5 years ago

Can you propose which schemas need to change/which need to be added?

We are going to struggle to get feedback from the US this week given Thanksgiving. Also, any schema changes will take 3 weeks to make it to production.

mshadbolt commented 5 years ago

I outlined my understanding and a potential way of modelling this type of experiment in the ticket above, but you keep asking for a 'proposal'. Is there some specific way you want me to do this?

lauraclarke commented 5 years ago

If you know which schemas need to be edited, make the PR with those edits and we can ask relevant people to review the PR

If you don't know what edits are needed, do you have a plan to get answers to the questions so edits can be made?

ESapenaVentura commented 3 years ago

I'll leave this ticket open but we need to include the vdj information in an SOP.

clairerye commented 3 years ago

we should also talk with SCEA when we do this to ensure we model this the same way