How to model CITE-seq metadata?

rays22 commented 4 years ago

Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq) is a method in which oligonucleotide-labeled antibodies are used to integrate cell surface protein and transcriptome measurements into a single-cell readout. CITE-seq is compatible with existing single-cell sequencing approaches.

A wet lab workflow example is depicted here: https://en.wikipedia.org/wiki/CITE-Seq#/media/File:Structure_of_ADT&_Wetlab_workflow.jpg
A dry lab workflow example is depicted here: https://en.wikipedia.org/wiki/CITE-Seq#/media/File:CITE-Seq_dry_lab_figure.jpg The antibody-oligo conjugates are referred to as Antibody-Derived Tags (ADTs) in the figure.

Note that CITE-seq can be combined with various scRNA-seq library preparation technologies, for example different 10x versions.

Acceptance criteria

[ ] A consensus is reached about how to model CITE-seq metadata.
[ ] The action items needed to implement the model are identified.

rays22 commented 4 years ago

Possible approaches to model CITE-seq

Using a single library_preparation_protocol that is a combination of CITE-seq and 10x protocols.

This approach would require new CITE-seq EFO ontology child terms for each combination of CITE-seq and the appropriate scRNAseq library preparation versions. Each combination would be a cross-product of CITE-seq EFO_0009294 and the appropriate single cell library construction term. It would not require any changes in the HCA metadata schema.
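To illustrate how quickly the combined terms would multiply under this approach, here is a minimal Python sketch. The CITE-seq ID is the one quoted above; the library construction labels and the second assay name are illustrative placeholders, not real EFO terms.

```python
from itertools import product

# CITE-seq term ID as quoted in the discussion above.
CITE_SEQ_TERM = "EFO_0009294"

# Placeholder labels for library construction terms (assumptions, not EFO IDs).
LIBRARY_CONSTRUCTION_LABELS = ["10x 3' v1", "10x 3' v2", "10x 3' v3", "10x 5' v1"]


def combined_terms(assay_terms, library_labels):
    """Return one (assay, library) pair per new child term this approach needs."""
    return list(product(assay_terms, library_labels))


pairs = combined_terms([CITE_SEQ_TERM], LIBRARY_CONSTRUCTION_LABELS)
for assay, library in pairs:
    print(f"new child term needed: {assay} x {library}")

# A second antibody-based assay (placeholder name) doubles the number of new terms.
more_pairs = combined_terms(
    [CITE_SEQ_TERM, "SECOND_ASSAY_PLACEHOLDER"], LIBRARY_CONSTRUCTION_LABELS
)
print(len(pairs), len(more_pairs))
```

The number of new terms grows as (number of assays) x (number of library construction terms), which is the maintenance cost this approach carries.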

Using two library_preparation_protocols in the same process, one for CITE-seq and another for 10x.

This approach would not require new EFO ontology terms, because the existing CITE-seq term could be combined with any of the existing single cell library construction terms. It would not require any changes in the HCA metadata schema. In the HCA metadata spreadsheet the two library preparation protocols (one for CITE-seq and the other for the other library construction protocol, e.g. 10x v2) would be associated with the appropriate sequence files in the field for library_preparation_protocol.protocol_core.protocol_id using two appropriate identifiers separated by || from the Library preparation protocol tab. The metadata for the two library preparation protocols would be specified on two separate rows under the Library preparation protocol tab. For example, see example-2lib_protocols.xlsx.
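As a rough sketch of how a consumer might read such a cell, the snippet below splits a `||`-separated value and looks each identifier up in the protocol rows. The identifiers, the row layout, and the helper name are hypothetical and only serve the illustration.

```python
def split_protocol_ids(cell_value):
    """Split a '||'-separated spreadsheet cell into a list of protocol IDs."""
    return [part.strip() for part in cell_value.split("||") if part.strip()]


# Hypothetical rows from the "Library preparation protocol" tab, keyed by protocol_id.
library_prep_protocols = {
    "lib_prep_cite_seq": {"method": "CITE-seq"},
    "lib_prep_10x_v2": {"method": "10x v2"},
}

# Hypothetical cell value taken from a sequence file row.
cell = "lib_prep_cite_seq||lib_prep_10x_v2"

ids = split_protocol_ids(cell)
# IDs that are referenced by the file but have no row in the protocol tab.
missing = [pid for pid in ids if pid not in library_prep_protocols]
methods = [library_prep_protocols[pid]["method"] for pid in ids if pid in library_prep_protocols]

print(ids)
print(methods)
print(missing)
```

Checking for missing identifiers is one way to catch a sequence file row that points at a protocol that was never described.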

Using a new protocol type for CITE-seq in combination with a library_preparation_protocol.

This approach would require EFO to place CITE-seq under the protein_assay branch. It would also require changes in the HCA metadata schema to accommodate a new protocol type for protein_assay_protocol or polypeptide_assay_protocol. We could also capture the Antibody-Derived Tags (ADTs) more formally (not in the form of free text) by changing the metadata schema.

Any other suggestions?

mshadbolt commented 4 years ago

Not sure if that is what you meant by scenario 1, but I made this diagram because in my head we should have separate library preps for each type of CITE-seq library, typically gex, hashtag oligonucleotide (HTO) and antibody (ADT). This wouldn't involve changing the current schemas but would involve some edits to the ontology:

we would need to decide whether the gex library created as part of a cite-seq experiment sequenced on 10x is equivalent to any other 10x gene expression experiment, i.e. do we need a specific cite-seq gex term or can we use the existing 10x gex terms
we would need a specific 10X cite-seq HTO library term which would sit underneath both cite-seq and 10x 3' terms (I am assuming this is only done with 3' libraries?)
we would need a specific 10x cite-seq antibody library term which would also sit below both cite-seq and 10x 3' terms

[Diagram: citeseq_modeling]

Reading this paper has convinced me that this distinction should be made with different library protocols. Essentially the 10x is used to encapsulate cells, part of the created library is split for normal 10X gex, then part of it is mixed with HTOs and ADTs then the libraries are selectively PCRed to produce each of these libraries. I guess you could argue that it is actually two library preps, first regular 10X, then an additional step, but I think this adds complexity without necessarily adding real value to a consumer. I think the most important thing is for people to be able to distinguish each of the three library types and which should be analysed together.

CITE-seq on 10x Genomics instrument. Cells were “stained” with Cell Hashing antibodies and CITE-seq antibodies as described for CITE-seq [18]. “Stained” and washed cells were loaded into 10x Genomics Single Cell 3′ v2 workflow and processed according to the manufacturer’s instructions up until the cDNA amplification step (10x Genomics, USA). Two picomoles of HTO and ADT additive oligonucleotides were spiked into the cDNA amplification PCR, and cDNA was amplified according to the 10x Single Cell 3′ v2 protocol (10x Genomics, USA). Following PCR, 0.6X SPRI was used to separate the large cDNA fraction derived from cellular mRNAs (retained on beads) from the ADT- and Cell Hashtag (HTO)-containing fraction (in supernatant). The cDNA fraction was processed according to the 10x Genomics Single Cell 3′ v2 protocol to generate the transcriptome library. An additional 1.4X reaction volume of SPRI beads was added to the ADT/HTO fraction to bring the ratio up to 2.0X. The beads were washed with 80% ethanol, eluted in water, and an additional round of 2.0X SPRI performed to remove excess single-stranded oligonucleotides from cDNA amplification. After final elution, separate PCRs were set up to generate the CITE-seq ADT library (SI-PCR and RPI-x primers) and the HTO library (SI-PCR and D7xx_s). A detailed and regularly updated point-by-point protocol for CITE-seq, Cell Hashing, and future updates can be found at www.cite-seq.com
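The bead arithmetic in the quoted methods (0.6X, then an additional 1.4X to reach 2.0X) can be checked with a small Python sketch. The 100 µL reaction volume and the function name are assumptions for illustration only; the ratios are the ones quoted above.

```python
def additional_spri_volume(reaction_volume_ul, current_ratio, target_ratio):
    """Bead volume (in µL) to add so the bead:reaction ratio reaches target_ratio."""
    if target_ratio < current_ratio:
        raise ValueError("target ratio must not be below the current ratio")
    # Ratios are expressed relative to the original reaction volume.
    return round((target_ratio - current_ratio) * reaction_volume_ul, 3)


# Assumed reaction volume; the quoted text does not state one.
reaction_volume_ul = 100.0

first_cut = round(0.6 * reaction_volume_ul, 3)  # beads for the 0.6X step
top_up = additional_spri_volume(reaction_volume_ul, current_ratio=0.6, target_ratio=2.0)

print(f"0.6X step: {first_cut} µL beads")
print(f"top-up to 2.0X: {top_up} µL beads")
```

Under that assumed volume, the top-up equals 1.4 times the reaction volume, matching the "additional 1.4X" in the text.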

rays22 commented 3 years ago

Not sure if that is what you meant by scenario 1,

I would be happy to clarify any specific aspects of scenario 1 as I understand it. I should note that I listed it because it had come up in discussions with other wranglers. I have tried to list the full spectrum of suggested models that I am aware of, even those that are not my preferred choices. Personally, I would prefer a model with at least two distinct library preparation protocols to model CITE-seq. Moving forward, I think your suggestion of three protocols is a good model to discuss and refine.

mshadbolt commented 3 years ago

Have you made an example spreadsheet with the scenario you prefer?

Do we need another meeting to decide on what we are going to do?

rays22 commented 3 years ago

Have you made an example spreadsheet with the scenario you prefer?

I did make an example spreadsheet with two library prep protocols (example-2lib_protocols.xlsx), but after reading your suggested model and some method papers and protocols, I think I would also prefer a three-protocol model, so that spreadsheet does not represent my current thinking. My understanding is that the cell-hashing protocol can be used independently of the CITE-seq protocol. CITE-seq measures protein levels on the surface of single cells. Cell-hashing with barcoded antibodies is based on the same principle as CITE-seq, but it has a different experimental purpose (sample multiplexing, 'super-loading' samples, detection of multiplets, controlling for batch effects(?)). Because the two methods have different purposes, I think we can safely anticipate that cell-hashing will be used independently of CITE-seq. There are other open questions.

We need to decide what the correct input should be for the library_preparation_protocol.input_nucleic_acid_molecule ontology fields.
We can also think about how to capture the list of CITE-seq and the cell-hashing antibodies. Right now I could capture the CITE-seq antibody list only as free text in the library_preparation_protocol.protocol_core.protocol_description field. Do we need a dedicated field for the Ab list?
Do we also need to collect Antibody barcode - Antibody tables? I am not sure if you can tell which Ab barcode represents which antibody without them.
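To make the barcode question concrete, here is a minimal sketch of what an antibody barcode table could look like as a lookup structure. The class, field names, barcodes, and antibody/protein labels are all placeholders invented for the example, not a proposed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntibodyTag:
    barcode: str          # oligo barcode read from the ADT/HTO library
    antibody: str         # antibody name or clone (placeholder values below)
    target_protein: str   # surface protein the antibody binds


# Placeholder rows; a real table would come from the data contributor.
TAGS = [
    AntibodyTag("AAAAAAAA", "antibody_A", "protein_A"),
    AntibodyTag("CCCCCCCC", "antibody_B", "protein_B"),
]


def index_by_barcode(tags):
    """Build a barcode -> tag lookup, rejecting duplicate barcodes."""
    index = {}
    for tag in tags:
        if tag.barcode in index:
            raise ValueError(f"duplicate barcode: {tag.barcode}")
        index[tag.barcode] = tag
    return index


lookup = index_by_barcode(TAGS)
print(lookup["CCCCCCCC"].target_protein)
```

Without a table like this, a barcode count in the data cannot be mapped back to an antibody or a protein, which is the concern raised above.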

Do we need another meeting to decide on what we are going to do?

Yes, I think we need a meeting to make decisions. Before the meeting, I also think that we need to circulate a proposal for the model to be discussed, and a list of discussion points to make the meeting productive. I think your model would be a good pick to discuss. I would like to think more about some of the details in it. For example, I would like to research and pick possible ontology terms, and consider the advantages or disadvantages of term choices and combining terms. I will add my thoughts as comments later.

mshadbolt commented 3 years ago

ok cool thanks for summarising your thoughts

One option I thought of as I was reading was for the lists of antibodies and barcodes, we could attach them as supplementary files and link to the relevant protocol via the 'document' field.

But yes sounds like we need a bit more work and discussion before we can converge on an appropriate solution.

rays22 commented 3 years ago

REAP-seq: RNA and protein expression assay that also uses DNA-barcoded antibodies together with high-throughput scRNA-seq

It is very similar to CITE-seq. The main difference between CITE-seq and REAP-seq is how the DNA barcode is conjugated to the antibodies.

three-part antibody conjugated oligonucleotide:
1. 33-bp Nextera Read 1 sequence +
2. a unique 8-bp antibody barcode +
3. 24-25 bp poly(dA) sequence that binds to the poly(dT) primer on the 10x Genomics beads. The beads contain the cell barcodes.
Reference: Peterson VM, Zhang KX, Kumar N, et al. Multiplexed quantification of proteins and transcripts in single cells. Nature Biotechnology. 2017 Oct;35(10):936-939. DOI: 10.1038/nbt.3973.
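The three-part layout listed above can be sketched in Python as a simple slicing check. The Read 1 segment is a run of N placeholders and the barcode is made up; only the segment lengths come from the description above.

```python
READ1_LEN = 33          # Nextera Read 1 segment length, from the list above
BARCODE_LEN = 8         # antibody barcode length, from the list above
POLY_A_LENGTHS = (24, 25)


def split_oligo(oligo):
    """Split an antibody oligo into (read1, barcode, poly_a) and validate lengths."""
    read1 = oligo[:READ1_LEN]
    barcode = oligo[READ1_LEN:READ1_LEN + BARCODE_LEN]
    poly_a = oligo[READ1_LEN + BARCODE_LEN:]
    if len(read1) != READ1_LEN or len(barcode) != BARCODE_LEN:
        raise ValueError("oligo is too short for the expected layout")
    if len(poly_a) not in POLY_A_LENGTHS or set(poly_a) != {"A"}:
        raise ValueError("expected a 24-25 nt poly(dA) tail")
    return read1, barcode, poly_a


# Placeholder oligo: N's stand in for the real Read 1 sequence.
example = "N" * READ1_LEN + "ACGTACGT" + "A" * 24

read1, barcode, poly_a = split_oligo(example)
print(barcode, len(poly_a), len(example))
```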

ami-day commented 3 years ago

Hi @rays22 and @mshadbolt, Sorry, coming back to this, I thought we had reached a conclusion more than we had :-/ I think a combination of Ray's 2nd & 3rd model and Marion's other suggestions would be good.

The list of possible library_preparation_protocol.protocol_core.protocol_id ontologies could get very extensive if we combine methods into single terms, e.g. 'CITE-seq 10X v2'. Having 2 (or more) library preparation protocols separated by || in the library protocol tab seems best to me.
Since there are various protein assay types and they can be very similar, it might be useful to cite their published name in library_preparation_protocol.protocol_core.protocol_description and then enter 'oligo-tagged antibody protein expression assay||10X v2 sequencing' in the library_preparation_protocol.protocol_core.method.ontology field (referring to CITE-Seq, REAP-seq, other cell-hashing pipelines). This could be linked to all output file types and we could use the 'content description' in the sequence file tab to distinguish the types.
An extra field in the library preparation protocol tab to list the surface proteins that the antibodies bind to would be good. I'm not sure if the antibodies themselves would be needed also?
Then as Marion said, make a supplemental table available with: antibody barcode, antibody itself, the surface protein it targets.
For the library_preparation_protocol.input_nucleic_acid_molecule field, can we have 2 entries? e.g. protein||polyA RNA? I think we don't need to add the cDNA here, since it isn't a target molecule of interest, but used to measure the target of interest (protein).
Although I think this approach is best, as Ray mentioned, the downside is that it requires a new ontology term for 'oligo-tagged antibody protein expression assay' and an extension to the library preparation protocol metadata schema.

What do you think? Shall we talk tomorrow to decide?

rays22 commented 3 years ago

Thanks @ami-day for your comments. They look very useful to me. Yes, we can have a discussion tomorrow or soon after that.

mshadbolt commented 3 years ago

For @ami-day's suggestion above, I don't understand how you would distinguish libraries that are gex vs hashtag oligos vs antibody if all three library preparation methods say cite-seq||10x v2. Or are you suggesting something like cite-seq||gene expression||10X v2, cite-seq||hashtag oligos||10X v2 etc?

Having multiple ontologies associated with library_preparation_protocol.method will be a major metadata schema change and we aren't really sure when that will be possible.

Changing the input.nucleic_acid_molecule to an array would also be a major schema update.

I don't really agree that a protein is an input molecule in this case, it is an oligo sequence that tagged the protein/antibody, not the protein itself.

ami-day commented 3 years ago

@Marion could we not use the sequence_file.file_core.content_description.text field to distinguish the files? maybe expand the ontology terms we can add here to be more specific to hashtag oligos and antibody?
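As a rough illustration of that idea, the sketch below groups sequence files by a content description value. The file names and the three description labels are placeholders; they are not existing ontology terms.

```python
from collections import defaultdict

# Hypothetical sequence file rows; labels are placeholders, not ontology terms.
sequence_files = [
    {"file_name": "s1_gex_R1.fastq.gz", "content_description": "gene expression library"},
    {"file_name": "s1_gex_R2.fastq.gz", "content_description": "gene expression library"},
    {"file_name": "s1_adt_R1.fastq.gz", "content_description": "antibody-derived tag library"},
    {"file_name": "s1_hto_R1.fastq.gz", "content_description": "hashtag oligo library"},
]


def group_by_content(files):
    """Group file names by their content description label."""
    groups = defaultdict(list)
    for row in files:
        groups[row["content_description"]].append(row["file_name"])
    return dict(groups)


grouped = group_by_content(sequence_files)
for label, names in grouped.items():
    print(label, names)
```

With labels this specific, all three library types could share the same library preparation method entries and still be told apart at the file level.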

I didn't suggest adding cite-seq in the library preparation method. I think it's important we have a broader term like oligo-tagged antibody protein assay since there are already pipelines very similar to cite-seq and potentially there will be more in future. I also think it is clearer to someone who is not familiar with cite-seq specifically.

Having multiple ontologies associated with library_preparation_protocol.method will be a major metadata schema change: yes, but I think we need to be able to do this. Otherwise, we will be adding new combinations of ontology terms every time a new pipeline is given a name, and this happens often.

I wasn't completely sure about the input.nucleic_acid_molecule field, so I agree to whatever you think is right here.

ami-day commented 3 years ago

Let's discuss when @ESapenaVentura is back next week

ESapenaVentura commented 3 years ago

@ESapenaVentura to set up a meeting to discuss this

ESapenaVentura commented 3 years ago

Key points of the discussion:

We have agreed to keep using the 10x (or whichever) library prep method ontology term, alongside another ontology term that indicates a protein surface assay
- In order to do so, there will be a metadata schema change needed. @ESapenaVentura will write the ticket
- @ESapenaVentura will also create a ticket for further discussion about the ontology term
In order to know which library preparation is which (ADT, GEX, HTO), we will need another change to the sequence_file schema. This change will be minor. @ESapenaVentura to open a ticket to start this discussion

ESapenaVentura commented 3 years ago

https://github.com/HumanCellAtlas/metadata-schema/issues/1338 Metadata schema ticket for the changes needed for method.

ESapenaVentura commented 3 years ago

Key points of the discussion:

We have agreed to keep using the 10x (or whichever) library prep method ontology term, alongside another ontology term that indicates a protein surface assay

In order to do so, there will be a metadata schema change needed. @ESapenaVentura will write the ticket

@ESapenaVentura will also create a ticket for further discussion about the ontology term

In order to know which library preparation is which (ADT, GEX, HTO), we will need another change to the sequence_file schema. This change will be minor. @ESapenaVentura to open a ticket to start this discussion

We have decided to go with a mix of the first point and the terms suggested here

We will need the schema change to allow for arrays in the "method" field, but other than that the new ontology terms will allow us to capture CITE-Seq experiments and, as a bonus, using the terms in the ticket may help to broker the datasets to SCEA more easily in the future.
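A minimal sketch of what allowing arrays in "method" could mean for code that reads the metadata: accept either the current single entry or a list of entries, and always work with a list. The `text` key and the example values are assumptions for illustration; the real field layout is whatever the schema ticket settles on.

```python
def as_method_list(method):
    """Normalise a 'method' value to a list of ontology entries.

    Accepts a single entry (dict) or a list of entries, so that metadata
    written before and after the schema change can be read the same way.
    """
    if isinstance(method, dict):
        return [method]
    if isinstance(method, list) and all(isinstance(m, dict) for m in method):
        return list(method)
    raise TypeError("method must be an ontology entry or a list of entries")


# Hypothetical values: one entry today, two entries after the change.
single = {"text": "10x v2 sequencing"}
combined = [
    {"text": "10x v2 sequencing"},
    {"text": "oligo-tagged antibody protein expression assay"},
]

print([m["text"] for m in as_method_list(single)])
print([m["text"] for m in as_method_list(combined)])
```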

ofanobilbao commented 3 years ago

If agreement has been reached on how to model CITE-seq metadata, shouldn't this ticket be considered done? I can't remember why we moved it to In Progress instead during the meeting. @ami-day @ESapenaVentura @Wkt8 @rays22 ? Was it because someone needed to capture the highlights somewhere else? Thanks!

ami-day commented 3 years ago

@ESapenaVentura can we close this ticket? Did you record this information elsewhere, e.g. in the metadata schema updates ticket?

ebi-ait / hca-ebi-wrangler-central