Annotation of spike-in samples

levitsky commented 4 years ago

There is a currently open PR #321, which annotates a sample spiked with a single protein (as far as I understand) analogous to already annotated PXD001819.

Both annotations use characteristics[protein_coding_gene] to designate the spiked protein and characteristics[concentration of] for its concentration.

I suggest discussing and documenting how spiked samples should be annotated. I think referring to the gene is not ideal: for spiked proteins we need a term that refers to the protein itself, for example by its UniProt accession, since this is a proteomics format for a proteomics repository. Also, protein_coding_gene is not an EFO term.

We probably need a designated column for spike-in proteins.
Perhaps we should also consider spike-in peptides and how to refer to them; this probably means using the sequence?
Then we need to decide if concentration of is good enough to capture the concentration, or we need something else. Would it be ambiguous or is it good enough?
Also, if we repeat these columns for two or more spiked-in proteins, we have the same situation as discussed in #285. But I guess in this case relying on column ordering would be the best way.
Finally, consider other cases for spiked samples:
- spiking with a complex sample (like human spiked with yeast). I think this should be annotated as two separate rows if possible, as discussed in #223, but we need a way of annotating relative concentrations.
- spiking with a set of peptides, like iRT. This could arguably be important e.g. if we want to be able to re-process the data based on the annotation, but for that we need the sequences.

The latter is closer to the very extreme: annotating individual peptides/proteins as samples. Should we consider this a viable option?

ypriverol commented 4 years ago

@levitsky The spiked sample is a complex problem:

A single spiked protein is something we can represent, and we can even accommodate the changes you proposed. A more generic approach can be more challenging.

This is where the old discussion about a REF to a protocol file makes a lot of sense, or where a link to an external data file works better, which is what I would vote for. This is complex because most projects do not contain only one spiked protein, so we would end up adding a lot of protein columns to the SDRF (the same goes for peptides). If the number is manageable (fewer than 5-10), that is fine, but if we start seeing more than that we need to recommend a link to an external file that contains the values.

Agree with you about using specific terms for this:

I propose to use the key-value pair data structure like:

characteristics[spike protein] -> AC=UniProtAccession;PC=ProteinConcentration
characteristics[spike peptide] -> PS=PeptideSequence;PC=PeptideConcentration;iRT=value

I can add the terms to PRIDE ontology.
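For illustration, here is a minimal Python sketch of how a cell in the proposed key-value format could be read back. The function name `parse_key_value` and the example values (the accession, the placeholder peptide sequence, the concentrations, and the iRT value) are assumptions made for this sketch; only the keys AC, PS, PC, and iRT come from the proposal above.

```python
def parse_key_value(cell: str) -> dict[str, str]:
    """Split a 'K1=V1;K2=V2' cell into a dict, tolerating spaces around separators."""
    pairs: dict[str, str] = {}
    for part in cell.split(";"):
        part = part.strip()
        if not part:
            continue  # ignore empty fragments, e.g. from a trailing ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in {part!r}")
        pairs[key.strip()] = value.strip()
    return pairs


# Hypothetical cell values, for illustration only.
spike_protein = parse_key_value("AC=P02769; PC=50 fmol")
spike_peptide = parse_key_value("PS=PEPTIDEK;PC=100 fmol;iRT=12.5")
```

Keeping all values as strings leaves the interpretation of units and numbers to a later step, which seems reasonable while the allowed units are still undecided.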

levitsky commented 4 years ago

I like the idea with accession/sequence and concentration in key-value pairs in a single column.

Perhaps we could generalize it also to add mixtures? E.g. characteristics[spiked component] -> CT={protein|peptide|mixture|other}; ... where CT is component type, and then have keys for each type, like: AC for proteins, PS for peptides, name and vendor for commercial kits? vendor could also have a special value like "in-house", and for kits maybe a key with external URI for specification? For "other" we could have name and formula keys.

With that I think we will cover all cases, except for a real biological sample being spiked into another. But in this case we can (and should) just use a second row.

Also a common key for all spiked compounds and samples would be concentration. For samples it would be in its own column, I think we can use concentration of, it makes sense if it applies to the whole sample in SDRF rather than the spiked component like we have in current annotations. In [spiked component] the common key could be like SC for spike-in concentration or CC for component concentration. Also I think we need to specify supported units for concentration and use them uniformly in all these cases.
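The component-type idea could be checked mechanically along these lines. This is only a sketch under assumptions: the table `REQUIRED_KEYS` and the function `missing_keys` are hypothetical, and the key spellings `name`, `vendor`, and `formula` are taken literally from the suggestion above rather than from any agreed specification.

```python
# Keys each component type is assumed to need, following the suggestion above.
REQUIRED_KEYS = {
    "protein": {"AC"},
    "peptide": {"PS"},
    "mixture": {"name", "vendor"},
    "other": {"name", "formula"},
}


def missing_keys(component: dict[str, str]) -> set[str]:
    """Return the type-specific keys that are absent from a parsed component."""
    component_type = component.get("CT")
    if component_type not in REQUIRED_KEYS:
        raise ValueError(f"unknown or missing component type: {component_type!r}")
    return REQUIRED_KEYS[component_type] - set(component)


# Hypothetical annotations, for illustration only.
complete = missing_keys({"CT": "protein", "AC": "P02769"})
incomplete = missing_keys({"CT": "mixture", "name": "UPS1"})
```

Here `complete` is an empty set and `incomplete` contains only `vendor`, so a validator could report exactly which keys an annotator still has to fill in.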

ypriverol commented 4 years ago

Looks nice. @levitsky Can you do a PR to the specification with this proposal?

mlocardpaulet commented 4 years ago

I think characteristics[concentration of] is OK if we also have characteristics[injected volume] (or similar, not sure about the ontology here). This should be accompanied by the injection volumes of all the other samples. Alternatively, we could have characteristics[quantity] (which should be the quantity injected, molar or mass). I don't think that the concentration alone is enough.

levitsky commented 4 years ago

Shouldn't it be enough to have either quantity for all components (main sample and added compounds) or concentration for all components?

mlocardpaulet commented 4 years ago

Well, theoretically it would be enough to analyse the data of one data set, but not to integrate it with others. And since the main goal is to be able to re-analyse several PRIDE submissions together we need to have the quantity and not the concentration. So either mass, molarity, or concentration + injection volumes of all samples.

levitsky commented 4 years ago

Does this argument apply to spike-in samples only, or are you saying we need quantity or mass for all annotations? Looks like the point about integration between datasets is not directly tied to spike-in experiments, or am I missing something?

mlocardpaulet commented 4 years ago

In what other context would you have this information? Purified proteins analysed alone? I have no idea of other use-cases, but it could very well be an issue. The important thing is that whatever the context (spiked-in samples or else), you can only directly compare concentrations if you are sure that the injected volume is the same in the runs you compare. If you know the injection volume you are fine because you can calculate the quantity, if you have the quantity it is even better. But if you just have the concentration and no idea whatsoever of the volume injected, it is useless. Does it answer your question?
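The arithmetic behind this point can be sketched in a few lines of Python. The function name and the numbers are made up for illustration; the only thing taken from the discussion is that injected quantity equals concentration times injected volume.

```python
def injected_amount_fmol(concentration_fmol_per_ul: float, volume_ul: float) -> float:
    """Injected quantity = concentration x injected volume."""
    return concentration_fmol_per_ul * volume_ul


# Two runs with the same annotated concentration but different injection volumes.
run_a = injected_amount_fmol(50.0, 1.0)
run_b = injected_amount_fmol(50.0, 2.0)
```

Both runs would carry the same concentration annotation, yet `run_b` injects twice as much as `run_a`, which is why the concentration alone cannot be compared across datasets.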

levitsky commented 4 years ago

Okay, after some off-mike consultation I think I get your point :)

So it looks like for the main sample we want its mass (not sure how molar amount would apply to complex samples), and for the spiked compound we need injected quantity or injected mass. It doesn't seem to make sense to annotate concentration and volume separately if their only use would be to multiply them, I think it's reasonable that the annotator can do multiplication; but forcing them to convert protein quantities to masses may be too much.

If I am correct this time and we agree on this, then the question is, how exactly do we put this in SDRF. Looks like we need something like characteristics[sample amount] or characteristics[injected mass] for the sample, which is probably in mass units, and then for the spiked component we can have either quantity or mass within the key-value structure.

Any input is very much welcome for the column and key name(s) that would make the most sense, and what ontology terms we can use or need to create for sample amount. Also it's very important for parsing to have a short list of allowed units and their designations in the SDRF specification.

mlocardpaulet commented 4 years ago

Great. Regarding mass vs. amount, I am afraid this will depend on the type of experiment and/or the user, and I don't think we have a good reason for preferring one over the other. Maybe other people have a stronger opinion? I would personally offer both options (if this is possible). In any case, if the SDRF is filled in properly, the compound (peptide sequence, protein ID, chemical composition, etc.) should be recorded somewhere, which should allow anybody to calculate a MW for re-analysis. My opinion is that in this case it is better to make things easier for the person who fills in the SDRF (if she/he has the choice between mass and molarity, it is faster and easier to copy-paste from another table than to do the calculation).
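As a rough sketch of the conversion a re-analyst would do once the molecular weight is known, assuming the MW has already been obtained from the annotated sequence or accession by some other means; the function names and the round-number MW are hypothetical:

```python
def moles_to_grams(amount_mol: float, molecular_weight_g_per_mol: float) -> float:
    """Mass = molar amount x molecular weight."""
    return amount_mol * molecular_weight_g_per_mol


def grams_to_moles(mass_g: float, molecular_weight_g_per_mol: float) -> float:
    """Molar amount = mass / molecular weight."""
    return mass_g / molecular_weight_g_per_mol


# Illustrative round number, not the MW of any particular protein.
ILLUSTRATIVE_MW = 50_000.0

# 100 fmol of a 50 kDa protein is roughly 5 ng.
mass_g = moles_to_grams(100e-15, ILLUSTRATIVE_MW)
```

Since the conversion is this simple, letting annotators enter whichever of mass or molar amount they already have loses no information, as long as the compound itself is identified.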

levitsky commented 4 years ago

I agree with providing both options for mass and quantity. If there are no objections, we only need to agree on format now.

First suggestion: since we need a separate column for quantity of sample (mass), we could use the same column for amount of spiked compound. For this to work we will need to come up with a good term for PRIDE ontology that would capture it. Something like characteristics[compound quantity]. I'm not sure what the best phrase would be. The value format would be {float} {unit}, with a list of recognized units being something like: g, mg, ng, mol, mmol, nmol, fmol, amol.

What do y'all think?
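To make the suggested `{float} {unit}` value format concrete, here is a small Python sketch of a parser restricted to the units listed above. The names `parse_quantity`, `MASS_UNITS`, and `AMOUNT_UNITS` are assumptions for this example, as is the choice to reject anything outside the list.

```python
import re

# Units taken from the list suggested above.
MASS_UNITS = {"g", "mg", "ng"}
AMOUNT_UNITS = {"mol", "mmol", "nmol", "fmol", "amol"}

# A number, at least one space, then an alphabetic unit.
_QUANTITY = re.compile(r"^\s*([0-9]*\.?[0-9]+(?:[eE][+-]?[0-9]+)?)\s+([A-Za-z]+)\s*$")


def parse_quantity(text: str) -> tuple[float, str, str]:
    """Parse '{float} {unit}' and report whether the unit is a mass or a molar amount."""
    match = _QUANTITY.match(text)
    if match is None:
        raise ValueError(f"expected '<number> <unit>', got {text!r}")
    value = float(match.group(1))
    unit = match.group(2)
    if unit in MASS_UNITS:
        kind = "mass"
    elif unit in AMOUNT_UNITS:
        kind = "amount"
    else:
        raise ValueError(f"unsupported unit: {unit!r}")
    return value, unit, kind


spike = parse_quantity("50 fmol")
sample = parse_quantity("1.5 mg")
```

A closed unit list like this keeps parsing trivial; anything else, such as a volume, is rejected rather than guessed at.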

mlocardpaulet commented 4 years ago

This looks good. So if I understood well, we would have two columns:

- characteristics[spiked component], as you described earlier (CT = component type, then some identifiers)
- characteristics[compound quantity], with the quantity (in mass or molarity) injected

What happens if there are several spiked-in proteins or peptides with different quantities? Do we add as many pairs of columns as needed? How do we match the component and the quantity? Based on the order of the columns?

levitsky commented 4 years ago

Ah, thank you for this question. Looks like I was confused when I was suggesting this. I think it's definitely better to include the injected quantity for the spiked compound as a key-value pair within spiked component (or spiked compound), and have an extra (required, in case of spikes) column for main sample quantity (injected mass or some term like that). This way, adding another spiked component only adds one column.
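A sketch of what such a row could look like when written out as tab-separated text. Everything concrete here is an assumption for illustration: the column name `characteristics[injected mass]` is the placeholder mentioned above, `QY` is a hypothetical key for the injected quantity, and the sample name, accession, sequence, and amounts are invented.

```python
import csv
import io


def spiked_cell(**pairs: str) -> str:
    """Join key-value pairs into a single 'K1=V1;K2=V2' cell."""
    return ";".join(f"{key}={value}" for key, value in pairs.items())


# (header, value) pairs rather than a dict, because the spiked column repeats.
columns = [
    ("source name", "sample 1"),
    ("characteristics[injected mass]", "500 ng"),
    ("characteristics[spiked compound]", spiked_cell(CT="protein", AC="P02769", QY="50 fmol")),
    ("characteristics[spiked compound]", spiked_cell(CT="peptide", PS="PEPTIDEK", QY="10 fmol")),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow([header for header, _ in columns])
writer.writerow([value for _, value in columns])
sdrf_text = buffer.getvalue()
```

Each additional spike adds exactly one column, and its quantity travels inside the same cell, so nothing has to be matched up by column order.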

levitsky commented 4 years ago

@ypriverol @anjaf Do you have suggestions for an ontology term to use for "used sample amount", in mass units?

levitsky commented 4 years ago

Suggestion from @lisavetasol: use characteristics[mass] (EFO/PATO:0000125).

Do you think it fits the purpose or should we invent a new one?

levitsky commented 4 years ago

The proposed specification changes have been merged. I think what is left to do here is to update two annotations of spiked samples currently present (are there more?):

[ ] PXD001819
[ ] PXD009815

ypriverol commented 4 years ago

@daichengxin can you fix these two datasets?

julianu commented 4 years ago

I will also annotate a spike-in dataset, if we have a consensus on this now. It is not uploaded yet; I was waiting for a decision here.

levitsky commented 4 years ago

@daichengxin I have looked at the two datasets that you reannotated, and everything looks great to me except for one thing: I expected Sigma Aldrich to be listed as the vendor, not Standards Research Group.

Also, we can possibly add a specification URL (CS key) for UPS1: https://www.sigmaaldrich.com/content/dam/sigma-aldrich/docs/Sigma/Datasheet/2/ups1dat.pdf. Not that it is very much necessary, but we could showcase these annotations as examples in documentation. What do you think?

bigbio / proteomics-sample-metadata

Annotation of spike-in samples #328