No detailed information about SomaLogic proteomics data

peng-gang commented 2 years ago

I am working on a SomaLogic proteomics dataset from human samples. I find that many of the proteins are not human; there are proteins from mouse and jellyfish. Are these proteins used for quality control?

Another question: there are multiple entries for one protein. For example, my data has 4 entries (SeqId: 15343-337, 19631-13, 4918-21, and 7784-1) for protein "P01042". Two of them (4918-21 and 7784-1) have the same target, "Kininogen, HMW". However, the correlation between the two is not very high (correlation coefficient = 0.75). Should I use the mean of the measurements with the same target in the downstream analysis?

In the package's example code, all these proteins are treated the same way.

The function "getAnalyteInfo" returns many columns of annotation data. Some columns are straightforward, but some are not. For example, for "ColCheck", should I use only the proteins with "PASS" in this column?

There should be more detailed documentation about the data.

wschwarzmann commented 2 years ago

Hello! I'm from SomaLogic's Global Scientific Engagement team. I've discussed your questions with some members of our Bioinformatics team, and here is our response:

Some SOMAmer Reagents are our controls and some target non-human proteins. To analyze only the 7288 analytes that target human proteins, select Type = "Protein" and Organism = "Human".
Some SOMAmer reagents bind to different epitopes on the same target protein. Reagents with the same target may show different patterns if one of those epitopes is less available.
ColCheck indicates whether the SOMAmer reagent measurements in our QC replicates pass our metrics. In your SomaScan Quality Statement file, you can find a breakdown of these under the SOMAmers In Tails section. SOMAmers In Tails is the cumulative number of SOMAmer reagents in the QC control whose ratio to the reference falls outside the accepted accuracy range, 0.8 - 1.2, on any plate. Flagged SOMAmer reagents are typically retained for analyses: accuracy across all assay runs is a robust quality metric, but it is not a requirement for identifying meaningful biological signal.
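
As a rough illustration of the filtering described above, here is a minimal base R sketch. It does not call SomaDataIO; the `anno` data frame is a toy stand-in for the table returned by `getAnalyteInfo`, with invented SeqIds and values. The `QcRatio` column is a hypothetical name used only to show the 0.8 - 1.2 range check, and the Type label on the control row is a placeholder.

```r
# Toy stand-in for an annotation table; all rows are invented for illustration.
anno <- data.frame(
  SeqId    = c("1111-1", "2222-2", "3333-3", "4444-4", "5555-5"),
  Type     = c("Protein", "Protein", "Protein", "Control (placeholder)", "Protein"),
  Organism = c("Human", "Human", "Mouse", "Human", "Human"),
  ColCheck = c("PASS", "FLAG", "PASS", "PASS", "PASS"),
  QcRatio  = c(1.04, 1.31, 0.95, 1.00, 0.88),  # hypothetical column name
  stringsAsFactors = FALSE
)

# Keep only reagents that target human proteins.
human <- anno[anno$Type == "Protein" & anno$Organism == "Human", ]

# Summarize ColCheck rather than dropping flagged reagents outright.
table(human$ColCheck)

# A ratio outside 0.8 - 1.2 is what the reply above calls "in tails".
in_tails <- human$QcRatio < 0.8 | human$QcRatio > 1.2
human$SeqId[in_tails]
```

Flagged reagents stay in `human` here, in line with the reply that they are typically retained for analysis; the flag is kept as annotation to consult when interpreting results.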

If you have any further questions or need more explanation, feel free to reach out to your sales rep and we can set up a tech support call with you. We can also provide you with a file that has a breakdown of the adat format and a description of each column. We appreciate your feedback, and are working on adding column descriptions in the documentation here.

peng-gang commented 2 years ago

Thank you so much for your reply. I just got data from my collaborator. I will ask him for the sales rep's information later.

One more question: if two SOMAmer reagents have the same target, why is the correlation coefficient relatively low? Both "4918-21" and "7784-1" target "Kininogen, HMW". However, the correlation coefficient is only 0.75, as shown below. If we remove the outliers around 0, the correlation coefficient drops to only 0.2.

[Image: Kininogen]

wschwarzmann commented 2 years ago

It's possible that these particular SOMAmer Reagents are targeting different epitopes with different availability, or different constructs of the protein (fragments, isoforms, heterodimers, etc.). Availability of an epitope can be affected by several things: modifications to a protein, SNPs, or interference from a competitor. We recommend running your analysis and seeing which SOMAmer Reagent measurement is significant downstream. We can explore your data and examine this specific instance if we can create a case through your collaborator's sales rep.
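
One way to act on this recommendation is to test each reagent separately instead of averaging them. The base R sketch below uses simulated data only, not real SomaScan measurements: `seq.A` and `seq.B` are hypothetical column names for two reagents with the same target, `phenotype` is an invented outcome, and the cutoff of 10 RFU for "near zero" samples is an arbitrary assumption.

```r
set.seed(42)  # simulated data only
n <- 100
phenotype <- rnorm(n)

# Two hypothetical reagents for one target: seq.A tracks the phenotype, seq.B does not.
rfu <- data.frame(
  seq.A = 10^(3 + 0.3 * phenotype + rnorm(n, sd = 0.1)),
  seq.B = 10^(3 + rnorm(n, sd = 0.1))
)
# A few samples with near-zero signal on both reagents.
rfu$seq.A[1:3] <- c(5, 6, 4)
rfu$seq.B[1:3] <- c(6, 4, 5)

log_rfu <- log10(rfu)

# Pearson correlation with all samples, then without the near-zero samples.
r_all  <- cor(log_rfu$seq.A, log_rfu$seq.B)
keep   <- rfu$seq.A > 10 & rfu$seq.B > 10   # arbitrary cutoff (assumption)
r_trim <- cor(log_rfu$seq.A[keep], log_rfu$seq.B[keep])
# Rank-based correlation is less sensitive to a handful of extreme samples.
r_rank <- cor(log_rfu$seq.A, log_rfu$seq.B, method = "spearman")
c(all = r_all, trimmed = r_trim, spearman = r_rank)

# Fit one model per reagent and compare the results side by side.
fits <- lapply(log_rfu, function(x) summary(lm(phenotype ~ x))$coefficients["x", ])
res  <- do.call(rbind, fits)
res
```

In this simulation a few shared low-signal samples inflate the Pearson correlation between otherwise unrelated measurements, which resembles the pattern described in the question. Keeping the reagents separate in `res` shows which one carries the association.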

peng-gang commented 2 years ago

I see. Thanks.

stufield commented 2 years ago

In addition to the answers above, in the next release of SomaDataIO the following table will be available in the documentation under the Col.Meta/Annotations help:

?colmeta
?annotations

Col.Meta

Field	Description	Example
SeqId	SomaLogic sequence identifier	2182-54_1
SeqidVersion	Version of SOMAmer sequence	2
SomaId	Target identifier, of the form SLnnnnnn (8 characters in length)	SL000318
TargetFullName	Target name curated for consistency with UniProt name	Complement C4b
Target	SomaLogic Target Name	C4b
UniProt	UniProt identifier(s)	P0C0L4 P0C0L5
EntrezGeneID	Entrez Gene Identifier(s)	720 721
EntrezGeneSymbol	Entrez Gene Symbol names	C4A C4B
Organism	Protein Source Organism	Human
Units	Relative Fluorescence Units	RFU
Type	SOMAmer target type	Protein
Dilution	Dilution mix assignment	0.01%
PlateScale_Reference	PlateScale reference value	1378.85
CalReference	Calibration sample reference value	1378.85
medNormRef_ReferenceRFU	Median normalization reference value	490.342
CalV4<YY><SSS><PPP>	Calibration scale factor (for given year-study-plate)	0.64
ColCheck	QC acceptance criteria across all plates/sets	PASS
QcReference_\<LLLLL>	QC sample reference value (for given QC lot)	PASS
CalQcRatioV4<YY><SSS><PPP>	Post calibration median QC ratio to reference (for given year-study-plate)	1.04
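
The SeqId, UniProt, and Target fields above are enough to find proteins measured by more than one reagent, as in the P01042 case raised earlier in the thread. A minimal base R sketch, assuming a toy annotation table: the SeqIds and the "Kininogen, HMW" target come from this thread, the other two target names are placeholders, and the C4b row is loosely based on the example column of the table.

```r
# Toy annotation table; "placeholder" targets are not real annotation values.
anno <- data.frame(
  SeqId   = c("15343-337", "19631-13", "4918-21", "7784-1", "2182-54"),
  UniProt = c("P01042", "P01042", "P01042", "P01042", "P0C0L4 P0C0L5"),
  Target  = c("placeholder A", "placeholder B", "Kininogen, HMW",
              "Kininogen, HMW", "C4b"),
  stringsAsFactors = FALSE
)

# The UniProt field can hold several space-separated identifiers; split them out.
ids  <- strsplit(anno$UniProt, "\\s+")
long <- data.frame(
  SeqId   = rep(anno$SeqId, lengths(ids)),
  UniProt = unlist(ids),
  stringsAsFactors = FALSE
)

# Group SeqIds by UniProt identifier and keep proteins with more than one reagent.
by_protein <- split(long$SeqId, long$UniProt)
multi <- by_protein[lengths(by_protein) > 1]
multi

# Within one protein, reagents sharing a Target name can be listed the same way.
kng <- anno[anno$UniProt == "P01042", ]
split(kng$SeqId, kng$Target)
```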

