ThomasMZheng closed 1 year ago
This looks like it will take more work than anticipated. At a naive glance the datasets look different from each other when you compare their respective buffer and QC samples, although the actual human samples overlap decently. Hopefully they will look more similar after normalization. I need to normalize them anyway, but because of the large outliers I will most likely normalize by the 95th percentile (0.95 quantile) rather than the max value.
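As a rough sketch of why a high percentile is a safer reference than the max when there are large outliers (written in Python with the standard library only, not the R used for the plots; the function names and the linear-interpolation quantile are assumptions for illustration, not the actual pipeline):

```python
def percentile(values, q):
    """Quantile with linear interpolation between order statistics; q is in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)                      # index of the lower neighbour
    hi = min(lo + 1, len(s) - 1)       # index of the upper neighbour
    frac = pos - lo
    return s[lo] + frac * (s[hi] - s[lo])


def normalize_by_percentile(values, q=0.95):
    """Divide every value by the q-th quantile instead of the maximum."""
    ref = percentile(values, q)
    if ref == 0:
        raise ValueError("reference percentile is zero; cannot normalize")
    return [v / ref for v in values]


# Hypothetical signal values with one extreme outlier.
signal = [float(v) for v in range(1, 100)] + [1_000_000.0]
by_p95 = normalize_by_percentile(signal, 0.95)
by_max = [v / max(signal) for v in signal]
```

With max normalization the single outlier squeezes every other value toward zero, while the percentile reference keeps the bulk of the values on a usable scale. Values above the reference simply end up greater than 1.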
- Note: 93 values were removed (mostly the buffer results, which washed out the remaining density plots)
xlim(100,10000)
70 of the removed values were between 0 and 100, most likely the buffer values; the other 23 were above 10000
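A quick way to tally how many values an axis limit drops on each side, sketched in Python (standard library only). The function name and the sample numbers are hypothetical, and the sketch assumes that values exactly on a limit are kept:

```python
def count_outside(values, lower, upper):
    """Return (n_below, n_above) for values strictly outside [lower, upper]."""
    below = sum(1 for v in values if v < lower)
    above = sum(1 for v in values if v > upper)
    return below, above


# Hypothetical measurements, not the real data.
example = [12.0, 85.0, 100.0, 450.0, 9800.0, 10000.0, 15000.0]
n_below, n_above = count_outside(example, 100, 10000)
```

Keeping the two counts separate makes it easy to report the low (likely buffer) and high ends independently, as in the note above.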
The addition of ", y = ..scaled.." to the aes function of ggplot results in a more comprehensible graph
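For reference, the scaled density is the density estimate divided by its own maximum, so every curve peaks at 1 regardless of sample size or spread. A minimal Python sketch of that idea (standard library only; the Gaussian kernel, the normal-reference bandwidth rule, and the function names are assumptions for illustration and may not match the settings used for the plots):

```python
import math
import statistics


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each point in grid."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb (assumed, not taken from the plots).
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


def scaled_density(values, grid, bandwidth=None):
    """Density rescaled so that its maximum on the grid is exactly 1."""
    dens = gaussian_kde(values, grid, bandwidth)
    peak = max(dens)
    return [d / peak for d in dens]
```

Because each curve is rescaled by its own peak, scaled densities show where each dataset is concentrated but no longer integrate to 1, so areas under the curves are not comparable.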
Only the 23 values above 10000 were omitted
Added alpha for better visualization
- 28 values larger than 750 were removed; in fact, we could limit to 650 and only lose 8 more values
xlim(0,7500)
Here is the scaled density plot of the same protein
Same as above, alpha added
When looking at other proteins, we notice that the distributions do not have the same shape between datasets, for example SIGLEC12. This makes the scaled density less informative and also means that we need to reconsider whether a simple normalization is effective.
I have not updated this in a while. I had a meeting with Shidong and Lena from SomaLogic, and they said that the two datasets were normalized differently, which skews all of the data.
In the current state it is impossible to merge the datasets; however, they are working on it right now.
-----------------------###-----------------------
Meanwhile, I finally compiled a full list for the VAP-BQC-Phenotype Map, complete with days since onset, sex, and case.
Just need to wait until Lena and Shidong get back to me now.
Look into a naive analysis of the datasets and see if I can show that they are significantly different from one another.
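One naive option is a two-sample permutation test on a single protein. The sketch below is Python (standard library only) and is only an illustration: the function name, the choice of the absolute difference in means as the statistic, and the sample values are all assumptions, not the analysis actually run.

```python
import random
import statistics


def permutation_test(a, b, n_perm=2000, seed=0):
    """Approximate p-value for the absolute difference in means between a and b."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)            # randomly reassign dataset labels
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.fmean(perm_a) - statistics.fmean(perm_b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical values for one protein in two datasets.
dataset_1 = [100.0 + i for i in range(20)]
dataset_2 = [500.0 + i for i in range(20)]
p_value = permutation_test(dataset_1, dataset_2)
```

With large outliers present, the mean could be swapped for the median or another robust statistic. Testing many proteins this way would also need a multiple-comparison correction.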
Also look into the range of all of the proteins and find outliers; maybe even create histograms of different proteins by dataset, outcome, etc.
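A simple starting point for the range and outlier check is the interquartile-range (IQR) rule, sketched here in Python with the standard library only. The function name, the conventional 1.5 multiplier, and the sample values are assumptions for illustration:

```python
import statistics


def range_and_outliers(values, k=1.5):
    """Return (min, max, outliers), flagging values beyond k * IQR from the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return min(values), max(values), outliers


# Hypothetical values for one protein, not the real data.
protein = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 12.0, 500.0]
lowest, highest, flagged = range_and_outliers(protein)
```

Running this per protein and per dataset would give a table of ranges and flagged values, which could guide which proteins are worth plotting as histograms.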