JovingeLabSoftware / LincServe

A high-performance (hopefully) LINCS data server, including a Node.js server, a Couchbase backend, and an R package for ETL

Calculate zspc (plate control) z-scores with our pipeline and confirm they match the LINCS published scores #32

Closed ghost closed 8 years ago

ghost commented 8 years ago

These may not match exactly due to robust z-score vs. standard z-score differences and/or averaging, but they should be very highly correlated.
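For reference, here is a minimal sketch (in JavaScript, not the actual pipeline code) of the two scoring variants mentioned above: a standard z-score against the plate mean/SD, and a robust z-score against the plate median/MAD. The example data are made up.

```javascript
// Sketch only: standard vs. robust z-scores for one gene's values
// across a plate's wells. Not the actual LincServe pipeline code.

function mean(xs) {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Standard z-score: (x - mean) / sd
function zscore(xs) {
  const m = mean(xs);
  const sd = Math.sqrt(
    xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1)
  );
  return xs.map(x => (x - m) / sd);
}

// Robust z-score: (x - median) / (1.4826 * MAD). The 1.4826 constant
// makes the MAD a consistent estimator of the SD for normal data.
function robustZscore(xs) {
  const med = median(xs);
  const mad = median(xs.map(x => Math.abs(x - med)));
  return xs.map(x => (x - med) / (1.4826 * mad));
}

const vals = [5.1, 5.3, 4.9, 5.0, 5.2, 9.8]; // one outlier well
console.log(zscore(vals).map(z => z.toFixed(2)));
console.log(robustZscore(vals).map(z => z.toFixed(2)));
```

The outlier well inflates the plate SD, so the standard z-score understates how extreme it is relative to the robust version; bulk differences like this are one plausible source of small mismatches against the published scores.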

borgmaan commented 8 years ago

Calculations were implemented in 0deba96 and are running now. The method hits the Couchbase view API instead of the REST API; we could clean this up in the future, but this is likely a one-time calculation.

borgmaan commented 8 years ago

Correlation between the scores is looking good. Here's a plot showing the correlation of 1181 zspc scores from our pipeline against the LINCS HDF5 file.

[screenshot: distribution of per-signature correlations]

Median correlation is quite high at 0.977. There is one poorly correlated sample (id: ZSPC_L1000_NPC_Niclosamide_10_24) with a correlation of 0.16 between the two methods. Here's what the scatterplot for that signature looks like:

[screenshot: scatterplot for signature ZSPC_L1000_NPC_Niclosamide_10_24]
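The per-signature comparison above boils down to a Pearson correlation between our score vector and the LINCS score vector over the same genes. A self-contained sketch (illustrative vectors, not real data):

```javascript
// Illustrative only: Pearson correlation between two score vectors
// (e.g. our zspc scores vs. the published LINCS scores for one signature).

function pearson(x, y) {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my); // covariance term
    dx += (x[i] - mx) ** 2;           // variance of x
    dy += (y[i] - my) ** 2;           // variance of y
  }
  return num / Math.sqrt(dx * dy);
}

// Made-up example vectors for two pipelines scoring the same signature:
const ours = [1.2, -0.4, 2.1, 0.3, -1.5];
const lincs = [1.0, -0.5, 2.3, 0.1, -1.4];
console.log(pearson(ours, lincs).toFixed(3));
```

Computing this per signature and taking the median over all 1181 signatures gives the summary number reported above.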

Here's a random sample of 9 signatures to show the strong correlation seen for most comparisons:

[screenshot: scatterplots for 9 randomly sampled signatures]

So things are looking OK at a high level. One other thing I noticed is that our scores sometimes misbehave at the extremes. The two plots below show the same data with and without restricted axis limits:

[screenshot: extreme-score comparison, with and without restricted axis limits]

ghost commented 8 years ago

Great work. I wonder why they are not exactly the same? Perhaps they did not use the whole plate (e.g., maybe they discarded some control wells). But I would say our pipeline is working as expected.

We should open a new issue to look at the extreme score issue.


borgmaan commented 8 years ago

I have been trying to work that out as well. I think part of the reason might be related to the way we aggregate our scores: we average across all replicates at the cell × pert × dose × time level. I am not sure how you dumped the test data, but it could be that not all replicates were available in the test data for averaging.

This problem also got me thinking about how best to use the replicates for each sample. One of the more interesting approaches I remember reading about is from the SCOREM paper. It was designed to deal with the problem of multiple expression signals resulting from redundant probes on microarrays, but it might have some traction here. There are probably many other (and better) ways of dealing with this.