daynefiler / tcpl

Former repository for the tcpl R package project. New repo:
https://github.com/USEPA/CompTox-ToxCast-tcpl

Robust conc indexing without signif of 1 #11

Open martino165 opened 8 years ago

martino165 commented 8 years ago

Let's try to come up with a clever solution for performing concentration indexing without having to reduce the concentration precision to 1 significant digit.

martino165 commented 8 years ago

Can we just wrap the cndx calculation with a signif of 1 while maintaining the stored concentrations at 3 signifs?

daynefiler commented 8 years ago

@martino165, please see the issue-11 branch. I have updated through level 4, and I think the branch is ready for testing. tcplFit now accepts an optional 'cndx' field that is used to calculate the max_mean, max_med, nconc, and nrep values. The function still works as previously if 'cndx' is omitted. mc4 is also updated to pass 'cndx' to tcplFit. In addition to the level 4 changes, I updated how 'cndx' is calculated in mc1 without changing (rounding, as before) the actual concentration values. Similarly, at level 3 I removed the previous rounding (to 1 significant figure), and changed all of the normalization methods that used concentration to use 'cndx' instead.
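For illustration, a minimal data.table sketch of summary values computed per concentration index rather than per rounded concentration; the column names and the exact definitions (especially nrep) are assumptions here, not the actual issue-11 branch code:

library(data.table)

## toy example with three concentration indices and unrounded concentrations
dat <- data.table(
  cndx = c(1, 1, 2, 2, 3, 3),
  conc = c(1.328, 1.345, 4.98, 5.02, 14.9, 15.1),
  resp = c(2, 4, 10, 12, 55, 60)
)

summ <- dat[ , .(mean_resp = mean(resp), med_resp = median(resp)), by = cndx]
max_mean <- summ[ , max(mean_resp)]           ## max of the per-index mean responses
max_med  <- summ[ , max(med_resp)]            ## max of the per-index median responses
nconc    <- dat[ , length(unique(cndx))]      ## number of distinct concentration indices
nrep     <- dat[ , .N, by = cndx][ , max(N)]  ## one possible definition of replicates per index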

ericwatt commented 8 years ago

Is there an example of a curve where rounding is an issue and this is designed to fix?

I'll need to update toxboot to index by cndx rather than concentration (and to test with any changes made to tcplFit), which is doable if this is fixing a problem, but I just wanted to see how often this comes up. Is there a benefit to considering them the same concentration (same cndx) on the one hand while maintaining different concentration values on the other?

daynefiler commented 8 years ago

@ericwatt I don't have a specific example; it's not fixing any specific problem with fitting (and in fact, the fitting still uses concentration). The only changes at level 4 are listed in my previous comment, and only apply when 'cndx' is not NULL. Previously, we rounded concentration to 3 significant figures at level 1, then 1 significant figure at level 3. To make the package more generalized we want to stop rounding concentration, but still need a way to "group" concentrations for normalization/transformation and summary statistic purposes.

daynefiler commented 8 years ago

I think we may need to do a more aggressive rounding to do the concentration index calculations at level 1.

ericwatt commented 8 years ago

Well, the fits will change if they were previously being rounded to 1 signif in conc and are now not being rounded. Previously it would be passed as three values of log10(1.3 uM), and now it will be log10(1.328), log10(1.345), log10(1.336). It might not change much, but the logc values will be different.

When I sample the data points for bootstrapping, I'm currently grouping by logc (e.g., if there are 3 response values at 0.1 logc, return 3 values sampled from the observed response values at 0.1 logc). If there is a change to rounding, would you suggest grouping by cndx (if there are 3 response values at cndx 1, return 3 values sampled from the observed cndx 1 values)? If so, I guess I would need to assign them back to the different values of logc for cndx 1 somehow. Alternatively, I could keep bootstrapping by concentration, but if there are cases where concentrations are listed as 1.030343432 logc and 1.03034344 logc, I might want to group those somehow (maybe round within my sampling?). That may introduce a divergence between how I fit and how fitting is done in mc4, so I would need to think about it some more.

I guess it comes down to the point you made, if you're grouping concentrations for normalization/transformation/summary stats, are you really considering them different concentrations? And if not, is there a need to maintain a large number of signif?

Is there a case where you have concentrations that are different but within 0.1 signif? This seems like the type of problem where some examples would help to understand the issue.

daynefiler commented 8 years ago

Issues have come up with external users not spacing concentrations as widely as the ToxCast program typically does. For example, consider:

signif(c(10, 15, 25, 50, 150, 200), 1)
# [1]  10  20  20  50 200 200

Ideally, concentrations should be log-spaced and chosen such that they are more than 1 significant figure apart from each other, but this is not always the case. There should probably be methods at level 2 for rounding the concentrations to a given significant figure. Adding that functionality would allow users to exactly duplicate previous analyses. We may also want to add a database field for whether to consider concentration indexing during the fitting, like we do with the fit.all field.

To answer your questions, I don't think we need to maintain more than a couple significant figures. However, we have learned that the number of significant figures to maintain will be study design-dependent. For bootstrapping purposes, I would set up the functionality similar to tcplFit -- if the user provides 'cndx', use it for the sampling (or take a mean/median of the concentrations for that concentration index and draw from those**); if 'cndx' is NULL, proceed as previously (see the sketch after the footnote below).

Thoughts?

** For calculating the 'max_med_conc' and 'max_mean_conc' values I took a median of the concentrations for each concentration index and used those values as the "representative" concentrations.
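A minimal sketch of what that could look like for the bootstrap sampling; the object and column names are hypothetical and this is not toxboot's actual code:

library(data.table)

## toy data: three concentration indices with slightly different concs within each
dat <- data.table(
  cndx = rep(1:3, each = 3),
  conc = c(1.33, 1.35, 1.34, 4.98, 5.02, 5.00, 14.9, 15.1, 15.0),
  resp = rnorm(9)
)

## resample responses within each concentration index
boot <- dat[ , .(resp = sample(resp, .N, replace = TRUE)), by = cndx]

## attach a representative (median) concentration per index, as in the footnote above
reps <- dat[ , .(conc = median(conc)), by = cndx]
boot <- merge(boot, reps, by = "cndx")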

daynefiler commented 8 years ago

@ericwatt I am happy to entertain new ways of implementing this, but we need to develop a workflow that doesn't require the aggressive rounding we did previously. I think we need to spend a little more time thinking about how the concentration indexing is done at level 1. So far I have kept the 3 significant figures we were using previously. I am sure this caused some subtle issues once we rounded to 1 significant figure at level 3, i.e. 1 concentration having more than 1 concentration index.

ericwatt commented 8 years ago

In that case, what was the reason for implementing the rounding in the first place? Are there examples where the concentration is different, but we think it was actually the same? Something like 10.1 uM in one well, 10.12 uM in a different well?

ericwatt commented 8 years ago

Or is it more that if they're close enough, you would like to group them for normalization purposes/max_med calculation even though the concentration was expected to be different?

daynefiler commented 8 years ago

I think the reasoning is more that we feel the concentrations are effectively the same, like in your first example (you could find many, many examples of this if you went through the ToxCast data). I have actually been mistaken in my thinking -- we were never rounding concentration at level 1. I need to revert the changes to mc1 because they are unnecessary. I did round concentration to calculate concentration index, but those rounded concentrations were never written back to the database.

The only rounding that happened (and was stored) was at level 3, where we rounded to 1 significant figure. Our underlying assumption was that any concentrations within 1 significant figure are effectively the same, especially because we subsequently take the base-10 log of the concentrations.

martino165 commented 8 years ago

I think one change to the conc indexing is to do the indexing based on unique concs across the acid-spid and not by repi... so if rep1 has 0.1, 0.5, 1, 5 and rep2 has 1, 5, 10, 50, then you would have cndx 1:6, with 3:4 having both rep1 and rep2.... this should also solve the missing value issue.... where a single data point is missing and it throws off the cndx value for that rep...
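For illustration, indexing across the pooled unique concentrations in that example (using the rank/match approach from the indexfunc quoted later in this thread) gives:

conc <- c(0.1, 0.5, 1, 5,   ## rep1
          1, 5, 10, 50)     ## rep2
as.integer(rank(unique(conc))[match(conc, unique(conc))])
# [1] 1 2 3 4 3 4 5 6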

martino165 commented 8 years ago

All of the other implementation looks really good... essentially we have settled on 3 significant digits of the conc as the primary basis for cndx....

daynefiler commented 8 years ago

3 significant digits is what we have always used. I am still concerned about the changes in general, because I do not remember _why_ we went to 1 significant figure at level 3. There had to be some reason for doing the additional rounding.

ericwatt commented 8 years ago

I think it's worth finding several dozen edge cases that have this type of concentration issue, whether it's synthetic 'someone might submit this' type data or known datasets (Matt, I think you mentioned some possible assays that could have this), running them with and without this change, and seeing what the impact is.

I'm fully on board that if the user submits data with concs c(10, 15, 25, 50, 150, 200), we don't want to round those off to c(10, 20, 20, 50, 200, 200). The question is what are the datasets we DO want to round, and what is our expected behavior for these? This will help determine the proper amount of rounding.

As an extreme edge case, let's say the lab is testing their acoustic sampler and runs a concentration series with very tightly spaced points, but, since it's drop based, with discrete differences between each (let's say nM resolution between 1 nM and 100 nM, 10 nM resolution between 100 nM and 1 uM, and 1 uM resolution between 1 uM and 100 uM). Would you want to group these a certain way?

If there is a known concentration issue, can this be corrected during level 0 pre-pipeline? Effectively, leave it up to the experimental scientist to determine their concentration. Is this a carry-over from pre-spidmap, and now largely unneeded?

martino165 commented 8 years ago

I agree with trying to find out the why and some edge cases but I am also suggesting we fundamentally alter the conc index function to work at the spid level rather than the individual rep level.

martino165 commented 8 years ago

I think this is a more robust cndx approach. The indexfunc appears to be working just as well, but then you just do it based on spid and wllt:

dat[ , cndx := indexfunc(rndc), by = list(spid, wllt)]

For acid 1 and wllt == "c" you end up with cndx 1:15 vs 1:8 previously. Before, these staggered concs would have been overlaid in some weird way..

unique(dat[wllt=="c" & spid == "E2", log10(conc)])
#  [1] -4.397940 -5.096910 -5.795880 -6.494850 -7.221849 -8.000000 -8.591760 -2.698970 -3.301030 -3.886057
# [11] -4.505150 -5.107349 -5.709965 -6.309804 -6.920819

daynefiler commented 8 years ago

Matt, can you please add your indexfunc to the comments? I will incorporate it and we can come up with some tests to run.

I am not understanding why different replicates would end up with different numbers of concentrations. The same spid-acid combinations should have the same concentrations tested. I understand different rounding, or concentration correction between data deliveries, but they should always be the same as we define "replicate." One exception is if the data is not delivered in full, which is less a package problem, and more of a data input problem.

ericwatt commented 8 years ago

One possible reason that replicates might not have the same concs is if an adjustment is made after the first run. Original plate might be 100 nM to 100 uM, but even at 100 nM the resp is 100%. So the next day they scale everything down by 2 log to try and get the curve centered. As more assays are run in house, with the option to pipeline the data the same day they're run, rapid updates become feasible.

martino165 commented 8 years ago

And the results should be the same as before for all "good and complete" data; where there might be holes, this approach may be more robust/fail-proof..

I think I got it to work without actually changing the indexfunc ... just the by statement

## Define concentration index
indexfunc <- function(x) as.integer(rank(unique(x))[match(x, unique(x))])
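## indexfunc assigns each observation the rank of its concentration among the unique values (1 = lowest)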
dat[ , cndx := indexfunc(rndc), by = list(rpid)]  #original
dat[ , cndx2 := indexfunc(rndc), by = list(acid, spid, wllt)] #new
## Define replicate index

you can incorporate and we can test for differences (intended and unintended)
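A quick, hypothetical way to look for those differences, assuming dat carries both cndx and cndx2 as in the snippet above:

dat[ , all(cndx == cndx2)]                     ## TRUE if the two indexings agree everywhere
dat[cndx != cndx2, .N, by = list(acid, spid)]  ## acid/spid pairs where they disagree, with counts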

daynefiler commented 8 years ago

Eric, I would not consider that example true "replicates," even though they may be inadvertently called replicates in the pipeline. In that case, it might be more useful to call the concentration indexing by replicate, because they are not "true replicates."

ericwatt commented 8 years ago

At some concentrations it would be "true replicates" though, right?

plate A: logc -2, -1, 0, 1, 2
plate B: logc -4, -3, -2, -1, 0

You have replication at -2, -1, and 0

Unless I'm misunderstanding or missing the impact from the term 'replicates'.

daynefiler commented 8 years ago

I think of "replicates" in the experimental sense. You run a chemical 3 times on day 1 and 2 on day 2. You would have an "N" of 2, one with 3 replicates and one with 2 replicates.

krball commented 7 years ago

Not sure where y'all landed on this, but line 59 in mc3.R rounds concentrations down to 1 significant figure. I was running somebody’s gene expression data through the pipeline… their concentration series is c(0.75,1.5,3,6) which turns into c(0.8,2,3,6) in subsequent tcpl levels.

EDIT: Also, just to note, this is kind of a problem because the tcpl documentation says that concentrations are rounded to 3 significant figures.

Also... I'm not putting this in a separate issue because it seems related to your general discussion on concentration indices, but I think there's a problem in the plotFit() routine defining the x-axis ticks... look at lines 181 and 182:

181 at = axTicks(side = 1),
182 labels = signif(10^axTicks(side = 1), digits = 1),

I think axTicks default behavior does not automatically account for the log scale? Whatever the case, the x-axis was jacked up in plots generated using the 2-fold increasing concentration series described above. I made my own fix by explicitly putting ticks at the logc values with 10^logc labels, replacing lines 181 and 182 with:

181 at = unique(logc),
182 labels = signif(10^unique(logc), digits = 3),
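For reference, a sketch of how that replacement might sit inside the surrounding axis() call; the call itself is reconstructed here as an assumption about plotFit()'s internals, and only the at/labels arguments come from the fix above:

## hypothetical reconstruction of the plotFit() x-axis call with the fix applied
axis(side = 1,
     at = unique(logc),                                ## ticks at the observed logc values
     labels = signif(10^unique(logc), digits = 3))     ## labels as un-logged concentrations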

I also had to modify the plot domain earlier in the function. Y'all may not be seeing this on the typical concentration series plots that get fed into tcpl (i.e. c(0.3,1,3,10,30)) because log10(.3) is pretty close to -0.5? Just speculating here.