USEPA / CompTox-ToxCast-tcpl

US EPA's Toxicity Forecaster (ToxCast) Pipeline. More information on the ToxCast program available here: https://www.epa.gov/comptox-tools/toxicity-forecasting-toxcast
https://cran.r-project.org/package=tcpl

Redesign mc1 concentration indices (cndx) #122

Closed madison-feshuk closed 11 months ago

madison-feshuk commented 1 year ago

Currently, mc1 establishes the concentration index (cndx) for each source file separately. Sometimes data for the same endpoint/sample are provided across multiple files. If high-dose data arrive in a separate file, cndx restarts for that file, so the high-dose data are assigned the same cndx as the lowest concentration in the first file. This skews results for any aggregation that relies on cndx.
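To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. It is not tcpl's mc1 code (tcpl is an R package); `assign_cndx`, the sample id `S1`, and the file names are hypothetical, and the sketch assumes cndx is a dense rank of conc within each group.

```python
from collections import defaultdict

def assign_cndx(rows, group_keys):
    """Dense-rank conc within each group; returns cndx values aligned with rows.

    Illustrative only: assumes cndx is the rank of conc inside each group.
    """
    concs = defaultdict(set)
    for r in rows:
        concs[tuple(r[k] for k in group_keys)].add(r["conc"])
    # Map each distinct conc in a group to its 1-based rank.
    rank = {
        g: {c: i for i, c in enumerate(sorted(cs), start=1)}
        for g, cs in concs.items()
    }
    return [rank[tuple(r[k] for k in group_keys)][r["conc"]] for r in rows]

# Hypothetical data: one sample, with the top dose delivered in a second file.
rows = [
    {"acid": 2951, "spid": "S1", "srcf": "main.csv", "conc": 0.1},
    {"acid": 2951, "spid": "S1", "srcf": "main.csv", "conc": 1.0},
    {"acid": 2951, "spid": "S1", "srcf": "main.csv", "conc": 10.0},
    {"acid": 2951, "spid": "S1", "srcf": "high_dose.csv", "conc": 100.0},
]

per_file = assign_cndx(rows, ["acid", "spid", "srcf"])  # srcf in the grouping
pooled = assign_cndx(rows, ["acid", "spid"])            # srcf dropped

print(per_file)  # [1, 2, 3, 1]
print(pooled)    # [1, 2, 3, 4]
```

With srcf in the grouping, the 100.0 row gets cndx 1 and would be aggregated with the 0.1 row; without it, the row gets its own, highest index.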

brown-jason commented 1 year ago

@madison-feshuk which acid are you referencing? I'd like to try to replicate the issue.

madison-feshuk commented 1 year ago

acids 2940:2951 are the new Padilla score endpoints

brown-jason commented 1 year ago

See the following query, where cndx 8 is mixed in with cndx 1 at high concs:

    select spid, cndx, conc, rval, srcf
    from mc1
    inner join mc0 on mc0.m0id = mc1.m0id
    where mc1.acid = 2951 and spid = '1208990427'
    order by spid, conc desc;

bug.txt
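A check like the one below could be run over the rows returned by that query to find affected samples automatically. This is only a sketch under assumptions: the rows are taken to be dictionaries with `spid`, `cndx`, and `conc` keys, `inconsistent_samples` is a hypothetical helper rather than a tcpl function, and the data shown are made up.

```python
from collections import defaultdict

def inconsistent_samples(rows):
    """Return the spids whose cndx ever decreases as conc increases."""
    by_spid = defaultdict(list)
    for r in rows:
        by_spid[r["spid"]].append((r["conc"], r["cndx"]))
    flagged = []
    for spid, pairs in by_spid.items():
        pairs.sort()  # order by conc (ties broken by cndx)
        cndx_seq = [cndx for _, cndx in pairs]
        # A correct index never goes down while conc goes up.
        if any(a > b for a, b in zip(cndx_seq, cndx_seq[1:])):
            flagged.append(spid)
    return sorted(flagged)

# Made-up rows: sample "A" has its top conc indexed as 1, sample "B" is fine.
rows = [
    {"spid": "A", "conc": 0.1, "cndx": 1},
    {"spid": "A", "conc": 1.0, "cndx": 2},
    {"spid": "A", "conc": 10.0, "cndx": 3},
    {"spid": "A", "conc": 100.0, "cndx": 1},
    {"spid": "B", "conc": 0.1, "cndx": 1},
    {"spid": "B", "conc": 1.0, "cndx": 2},
]

flagged = inconsistent_samples(rows)
print(flagged)  # ['A']
```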

brown-jason commented 1 year ago

These are the grouping variables: acid, spid, wllt, srcf, apid, conc.
The proposal is to get rid of the srcf grouping. Is there a situation where a user would test the same sample in 2 plates? Is there a situation where the user would test the same sample but give 2 different well types?

kpaulfriedman commented 1 year ago

I think there is a use case where you'd test the same spid in two sample plates, potentially run at different times with different source files. But apid should pick that up as a grouping, right?

brown-jason commented 1 year ago

I believe apid should pick that up, but are there cases where we have a null apid and determine uniqueness based on srcf? If so, removing srcf would break cndx in those cases.

kpaulfriedman commented 1 year ago

Perhaps the default index could include srcf, with an option to exclude it? If an option to exclude srcf is too much detail, I suppose you could instead look for the number of spid duplicates with the same apid (or null apid), where srcf is the unique-ifier. The reverse situation is possible too: in a task order I just designed, the same spid will be spread across two apid, within the same srcf, to get the full concentration range. So if we grouped only by apid, that would result in two curves for the same spid...
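A rough Python sketch of what such an option might look like, using the task-order scenario as the example. Everything here is an assumption for illustration: `assign_cndx`, its `exclude` parameter, `DEFAULT_KEYS`, and the sample data are hypothetical and are not part of tcpl's API.

```python
from collections import defaultdict

# Grouping variables named in this thread, minus conc (which is what gets ranked).
DEFAULT_KEYS = ("acid", "spid", "wllt", "srcf", "apid")

def assign_cndx(rows, exclude=()):
    """Dense-rank conc within groups; `exclude` drops keys from the default grouping."""
    keys = [k for k in DEFAULT_KEYS if k not in exclude]
    concs = defaultdict(set)
    for r in rows:
        concs[tuple(r[k] for k in keys)].add(r["conc"])
    rank = {
        g: {c: i for i, c in enumerate(sorted(cs), start=1)}
        for g, cs in concs.items()
    }
    return [rank[tuple(r[k] for k in keys)][r["conc"]] for r in rows]

# Hypothetical task-order layout: one spid, one srcf, concentrations split over two plates.
base = {"acid": 1, "spid": "S1", "wllt": "t", "srcf": "run.csv"}
rows = [
    {**base, "apid": "P1", "conc": 0.1},
    {**base, "apid": "P1", "conc": 1.0},
    {**base, "apid": "P2", "conc": 10.0},
    {**base, "apid": "P2", "conc": 100.0},
]

default_cndx = assign_cndx(rows)                   # grouped by apid: two short curves
merged_cndx = assign_cndx(rows, exclude=("apid",)) # one curve over the full range

print(default_cndx)  # [1, 2, 1, 2]
print(merged_cndx)   # [1, 2, 3, 4]
```

The same `exclude` argument would cover the split-file case by passing `("srcf",)`, while leaving the default grouping unchanged for data sets that rely on srcf when apid is null.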

madison-feshuk commented 1 year ago

Changing the ticket subject, since I think we might just need to redesign how cndx is established, e.g. with methods that allow the user to specify how cndx is set.

With the Padilla processing, I ran into issues because of the grouping by srcf and apid; coli and/or rowi also seem to be included. As a workaround, the grouping variables could be set to NULL so that cndx is based entirely on conc.

madison-feshuk commented 1 year ago

There appears to be a repi limit based on the plate size selected. With a 96-well plate, repi can only reach 12 before cndx restarts. For this dataset, more replicates were screened at a high dose (cndx=8), but they are being grouped with the lowest dose (cndx=1) because of the apparent repi limit.
