Map: Identify all variables that require 'Plan D' annotations

d-callan commented 1 year ago

We have entities like Sample and Assay which have records representing multiple specimens. Plan D involves doing things like incorporating weighted means into our repertoire of stats, but that will only work for those variables which do actually represent specimens. Ex: the pathogen detection assay has variables like pathogen prevalence (the percent of specimens positive for some pathogen in a 'pool') and pathogen presence (values are like 'present' or 'absent' for the entire pool; the most you can say is that at least one specimen was positive). Plan D will work for variables like pathogen prevalence, which represents data about specimens, but not for pathogen presence, which represents data about pools.

We need to know, so we can annotate, all terms which are not appropriate for Plan D.

d-callan commented 1 year ago

lol. sorry, i should probably put this somewhere else... hmm.

bobular commented 1 year ago

Controversial thought: In the popbio data, Plan D only applies to the surveillance samples. These are the only unbiased samples where the specimen count is of relevance to visualization/analysis.

Specimen counts for insecticide resistance assays are good to know, but if an assay has 300 input specimens, the output of that assay (e.g. a percent mortality to insecticide X) is not 10x more important than an assay with 30 input specimens.

Same goes for genotyping, pathogen and blood meal host assays, as far as I can tell.

d-callan commented 1 year ago

while it's true that sometimes you weight values by 'importance', that's not really my concern here. to me this is a per-variable consideration, not per-entity. What we need to know is, for some variable on say an assay entity, does the variable fundamentally represent a measure of the assay or the specimens? take this ex:

you have pathogen prevalence (as a percent or decimal representation, w/e) which was found by assaying a bunch of specimens and summing the positive, dividing by the total. That is a measure on specimens, and the ONLY statistically valid way to produce a summary value on that variable (which itself fundamentally represents a summary value of specimens) is to weight them by the number of specimens they represent. to give this example some numbers, if it helps:

Assay X had 2 input specimens and 100% pathogen prevalence. Assay Y had 98 input specimens and 1.02% pathogen prevalence (1 positive). If i average over the assays without weighting i get a mean pathogen prevalence of ~50%, when in reality it's 3% (3 positives out of 100 specimens).

(in other words, it doesn't matter here whether the number of input specimens is above some minimum threshold for inclusion/statistical significance. it matters that we have any two groups w differing numbers of input specimens when the variable is a summary statistic of specimens.)
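To make the arithmetic in the Assay X / Assay Y example concrete, here is a minimal sketch in plain Python. The record layout and the function names `unweighted_mean` and `weighted_mean` are made up for illustration; they are not taken from any EDA code.

```python
# Hypothetical records: (assay name, number of input specimens, prevalence in %).
# The numbers are the ones from the example above.
assays = [
    ("Assay X", 2, 100.0),           # 2 of 2 specimens positive
    ("Assay Y", 98, 100.0 * 1 / 98), # 1 of 98 specimens positive (~1.02%)
]


def unweighted_mean(records):
    """Average the per-assay prevalence, treating every assay equally."""
    return sum(prev for _, _, prev in records) / len(records)


def weighted_mean(records):
    """Average the per-assay prevalence, weighting by specimen count."""
    total_specimens = sum(n for _, n, _ in records)
    return sum(n * prev for _, n, prev in records) / total_specimens


print(round(unweighted_mean(assays), 2))  # 50.51
print(round(weighted_mean(assays), 2))    # 3.0
```

The weighted result matches what you get by pooling the specimens directly: 3 positives out of 100.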

bobular commented 1 year ago

I totally agree with that and thanks for the example. I think what was at the back of my mind is the following scenario:

Study A did two assays: 25 specimens used in Assay X, 30 specimens used in Assay Y. Study B also did two assays: 100 specimens in Assay P, and 110 in Assay Q.

Within a study, I don't see much of a problem weighting by specimen count, but when presenting data across multiple studies (as in the megastudy), Study B (210 specimens) shouldn't get weighted ~4x more than Study A (55 specimens). (Assuming that ~25 specimens was enough and 100 was overkill.)
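A rough sketch of how the two weighting choices play out for the specimen counts above. Only the counts come from the scenario; the function names and the "study-balanced" scheme are assumptions for illustration, not something Plan D implements, and the sketch assumes assay names are unique across studies.

```python
# Specimen counts from the scenario above, grouped by study.
studies = {
    "Study A": {"Assay X": 25, "Assay Y": 30},
    "Study B": {"Assay P": 100, "Assay Q": 110},
}


def pooled_weights(studies):
    """Weight every assay by specimen count, ignoring study boundaries."""
    total = sum(n for assays in studies.values() for n in assays.values())
    return {
        name: n / total
        for assays in studies.values()
        for name, n in assays.items()
    }


def study_balanced_weights(studies):
    """Weight by specimen count within a study; give each study an equal share."""
    share = 1 / len(studies)
    weights = {}
    for assays in studies.values():
        study_total = sum(assays.values())
        for name, n in assays.items():
            weights[name] = share * n / study_total
    return weights


def study_share(weights, studies, study):
    """Total weight carried by all assays of one study."""
    return sum(weights[name] for name in studies[study])


pooled = pooled_weights(studies)
balanced = study_balanced_weights(studies)
print(round(study_share(pooled, studies, "Study B"), 3))    # 0.792
print(round(study_share(balanced, studies, "Study B"), 3))  # 0.5
```

With pooled weights Study B carries 210/265 of the total, roughly 3.8 times Study A's share. The balanced scheme is only one crude alternative, and it does not replace a proper treatment of study-level effects.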

A proper statistical comparison would consider the "mixed effects" of Study and Assay, and probably at other levels too. I don't think we have to attempt any of that - but we do need to provide adequate disclaimers. I don't think we'll get sued for showing crudely summarised data on markers on the map, but (floating) boxplots could be misleading.

d-callan commented 1 year ago

I agree w the point that we aren't accounting for differences between studies at all in plan d. To me though, that train of thought leads to questions around use cases for the megastudy and when and how we should make megastudies.

bobular commented 11 months ago

Done for Plan B part 1 at least (surveillance samples). Annotations added to owl file.

VEuPathDB / EdaLoadingIssues

Map: Identify all variables that require 'Plan D' annotations #66