Big-Life-Lab / PHES-ODM

The Public Health Environmental Surveillance Open Data Model (PHES-ODM, or ODM). A data model, dictionary and support tools for environmental surveillance.
Creative Commons Attribution Share Alike 4.0 International

concentration measurements replicates vs mean/SD #56

Closed davidchampredon closed 3 years ago

davidchampredon commented 3 years ago

From a data analysis and statistical point of view, we would greatly benefit from having the concentration measurements of all the replicates rather than a summary statistic (e.g., mean, standard deviation). This "issue" is a formal request to include replicate measurements in the minimal data set. Thanks!

hswerdfe commented 3 years ago

@DougManuel it might be useful to assign Vince to this once we get his username linked to this repo.

vipileggi commented 3 years ago

@davidchampredon this was my initial view as well, but as I thought more about the errors associated with (1) sampling the wastewater or sludge, (2) lab analysis, (3) wastewater surveillance modeling (WWSM) for trends, and (4) epi modeling predictions, I did not think replicate values would improve the predictive value of generated results. In all cases, I think the error associated with lab analysis (replicates vs. mean) is significantly smaller than the other inherent errors.

I think we have an opportunity (maybe an obligation) to test this using available data sets and current data analysis/modeling approaches if we intend to make 'replicate' reporting a requirement. I suggest we use the current Ottawa data (ww_virus.csv) and possibly one other data set (from NML?) to test this hypothesis. I can ask Robert Delatolla to provide complete replicate data for the last 30 days to test this issue (I assume 30 data points of N1 and N2, both replicates and mean/SD, should be adequate?).

In terms of WWS analysis, I'm leaning towards two approaches: (1) the CDC recommendations (https://www.cdc.gov/coronavirus/2019-ncov/cases-updates/wastewater-surveillance/data-reporting-analytics.html) for analyzing WWS data to help PHUs make better-informed decisions, and (2) WWS modeling approaches; I'm actively searching for candidates (e.g., https://www.nature.com/articles/s41587-020-0684-z). What do you suggest? For epi modeling there are a number of models available, but what do you suggest?

I would like to hear more about your specific concerns and whether you agree that this approach may provide some concrete support for either argument. In the interim, I will ask Robert for the 'replicate data', paraphrasing the issue you raise.

vipileggi commented 3 years ago

@davidchampredon I've requested 'replicate data' from Robert Delatolla, but I'm not sure if we're going to receive the data in a timely fashion. I suggest we proceed (if you agree) with an assessment using the NML 'replicate values'. It was suggested by @DougManuel (in another discussion) that the university labs are not comparable to NML and may be challenged by additional workflow requirements. Also, my understanding is that private production labs are beginning to set up to take on WWS in Ontario in the near future, and enhanced reporting requirements would definitely be requested of them, similar to what NML reports.

davidchampredon commented 3 years ago

thanks @vipileggi !

I do agree that there are several sources of uncertainty, and I think having the replicates available would help to assess the uncertainty coming from lab analysis (your point 2) and, to some extent, from sampling (point 1).

What motivates raising this issue is improving our ability to capture the uncertainty associated with normalized concentration measurements when analyzing their time series. In other words, is the signal at time t1 significantly different from the one observed at time t2? So what I have in mind is somewhat narrower than the CDC reference you gave, but close to the Peccia et al. paper. To assess whether the signals at different times are different, I would (for example) fit a distribution to the replicates for each day, and then calculate the probability that, say, signal(t1) > signal(t2) from the overlap of those two distributions. The problem (I think) is the choice of distribution.
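For concreteness, a minimal sketch of that comparison, with invented replicate values and an assumed log-normal fit (neither the numbers nor the distribution choice come from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized-concentration replicates for two sampling days
reps_t1 = np.array([0.80, 1.10, 0.95])
reps_t2 = np.array([1.40, 1.60, 1.30])

def fit_lognormal(x):
    """Fit a log-normal by matching the mean and SD of the log-values."""
    logs = np.log(x)
    return logs.mean(), logs.std(ddof=1)

mu1, s1 = fit_lognormal(reps_t1)
mu2, s2 = fit_lognormal(reps_t2)

# Monte Carlo estimate of P(signal(t2) > signal(t1)) from the fitted distributions
n = 100_000
p = (rng.lognormal(mu2, s2, n) > rng.lognormal(mu1, s1, n)).mean()
print(f"P(signal(t2) > signal(t1)) ~ {p:.2f}")
```

This kind of probability statement is only possible if the individual replicates are reported; it cannot be recovered from a mean/SD pair once the underlying distribution is non-normal.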

I see the normalized concentration as the ratio of two normal random variables (raw number of SARS-CoV-2 RNA copies/mL divided by the raw number of, say, PMMV RNA copies/mL). Each replicate is drawn from its respective normal distribution. The problem is that the ratio of two normal distributions is Cauchy-like; in particular, its mean and standard deviation do not exist. So providing the mean and standard deviation (again, conceptually) does not make sense. Also, if there are only a few replicates (say 2 or 3), the mean and standard deviation are really not informative. Hence, having the replicates would allow us to use a non-normal distribution for the statistical analysis.

In some cases (denominator not too variable, numerator variable enough), the ratio of normal distributions can be approximated by a normal distribution. We could check that with actual data and find that I'm making a fuss over nothing ;) But I think that should be checked before settling on the reporting standard for this measurement. Having the Ottawa replicate data would be great (I think we could also get NML's).
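A quick simulation of that point (all numbers illustrative, chosen by me): when the denominator's distribution stays far from zero the ratio is well-behaved, but when it has mass near zero the ratio becomes heavy-tailed and its sample mean is unstable.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Numerator: e.g. SARS-CoV-2 copies/mL (illustrative scale)
x = rng.normal(100.0, 30.0, n)

# Case 1: denominator far from zero (CV = 5%) -> ratio approximately normal
y_tight = rng.normal(1000.0, 50.0, n)
ratio_tight = x / y_tight

# Case 2: denominator with mass near zero (CV = 80%) -> Cauchy-like heavy tails
y_wide = rng.normal(50.0, 40.0, n)
ratio_wide = x / y_wide

print("tight denominator, sample mean:", ratio_tight.mean())  # stable, ~0.1
print("wide denominator, max |ratio|:", np.abs(ratio_wide).max())  # extreme outliers
```

In the first case a mean/SD summary loses little; in the second, the extreme draws dominate and a mean/SD pair is essentially meaningless, which is the argument for reporting the replicates themselves.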

I'm generally wary of asking the labs to report too much. Here, I hope the difference between reporting the mean and standard deviation (2 data points) versus the replicates (from what I've seen so far, about 3 data points) is acceptable. Data analysis based on replicates would do justice to the labs' hard work :)

davidchampredon commented 3 years ago

(The comment above replies to @vipileggi's first comment; I didn't see the second one while writing!) Regarding @vipileggi's second comment: OK, thanks a lot.

vipileggi commented 3 years ago

thanks @davidchampredon! Is this the Peccia paper you're referring to: https://www.nature.com/articles/s41587-020-0684-z? And will you be able to provide a sample analysis based on NML data, with reference to Peccia's method, within the next two weeks? We would then be able to present this to the three primary labs once we have an internal discussion on a strategy to recommend changes to the reporting requirements, or at least let them know of the issue more fully, if we agree that it's critical to make this change/request. I'm notifying @hswerdfe and @DougManuel about this potential development. Sorry for the short time frame, but MECP wants to have the data template available for use by Jan 15, 2021.

vipileggi commented 3 years ago

Sorry @davidchampredon! I realized this morning that my previous comment may have gone well beyond this issue, and I may need to open a new issue covering which WW surveillance 'data products' would be recommended as useful to PHUs. I'm starting a review of related publications and would appreciate any direction (@DougManuel, @hswerdfe) in this area. Thanks

vipileggi commented 3 years ago

Hi @davidchampredon, I received a preliminary response from Robert Delatolla on my request and he describes the process from 'raw' to 'reported data' as follows:

> The difference between raw data and reported data is as follows: reported data is what we report to public health units for all of our sites. All reported data are subject to the following verification prior to reporting, to ensure quality (so only verified data are reported): (i) we first define a 'PCR LOD' and do not report data that do not meet the criterion of a number of copies per reaction corresponding to a detection rate of ≥ 95%, as recommended by the MIQE guidelines; (ii) standard curves on the plates must show an R-squared ≥ 0.95, or the data are not reported; (iii) copies/reaction must be in the linear dynamic range of the standard curves, or again the data are not reported; (iv) primer efficiency must be between 90% and 130%; and (v) sample replicates with values greater than 2 standard deviations from the mean of the triplicates are tagged as possible anomalies and discarded (this criterion distinguishes raw data sets from reported data sets; under all the other criteria, the PCR plate would have to be re-run).

Would the 'distinct values' prior to calculating averages of normalized values be adequate for your analysis, or do you need to go to an earlier step (i.e., prior to normalization by a fecal indicator, etc.)? @DougManuel and @hswerdfe, if you have comments on this issue please let me know.

vipileggi commented 3 years ago

@davidchampredon I was rereading your first comment, where you refer to

> having the concentration measurements of all the replicates

Do you mean replicate values (GC/mL of WW or sludge) prior to normalization? If so, then we would need the PMMV or crAssphage values as well.

hswerdfe commented 3 years ago

@vipileggi My main comment on the specific issue of mean/SD vs. individual values is that I largely agree with @davidchampredon: individual values would be better to have, as mean/SD generally hide a lot. For example, what if an individual response is below the LOQ but above the LOD? There are many techniques that could be used to fill in the value when calculating the mean; these might all be valid, but they would not be consistent across labs.

vipileggi commented 3 years ago

@hswerdfe I'm expecting to receive a sample set of individual-value data (Robert Delatolla agreed to provide the individual values), and we will then have an opportunity to make a quantitative argument addressing the issue @davidchampredon raised. So far, both @DougManuel and @hswerdfe are aware that Robert has indicated he is prepared to submit replicate values if it will help. However, others have not committed to it or provided alternative reporting schemes.

vipileggi commented 3 years ago

@davidchampredon attached is the data I received from Jordan Peccia (Excel file) on replicate data (both @DougManuel and @hswerdfe have seen this), and you may already have this data. Wondering if we can generate a figure similar to what is on the Yale Tracker, using both the mean data and the individual replicate data (similar to the figure below), but adding error margins in at least the prediction portion of the plot. Below is an example of what the YT provides, with the boxed/underlined highlighting mine.

[screenshot of the Yale Tracker plot]

Copy of Final Figure 2 data (1).xlsx

vipileggi commented 3 years ago

@davidchampredon we received 'Ct' data values from Patrick D'Aoust at Ottawa U. (Excel file) and wondered (@DougManuel, @hswerdfe) if this data is adequate, or whether we need standard curves or direct concentration data. I'm going to ask for the concentration data now in case we need it:

[screenshot of the Ct data]

Copy of Sample data for Vince (MECP) - PD.xlsx

vipileggi commented 3 years ago

@davidchampredon, @DougManuel and @hswerdfe I asked Jordan Peccia about the Yale Tracker 'prediction' part of the plot in my previous message, and he mentioned that they're working on finalizing the stats model and that a preprint will hopefully be available in Jan. 2021.

vipileggi commented 3 years ago

@davidchampredon, @DougManuel and @hswerdfe is this concentration data in replicate what we need? I believe the concentration data is in GC (N1 or N2)/mL of sludge, and I'm not sure whether this value has been normalized. This is from J. Peccia at https://github.com/weinbergerlab/New_Haven_Sewage/tree/master/Data

[screenshot of the replicate concentration data]

hswerdfe commented 3 years ago

@vipileggi I am a little confused by the above: are those counts regarding tests of people or tests of WW? And when I click the link I get an error.

Anyway, concentration data in replicate is the original request by @davidchampredon.

So, for example, a lab that measures N1 and N2, then normalizes by PPMoV, and also takes 2 independent measurements of each, would report 6 numbers: N1_ml_1, N1_ml_2, N2_ml_1, N2_ml_2, PPMoV_ml_1, PPMoV_ml_2

If they take 3 observations of each, then they would report 9 numbers.
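A sketch of what that reporting could look like in a long ("tidy") layout, one row per target per replicate (the column names and values here are hypothetical, not from the ODM dictionary):

```python
# Hypothetical duplicate measurements for one sample (values illustrative)
rows = [
    {"sample": "S-001", "target": "N1",    "replicate": 1, "gc_per_ml": 13.2},
    {"sample": "S-001", "target": "N1",    "replicate": 2, "gc_per_ml": 11.8},
    {"sample": "S-001", "target": "N2",    "replicate": 1, "gc_per_ml": 9.4},
    {"sample": "S-001", "target": "N2",    "replicate": 2, "gc_per_ml": 10.1},
    {"sample": "S-001", "target": "PMMoV", "replicate": 1, "gc_per_ml": 1.2e4},
    {"sample": "S-001", "target": "PMMoV", "replicate": 2, "gc_per_ml": 1.1e4},
]

# 3 targets x 2 replicates = 6 reported numbers; triplicates would give 9 rows
assert len(rows) == len({r["target"] for r in rows}) * 2
```

A long layout like this scales naturally with the number of replicates, so labs running duplicates and labs running triplicates can report into the same schema.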

vipileggi commented 3 years ago

@hswerdfe I just recopied the link; it worked for me when I cut and pasted it into the browser, but clicking on it in Preview took me to the current site. I don't understand why it's doing this: https://github.com/weinbergerlab/New_Haven_Sewage/tree/master/Data. The data is combined but not well documented.

davidchampredon commented 3 years ago

> Hi @davidchampredon, I received a preliminary response from Robert Delatolla on my request and he describes the process from 'raw' to 'reported data' as follows:
>
> The difference between raw data and reported data is as follows: reported data is what we report to public health units for all of our sites. All reported data are subject to the following verification prior to reporting, to ensure quality (so only verified data are reported): (i) we first define a 'PCR LOD' and do not report data that do not meet the criterion of a number of copies per reaction corresponding to a detection rate of ≥ 95%, as recommended by the MIQE guidelines; (ii) standard curves on the plates must show an R-squared ≥ 0.95, or the data are not reported; (iii) copies/reaction must be in the linear dynamic range of the standard curves, or again the data are not reported; (iv) primer efficiency must be between 90% and 130%; and (v) sample replicates with values greater than 2 standard deviations from the mean of the triplicates are tagged as possible anomalies and discarded (this criterion distinguishes raw data sets from reported data sets; under all the other criteria, the PCR plate would have to be re-run).
>
> Would the 'distinct values' prior to calculating averages of normalized values be adequate for your analysis, or do you need to go to an earlier step (i.e., prior to normalization by a fecal indicator, etc.)?

I think the data analysis (e.g., by modellers) should be done on the reported data, not the raw data (i.e., after quality control). Ideally, the reported data would be the concentrations (in copies/mL, not Ct values) of all the replicates that passed quality control, for both SARS-CoV-2 and the fecal normalizer (e.g., PMMV).

For example, if X1, X2, X3 are the concentrations of one SARS-CoV-2 gene target (say, the N1 gene) for each triplicate, and Y1, Y2, Y3 the concentrations of the fecal marker, then reporting X1, X2, X3, Y1, Y2, Y3 would be best. The data analysis would then deal with normalizing (e.g., Xi/Yi), averaging (if needed), etc. If there were another gene target (say, N2), then also reporting the concentrations of each of its replicates (say Z1, Z2, Z3) would be ideal (i.e., not averaging across the different target concentrations before reporting).
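As a toy illustration of that division of labour (all numbers invented), normalization and averaging happen downstream of reporting, on the replicate values themselves:

```python
# Hypothetical reported triplicates (copies/mL)
X = [120.0, 95.0, 110.0]            # SARS-CoV-2 N1 target
Y = [11_000.0, 9_000.0, 10_000.0]   # fecal marker, e.g. PMMV

# Analysts normalize replicate-wise (Xi / Yi), then summarize only if needed
normalized = [x / y for x, y in zip(X, Y)]
mean_normalized = sum(normalized) / len(normalized)
print(normalized, mean_normalized)
```

If the lab instead reported only mean(X)/mean(Y) (or a mean and SD of pre-normalized values), the replicate-level variability feeding into the normalized signal would be unrecoverable.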

davidchampredon commented 3 years ago

> @davidchampredon attached is the data I received from Jordan Peccia (Excel file) on replicate data (both @DougManuel and @hswerdfe have seen this), and you may already have this data. Wondering if we can generate a figure similar to what is on the Yale Tracker, using both the mean data and the individual replicate data (similar to the figure below), but adding error margins in at least the prediction portion of the plot.
>
> Copy of Final Figure 2 data (1).xlsx

@vipileggi: indeed, our goal is to be able to produce similar plots (among other things). @hswerdfe is building and populating the database that we (data analysts and modellers) will work from. We could check whether the data for the location you have in mind (which one?) is in the database.

DougManuel commented 3 years ago

Checking in about this issue. We are planning a version 1.0 for next week.