cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License

Wastewater: NWSS SARS-Cov2 datasets #1901

Open willtownes opened 1 year ago

willtownes commented 1 year ago

Publicly available signals from the National Wastewater Surveillance System (NWSS)

Data details

There are basically three signals from two datasets. Both datasets are served via a Socrata API.

In the metrics dataset we have the percentile signal, which is the highest priority of the three, along with the less useful ptc_15d signal. Varkila et al. 2023 is a useful reference showing how these two signals might be used. There are also some other possible signals pertaining to whether gene copies were detected or not. These data are normalized by flow rates and/or population (not fecal biomarkers like PMMoV), and the percentile signal is also normalized by its maximum value within each site.

In the concentration dataset we have a smoothed, normalized metric called pcr_conc_smoothed. The normalization is not the same across sites; some of the values are normalized using human fecal biomarkers (usually PMMoV).

Geographic resolution: roughly county level (it is really at the wastewater treatment plant catchment basin level, which may be larger or smaller than a county). Temporal resolution: intermittent, averaging 1-3 observations per week. I don't think any processing is required (perhaps later we can think about mapping the numbers to the county level by dividing by population sizes, etc.). Note that they drop many of the locations that have small population sizes for privacy reasons. I recommend including all of the columns from both datasets. I don't anticipate substantial backfill effects, but it might be worth tracking to find out.
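For concreteness, here is a minimal sketch of pulling both tables from the Socrata API. The dataset identifiers below (2ew6-ywp6 for the metrics data and g653-rqe2 for the concentration data) are my best guess at the public endpoints and should be verified against data.cdc.gov before use.

```python
# Hedged sketch: page through both NWSS tables via the Socrata JSON endpoints.
# The dataset ids are assumptions; double-check them on data.cdc.gov.
import pandas as pd
import requests

SOCRATA_BASE = "https://data.cdc.gov/resource"
DATASETS = {
    "metric": "2ew6-ywp6",         # assumed id for the wastewater metric data
    "concentration": "g653-rqe2",  # assumed id for the concentration data
}

def fetch_dataset(dataset_id, limit=50000):
    """Page through a Socrata endpoint and return all rows as a DataFrame."""
    rows, offset = [], 0
    while True:
        resp = requests.get(
            f"{SOCRATA_BASE}/{dataset_id}.json",
            params={"$limit": limit, "$offset": offset},
            timeout=60,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        rows.extend(batch)
        offset += limit
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df_metric = fetch_dataset(DATASETS["metric"])
    df_concentration = fetch_dataset(DATASETS["concentration"])
    print(df_metric.shape, df_concentration.shape)
```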

Additional context

willtownes commented 1 year ago

@nmdefries ready for your review

dsweber2 commented 11 months ago

Looking at the datasets, it's looking like there may be some gnarly geocoding issues, at least with the metrics dataset. It looks like some sample sites are tied to multiple FIPS codes (at most 2-3), and some FIPS codes have multiple wwtp_id's (anywhere from 1-30). In addition, some of the sites where they're sampling upstream of the wastewater treatment plant have multiple sampling locations tied to them. Maybe by using the key_plot_id column we can sidestep this, but I suspect making signals at the county level will be messy, and state level too. I'm planning on setting up a tracker locally to see if there's much revision going on.

nmdefries commented 11 months ago

Hm, sounds complicated.

From the data source, county_id is described as

5-digit numeric FIPS codes of all counties and county equivalents served by this sampling site (i.e., served by this wastewater treatment plant or, if 'sample_location' is "upstream", then by this upstream location). Note that multiple sampling sites or treatment plants may serve a single county, and that a single sampling site or treatment plant may serve multiple counties. Counties listed may be entirely or only partly served by this sampling site.

There seems to be an emphasis on anonymity of the sites, which means that, although the data includes population estimates for the sample locations, a direct mapping of sample site to, e.g., % of each county served will not be available. Unclear how much of the geographic arrangement of sites can be reconstructed using the combination of population and FIPS. Maybe there are existing mappings of WW treatment plants to FIPS, etc.?

@willtownes What percent of wastewater treatment plants or of the US population does this dataset cover? Do we know if a given area can be served by both an upstream sample site and a WW treatment plant?

nmdefries commented 11 months ago

Do any of these sampling sites cross state boundaries? Edit: This page about another NWSS signal says they can.

New York state WW treatment plant boundaries paper.

willtownes commented 11 months ago

Looking at the datasets, it's looking like there may be some gnarly geocoding issues, at least with the metrics dataset. It looks like some sample sites are tied to multiple FIPS codes (at most 2-3), and some FIPS codes have multiple wwtp_id's (anywhere from 1-30). In addition, some of the sites where they're sampling upstream of the wastewater treatment plant have multiple sampling locations tied to them. Maybe by using the key_plot_id column we can sidestep this, but I suspect making signals at the county level will be messy, and state level too.

Here's my response to David's comment that was previously on slack:

Yes, the geography of the wastewater plants (catchment areas) is not neatly aligned with political boundaries. We are currently negotiating to get the "full" version of the NWSS data, which we will not be able to publish on the API, but I think it might have more detailed information about the plants, such that we could try some GIS tricks to intersect the catchment polygons with the county polygons. We would then have to decide whether or how to apportion the WW numbers to the corresponding portions of counties. Let's call this "strategy 1". Other possibilities:

(2) Define a new type of geo code that essentially uniquely identifies a wastewater catchment area. We could then provide a cross-mapping table that lists the association between each catchment area and the corresponding counties (and vice versa), and let users figure out how they want to apportion (if at all) to the counties (see the sketch below).

(3) If we get the lat/long of the treatment plants, we could do some spatial interpolation and just use those numbers to get county- and state-level data (basically a variant of (1)).

(4) Exclude any sites that do not uniquely map to a single county. I don't like this option, but it is the easiest.
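To make strategy (2) a bit more concrete, here is a minimal sketch of building the cross-mapping table, assuming the metrics dataset provides key_plot_id and a comma-separated county_fips column (column names should be checked against the actual schema):

```python
# Sketch of strategy (2): one row per (catchment, county) pair, assuming
# `county_fips` is a comma-separated string of 5-digit FIPS codes.
import pandas as pd

def build_crosswalk(df_metric: pd.DataFrame) -> pd.DataFrame:
    xw = df_metric[["key_plot_id", "county_fips"]].drop_duplicates().copy()
    xw["county_fips"] = xw["county_fips"].str.split(",")
    xw = xw.explode("county_fips")
    xw["county_fips"] = xw["county_fips"].str.strip().str.zfill(5)
    return xw.drop_duplicates().reset_index(drop=True)

# Users could then join signals on key_plot_id and decide for themselves how
# (or whether) to apportion a catchment's value across its counties.
```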

willtownes commented 11 months ago

There seems to be an emphasis on anonymity of the sites,

Correct.

Although the data includes population estimates for the sample locations, a direct mapping of sample site and, e.g., % of each county served will not be available.

Not with the public dataset we currently have access to, but we may be able to get at this via the "full" or "private" version of the NWSS dataset, for which we are in the process of getting a data use agreement; I expect it to be approved in the next few weeks.

Unclear how much of the geographic arrangement of sites can be reconstructed using the combo of pop and FIPS. Maybe there are existing mappings of WW treatment plant to FIPS, etc?

The current data should be sufficient to know whether a site is associated with a county or not (i.e., that the catchment basin of the plant intersects with the county boundary). However, we have no idea whether that is just a small sliver of the county or the entire county.

What percent of wastewater treatment plants or of the US population does this dataset cover?

I don't know the exact number (it may be listed on the NWSS website), but if I had to guess I would say it probably covers something like 30-40% of the US population and maybe 50-60% of the treatment plants. Keep in mind that, even in the future, it will pretty much never capture people in rural areas who have septic tanks.

Do we know if a given area can be served by both an upstream sample site and a WW treatment plant?

I believe this is possible but I don't have a concrete example off the top of my head.

dsweber2 commented 11 months ago

try to do some GIS tricks to intersect the catchment polygons with the county polygons 1/3

Doesn't this end up assuming that spatial proportions correspond to population proportions? Probably better than a uniform split, but sounds like a lot of work.

(4) we could just exclude any sites that do not uniquely map to a single county

I will say there are only 7 wastewater plants out of 1538 (or 13 key_plot_id's out of 1735) that don't correspond to unique counties, so they're very much exceptions. They are

         population_served        wwtp_jurisdiction                                       county_names
wwtp_id
569                 23,896         [North Carolina]                            [Lenoir| Greene,Lenoir]*
1729                28,284               [Michigan]                           [Wayne, Macomb, Oakland]
1735                83,953                [Florida]                        [Seminole| Orange,Seminole]*
11                  91,995                   [Utah]                        [Utah,Salt Lake| Salt Lake]*
363                156,426             [California]           [San Francisco,San Mateo| San Francisco]*
1108               254,089              [Minnesota]              [Scott,Dakota | Scott,Hennepin,Dakota]*
788              1,500,000             [New Jersey]  [ Essex,Hudson,Passaic,Bergen | Essex,Hudson,Union,Passaic,Bergen ]*

* Some of the county_fips for a given sampling location contain multiple counties, so there's some weird redundancy here. | separates sampling sites
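A rough sketch of how such a check could be reproduced (not the exact code I ran; column names and types are assumed from the public metrics dataset):

```python
# Sketch: find wwtp_ids whose sampling locations are tied to more than one
# county. Assumes `county_fips` is a comma-separated FIPS string.
import pandas as pd

def plants_with_multiple_counties(df_metric: pd.DataFrame) -> pd.DataFrame:
    cols = ["wwtp_id", "wwtp_jurisdiction", "population_served",
            "county_names", "county_fips"]
    df = df_metric[cols].drop_duplicates()
    per_plant = df.groupby("wwtp_id").agg(
        population_served=("population_served", "max"),
        wwtp_jurisdiction=("wwtp_jurisdiction", lambda s: sorted(set(s))),
        county_names=("county_names", lambda s: sorted(set(s))),
        n_counties=("county_fips",
                    lambda s: len({f.strip() for v in s for f in str(v).split(",")})),
    )
    return per_plant[per_plant["n_counties"] > 1].sort_values("population_served")
```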

Reading more carefully, I guess the actual key (which they provide as a separate column) for the metrics dataset is (wwtp_jurisdiction, wwtp_id, sample_location_specify, sample_matrix), with the last two only used if the sample is taken upstream of the treatment plant. They refer to this as a "sewershed". The other dataset uses the same key, but doesn't provide the sub-fields separately.

What percent of wastewater treatment plants or of the US population does this dataset cover?

Population is fairly easy to calculate: 207,990,158/335,598,849≅62%. There are a total of 1538 wastewater treatment plants and 1735 sampling locations. No idea about the denominator though.

Something I'm not sure what to make of: looking for a count of unique population_served in each county when there's more than one plant there (as a way to see whether or not they're allocating the population of a county into the sewersheds), I get somewhat inconsistent counts:

             population_served  key_plot_id
county_fips
26075                        1            2
20057                        1            2
18177                        1            2
18167                        1            2
18157                        1            2
...                        ...          ...
26139                       13           13
04013                       20           13
17031                       21           21
49043                       22            2
48201                       29           30

So county 26075 has the same population_served value for each of its key_plot_ids, so either they just evenly divided the total or used the total county population for each; I'd need to check against a by-county population record to figure out which it is. This might let us use the zip code approach that @nmdefries mentioned we use on another data source. Then again, the fact that 04013 has 20 unique population_served values but only 13 keys suggests that something weird is going on.
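For anyone who wants to reproduce this, a sketch of the kind of count I mean (column names assumed; not the exact code from my notebook):

```python
# Sketch: for counties with more than one sampling location, compare the
# number of distinct population_served values to the number of key_plot_ids.
import pandas as pd

def county_population_consistency(df_metric: pd.DataFrame) -> pd.DataFrame:
    df = df_metric[["county_fips", "key_plot_id", "population_served"]].copy()
    # One county per row; county_fips may list several comma-separated counties.
    df["county_fips"] = df["county_fips"].str.split(",")
    df = df.explode("county_fips")
    counts = df.groupby("county_fips").agg(
        population_served=("population_served", "nunique"),
        key_plot_id=("key_plot_id", "nunique"),
    )
    return counts[counts["key_plot_id"] > 1].sort_values("key_plot_id")
```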

If there are particular questions we want to ask about the dataset, I'd be happy to write some code for that. I've got a notebook (technically an Emacs Org notebook, which I still need to export to a more shareable format) where I'm collecting analysis of this dataset. Any suggestions for where to store that?

dsweber2 commented 11 months ago

Talking with @carlynvandyke, I'm wondering if we want to skip over the county complications for now and just do state-level data, which should be fairly straightforward. Though the change to Verily seems like it may be causing some difficulties with the data.

willtownes commented 11 months ago

I am fine with starting out at the state level but hope that eventually we can figure out a county-level strategy. We could split off the county-level stuff into a separate, lower priority issue to be worked on when there is time/interest. I wouldn't want that to block getting some kind of wastewater signal into the API faster even if it isn't the ideal resolution.

Regarding the transition to Verily, yes it will probably be a significant change from the Biobot protocol. Anna did some exploratory investigations and found that the historical Verily (=SCAN) and Biobot numbers were only weakly correlated in places where they both sampled simultaneously.

nmdefries commented 11 months ago

historical Verily (=SCAN) and Biobot numbers were only weakly correlated

In that case, should we report the Verily and Biobot values under different signal names?

willtownes commented 11 months ago

In that case, should we report the Verily and Biobot values under different signal names?

I don't think so. Even in the original version of NWSS, which was run by Biobot, there were SCAN and other non-Biobot records included. In the key_plot_id, where it says WWS I think that indicates wastewater SCAN. We should refer to this dataset as "NWSS", not as Biobot or Verily/SCAN. This is because later on we may be able to separately include data provided directly by either Biobot or SCAN that is not subject to the NWSS restrictions and/or that has additional information. It would probably be helpful to include a note in the metadata that there was a change in contractor in October 2023, in case future users are confused about any discrepancies.

dsweber2 commented 11 months ago

Oh, looking at the metric data: at the state level, should we be using a population-weighted average for the percentile and ptc_15d columns?

Also, do we want to include the detect_prop_15d column? It is described as:

The proportion of tests with SARS-CoV-2 detected, meaning a cycle threshold (Ct) value <40 for RT-qPCR or at least 3 positive droplets/partitions for RT-ddPCR, by sewershed over the 15-day window defined by 'date_start' and "date_end'. The detection proportion is the percent calculated by dividing the 15-day rolling sum of SARS-CoV-2 detections by the 15-day rolling sum of the number of tests for each sewershed and multiplying by 100

Not sure if this is the column you were referring to when you were talking about copy number.

Oh, and on the raw values:

In the concentration dataset we have a smoothed, normalized metric called pcr_conc_smoothed. The normalization is not the same across sites; some of the values are normalized using human fecal biomarkers (usually PMMoV).

Reflecting on this, I guess this dataset can't really be aggregated in a meaningful way, so it will have to be reported using a new geo-type at the sampling-location level.

willtownes commented 11 months ago

I think it's OK to aggregate the concentration values to the state level using population weights, but not the metric values (percentile, percent change). Metric values should probably just have the new geo-type. The detect_prop_15d column is not something I myself have used much, but it may be of interest to the wider community. I think it would be most helpful in cases where infection levels are low and you care about presence/absence more than quantification.
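Something like the following is what I have in mind for the state-level aggregation of the concentration values (just a sketch; it assumes population_served and wwtp_jurisdiction can be joined onto the concentration rows via key_plot_id, which would need to be confirmed):

```python
# Sketch: population-weighted mean of pcr_conc_smoothed by state and date.
# site_info is assumed to hold one row per key_plot_id with
# wwtp_jurisdiction and population_served taken from the metrics dataset.
import pandas as pd

def state_weighted_mean(df_conc: pd.DataFrame, site_info: pd.DataFrame) -> pd.DataFrame:
    df = df_conc.merge(site_info, on="key_plot_id", how="left")
    df = df.dropna(subset=["pcr_conc_smoothed", "population_served"]).copy()
    df["weighted"] = df["pcr_conc_smoothed"] * df["population_served"]
    out = df.groupby(["wwtp_jurisdiction", "date"]).agg(
        weighted_sum=("weighted", "sum"),
        pop=("population_served", "sum"),
    )
    out["pcr_conc_smoothed"] = out["weighted_sum"] / out["pop"]
    return out[["pcr_conc_smoothed"]].reset_index()
```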

dsweber2 commented 11 months ago

Oh interesting, could you talk a bit more about why the metric values shouldn't be aggregated weighted on population? I would've expected the fact that they're all on the same scale (%) to mean they're more comparable, but I'm not that familiar with the data, so I've probably misunderstood something. I had thought the problem was with the concentration values, where the relationship to the actual number of cases varies by location.

willtownes commented 11 months ago

The way the percentile is defined is the problem. For each site, they find the maximum historical concentration and divide all other concentration values by that number. The problem is that the max value is not the same across sites, so the interpretation of the percentile is not comparable across sites. It doesn't make sense IMO to average across sites to get, e.g., a state-wide number. It would be better if they had divided by the maximum across all sites instead of the max for each site separately.
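A toy example with made-up numbers to illustrate: two sites with very different absolute concentrations end up with identical "percentile" series because each is scaled by its own historical maximum, so averaging them tells you little about the combined level.

```python
# Toy example (made-up numbers): per-site max scaling hides the fact that
# site B's wastewater signal is ten times higher than site A's.
conc_a = [10, 50, 100]      # site A concentrations; max = 100
conc_b = [100, 500, 1000]   # site B concentrations; max = 1000

pct_a = [100 * x / max(conc_a) for x in conc_a]  # [10.0, 50.0, 100.0]
pct_b = [100 * x / max(conc_b) for x in conc_b]  # [10.0, 50.0, 100.0]

# Both sites end their series at "100", so a population-weighted average of
# the two percentile series says nothing about the combined concentration.
print(pct_a, pct_b)
```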

dsweber2 commented 11 months ago

So I have weird news: it looks like there are frequent changes in the pcr_conc_smoothed data. Here's an example from June 2022 that has changes in roughly the 4th digit about every 2 days:

                                        key_plot_id        date  pcr_conc_smoothed    normalization                      issue
253544   NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186360700.0  flow-population 2023-10-17 14:52:28.829755
709853   NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186227500.0  flow-population 2023-10-18 17:03:56.272893
899174   NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186227400.0  flow-population 2023-10-20 14:08:36.830290
1080860  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186384700.0  flow-population 2023-10-23 11:08:09.292074
1183394  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186411400.0  flow-population 2023-10-24 02:08:09.872367
1324400  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186537900.0  flow-population 2023-10-27 02:08:11.826278
1614668  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186537800.0  flow-population 2023-10-30 02:08:10.331810
1793186  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186614100.0  flow-population 2023-10-31 02:08:08.438599
1978867  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186638000.0  flow-population 2023-11-03 02:08:22.196753
2256546  NWSS_wi_231_Treatment plant_raw wastewater  2022-06-27        186602400.0  flow-population 2023-11-07 02:08:07.873418

Around 75% of days have some sort of revision behavior going on. I'm not sure if this is noise or what exactly. Going to dig into it more; just wanted to flag this to get your thoughts @willtownes
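For context, the kind of check I'm running to quantify this is roughly the following (a sketch, assuming two snapshots of the concentration dataset keyed by key_plot_id and date):

```python
# Sketch: fraction of (key_plot_id, date) rows whose pcr_conc_smoothed value
# changes between two consecutive snapshots of the concentration dataset.
import pandas as pd

def revision_fraction(old: pd.DataFrame, new: pd.DataFrame) -> float:
    key = ["key_plot_id", "date"]
    merged = old.merge(new, on=key, suffixes=("_old", "_new"))
    changed = merged["pcr_conc_smoothed_old"] != merged["pcr_conc_smoothed_new"]
    return changed.mean()
```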

willtownes commented 10 months ago

That's really weird. I would recommend sending a message to John Person via the "contact dataset owner" button here. Hopefully they can tell us what's going on. I would not have expected that much revision to be happening.

dsweber2 commented 10 months ago

In case anyone besides Will is following this: after emailing them, it turns out these numbers are post-cubic-spline interpolation.

Yes, we do expect to see this much variation in the smoothed PCR concentrations. The concentrations shown are predictions from a cubic spline regression, and therefore they change slightly every time new data is uploaded.

nmdefries commented 10 months ago

That's odd.... And also means that this source will have a lot of revisions and thus use a lot of space in our DB, even using the archive differ. Do we need to worry about that? @korlaxxalrok

willtownes commented 10 months ago

Another frustrating consequence of them not providing raw values. They did say in the email that they hope to provide the reported concentrations in the near future. If that becomes available, I would favor using it instead, and either deprecating the smoothed concentrations or shifting them to be a "derived" signal that we compute ourselves from the reported concentrations, which would not require saving all the revisions. Of course, we have no idea how long it will be before that becomes available.

dsweber2 commented 10 months ago

Hopefully the faster-than-expected turnaround indicates this is actually getting some priority, so we could provide the rawest form of the concentrations.

@melange396 and I looked at the number of rows added per day, and the fine grained version adds ~100k/day, which was on par with many other sources (hhs being kind of the odd one out at the ~1m level). Kinda wish I'd taken notes. Also, if I understood him correctly, the archive differ is for cases where there's a small amount of revision, and since ~75% are getting revised in some way*, the differ would mostly end up doing useless churn at this level.

Of course, this is assuming we actually include the data exactly as they provide it. I asked them for method details, and we could probably round to the number of sig digits preserved numerically by whatever spline they're using.

* this is still a pretty weird number of revisions, since I would expect ~100% given that it's numeric noise. I guess the 7 digit rounding I was doing caught a significant fraction.
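If rounding does turn out to be acceptable, the significant-digit rounding could look something like this (a sketch; the choice of 4 digits is a placeholder until we hear back about their method):

```python
# Sketch: round to a fixed number of significant digits before diffing, so
# spline-refit jitter in the low digits doesn't register as a revision.
import numpy as np

def round_sig(x, sig_digits=4):
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    nonzero = (x != 0) & np.isfinite(x)
    exponent = np.floor(np.log10(np.abs(x[nonzero])))
    factor = 10.0 ** (sig_digits - 1 - exponent)
    out[nonzero] = np.round(x[nonzero] * factor) / factor
    out[~np.isfinite(x)] = x[~np.isfinite(x)]
    return out

# e.g. round_sig([186360700.0, 186227500.0]) -> [186400000.0, 186200000.0]
```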

dsweber2 commented 10 months ago

Oh, follow up from this morning: I confirmed that in the dataset, only one sampling procedure is used at each location, and that they don't sample both upstream and at the wastewater treatment plant.

dsweber2 commented 8 months ago

So I've been doing some more digging since it's on staging, and I noticed a couple of strange things. First, there are a number of values that are many orders of magnitude smaller over a short period:

+--------------+-----------+----------+------------+-------------------+
| epimetric_id | geo_value | issue    | time_value | value             |
+--------------+-----------+----------+------------+-------------------+
|   5513019730 | ar        | 20240111 |   20231124 |            0.0019 |
|   5513019779 | ar        | 20240111 |   20231125 |            0.0021 |
|   5513019828 | ar        | 20240111 |   20231126 |            0.0023 |
|   5513019877 | ar        | 20240111 |   20231127 |  35063584.8527016 |
|   5513019926 | ar        | 20240111 |   20231128 |  42082598.4041872 |
|   5513019975 | ar        | 20240111 |   20231129 |  42168322.4112862 |

Turns out this is the date that this geo went from only having WWS at one site to having more; there's a similar drastic change in September when Biobot was dropped. [attached figure: time series around the provider change] Doing a couple of histograms of each source by normalization scheme [attached figure: histograms by normalization scheme], I'm beginning to suspect we may need to make separate signals based on provider and normalization scheme.
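If we do split, a sketch of how the provider/normalization grouping could be derived from existing columns (assuming the provider is the key_plot_id prefix, e.g. NWSS vs WWS):

```python
# Sketch: derive (provider, normalization) groups from existing columns.
# Assumes the provider is encoded as the key_plot_id prefix ("NWSS", "WWS").
import pandas as pd

def split_by_provider_and_normalization(df_conc: pd.DataFrame) -> dict:
    df = df_conc.copy()
    df["provider"] = df["key_plot_id"].str.split("_").str[0]
    return {
        (provider, norm): group
        for (provider, norm), group in df.groupby(["provider", "normalization"])
    }
```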

Another thing I found in the original data while investigating that: 14,182 negative values.

df_concentration[(df_concentration.pcr_conc_smoothed < 0) & ~(df_concentration.pcr_conc_smoothed >= 0)].sort_values(by = "pcr_conc_smoothed")
                                              key_plot_id  timestamp  pcr_conc_smoothed    normalization
16739   NWSS_az_1055_Before treatment plant_83_raw was... 2022-10-07      -9.979245e+09  flow-population
17165   NWSS_az_1055_Before treatment plant_83_raw was... 2022-07-05      -9.834369e+09  flow-population
16746   NWSS_az_1055_Before treatment plant_83_raw was... 2022-10-16      -9.813554e+09  flow-population
17168   NWSS_az_1055_Before treatment plant_83_raw was... 2022-07-13      -9.599210e+09  flow-population
16694   NWSS_az_1055_Before treatment plant_83_raw was... 2022-07-04      -9.527362e+09  flow-population
...                                                   ...        ...                ...              ...
301594  WWS_ca_2404_Treatment plant_569_post grit removal 2023-03-12      -1.000000e-04        microbial
301593  WWS_ca_2404_Treatment plant_569_post grit removal 2023-03-10      -1.000000e-04        microbial
296134  WWS_ca_1872_Before treatment plant_468_primary... 2022-11-06      -1.000000e-04        microbial
321140      WWS_nc_1894_Treatment plant_post grit removal 2023-05-11      -1.000000e-04        microbial
659197      WWS_md_1782_Treatment plant_post grit removal 2023-04-30      -1.000000e-04        microbial

[14182 rows x 4 columns]

That's out of ~6 million values; I would write off the e-04 values as approximately zero, but the e+09 values are concerning.
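One possible handling, purely as a sketch and not a decision: clip the tiny negatives to zero and flag the large-magnitude ones for follow-up (the -1.0 cutoff is arbitrary and would need discussion).

```python
# Sketch: treat small negatives as zero, set implausibly large negatives to NA
# so they can be followed up on rather than silently published.
import numpy as np
import pandas as pd

def clean_negatives(df_conc: pd.DataFrame, cutoff: float = -1.0) -> pd.DataFrame:
    df = df_conc.copy()
    tiny = df["pcr_conc_smoothed"].between(cutoff, 0, inclusive="left")
    df.loc[tiny, "pcr_conc_smoothed"] = 0.0
    large = df["pcr_conc_smoothed"] < cutoff
    df.loc[large, "pcr_conc_smoothed"] = np.nan
    return df
```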

dsweber2 commented 8 months ago

Oh, follow-up from a discussion with Will: here's a time series plot of a problematic example [attached figure; the fill goes to 0]. So it is smoothly going negative, and then has some weird gaps.