cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License
100 stars 62 forks source link

Pending updates to columns in db_signals.csv #1442

Closed melange396 closed 1 month ago

melange396 commented 1 month ago

This PR is derived from #1434; i removed all of the new columns but this should include all of the changes to the existing columns (except Available Geography, more on that in a bit).

Please let me know if we need to fix any of these -- the summary of differences appears to me to be:

The Available Geography column has some sweeping changes applied to it... In one example from chng, the text was modified from county,hhs,hrr,msa,nation,state to county, hrr (by Delphi), msa (by Delphi), state (by Delphi), hhs (by Delphi), nation (by Delphi). I believe this signifies that only county data came from the source, and we computed the various other higher levels of geo aggregation. This is valuable information, but i would suggest we keep the column the way it was and create a new column called something like Geographies aggregated by Delphi or Post-aggregated geographies that lists the geography types that were extrapolated by us. There are a few reasons for doing it this way, including that (i believe) the Signal Documentation app expects the structured comma-separated text without the extra annotations as it was before, and that representing the same information in its own column should save some space. If you agree with this, let me know as i think i should be able to apply those changes pretty easily. Also, some entries (like quidel for instance) have "(by Delphi)" attached to every geography in the list; that suggests to me that we did aggregations to produce county-level data from finer-grained locations, but i didn't think that was the case.

melange396 commented 1 month ago

Just kidding! Those n/a values were not actually removed in the source spreadsheet nor in #1434 -- i inadvertently stripped them due to the way i imported the csv files... I edited the above message to strikethrough the irrelevant text.

melange396 commented 1 month ago

Here is some code that you can paste into a python interpreter to see the (correct) list of differences:

import pandas as pd

base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'

current = pd.read_csv(base_url.format('dev'), na_filter=False)
proposed = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)

new_cols = set(proposed.columns) - set(current.columns)
print(new_cols)

non_matching = (proposed[current.columns] != current)
diffs_per_col = non_matching.apply(sum)
print(diffs_per_col)

mismatched_time = pd.concat([current[['Source Subdivision', 'Signal']], non_matching[['Time Type']]], axis=1)
print(mismatched_time[mismatched_time['Time Type']])
sonarcloud[bot] commented 1 month ago

Quality Gate Passed Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

melange396 commented 1 month ago

and then the csv in this PR was produced by following the above code snippet with this:

intermediate = proposed[current.columns]
intermediate['Available Geography'] = current['Available Geography']
intermediate.to_csv('intermediate.csv', index=False)
import os
for _ in range(2):
    os.system("sed -i 's/,False,/,FALSE,/g' intermediate.csv")
    os.system("sed -i 's/,True,/,TRUE,/g' intermediate.csv")
melange396 commented 1 month ago

the source data in the google sheet has changed since this was done; closing this PR to create a new one...