Closed melange396 closed 1 month ago
Just kidding! Those "n/a" values were not actually removed in the source spreadsheet nor in #1434 -- i inadvertently stripped them due to the way i imported the csv files... I edited the above message to strikethrough the irrelevant text.
Here is some code that you can paste into a python interpreter to see the (correct) list of differences:
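import pandas as pd
# raw csv for a given branch of the repo
base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'
# na_filter=False keeps literal "n/a" strings instead of converting them to NaN
current = pd.read_csv(base_url.format('dev'), na_filter=False)
proposed = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)
# columns that only exist in the proposed file
new_cols = set(proposed.columns) - set(current.columns)
print(new_cols)
# cell-by-cell comparison of the shared columns (needs both files to have the same rows in the same order)
non_matching = (proposed[current.columns] != current)
diffs_per_col = non_matching.sum()
print(diffs_per_col)
# list the signals whose 'Time Type' differs
mismatched_time = pd.concat([current[['Source Subdivision', 'Signal']], non_matching[['Time Type']]], axis=1)
print(mismatched_time[mismatched_time['Time Type']])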
and then the csv in this PR was produced by following the above code snippet with this:
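import os
# .copy() so the assignment below changes a new frame rather than a slice of `proposed`
intermediate = proposed[current.columns].copy()
# keep the existing 'Available Geography' values
intermediate['Available Geography'] = current['Available Geography']
intermediate.to_csv('intermediate.csv', index=False)
# two passes: adjacent booleans share a comma, so a single pass misses every other one
for _ in range(2):
    os.system("sed -i 's/,False,/,FALSE,/g' intermediate.csv")
    os.system("sed -i 's/,True,/,TRUE,/g' intermediate.csv")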
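For what it's worth, the same uppercase booleans could probably be produced without shelling out to sed. The following is only a sketch: the helper name `to_csv_upper_bools` and the column names in the usage lines are made up for illustration, and it assumes the TRUE/FALSE columns were parsed by pandas as `bool` dtype. Unlike the sed commands above, it would also convert a boolean in the first or last column.

```python
import io

import pandas as pd


def to_csv_upper_bools(df, path_or_buf):
    """Write df as csv with bool columns spelled TRUE/FALSE (hypothetical helper)."""
    out = df.copy()  # leave the caller's frame untouched
    for col in out.select_dtypes(include="bool").columns:
        out[col] = out[col].map({True: "TRUE", False: "FALSE"})
    out.to_csv(path_or_buf, index=False)


# usage with made-up columns; a file path works in place of the buffer
buf = io.StringIO()
example = pd.DataFrame({"Signal": ["a", "b"], "flag_x": [True, False], "flag_y": [False, False]})
to_csv_upper_bools(example, buf)
print(buf.getvalue())
```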
the source data in the google sheet has changed since this was done; closing this PR to create a new one...
This PR is derived from #1434; i removed all of the new columns but this should include all of the changes to the existing columns (except Available Geography, more on that in a bit).
Please let me know if we need to fix any of these -- the summary of differences appears to me to be:
"day" removed from Time Type column, replaced with empty string (this seems like it was accidental):
dsew-cpr:confirmed_admissions_covid_1d_7dav
"n/a" removed from Pathogen/Disease Area column, replaced with empty string (these seem intentional):
nchs-mortality:deaths_allcause_incidence_num
nchs-mortality:deaths_allcause_incidence_prop
nchs-mortality:deaths_percent_of_expected
safegraph-daily:completely_home_prop
safegraph-daily:completely_home_prop_7dav
safegraph-daily:full_time_work_prop
safegraph-daily:full_time_work_prop_7dav
safegraph-daily:median_home_dwell_time
safegraph-daily:median_home_dwell_time_7dav
safegraph-daily:part_time_work_prop
safegraph-daily:part_time_work_prop_7dav
safegraph-weekly:bars_visit_num
safegraph-weekly:bars_visit_prop
safegraph-weekly:restaurants_visit_num
safegraph-weekly:restaurants_visit_prop
The Available Geography column has some sweeping changes applied to it... In one example from chng, the text was modified from "county,hhs,hrr,msa,nation,state" to "county, hrr (by Delphi), msa (by Delphi), state (by Delphi), hhs (by Delphi), nation (by Delphi)". I believe this signifies that only county data came from the source, and we computed the various other higher levels of geo aggregation. This is valuable information, but i would suggest we keep the column the way it was and create a new column called something like "Geographies aggregated by Delphi" or "Post-aggregated geographies" that lists the geography types that were extrapolated by us. There are a few reasons for doing it this way, including that (i believe) the Signal Documentation app expects the structured comma-separated text without the extra annotations as it was before, and that representing the same information in its own column should save some space. If you agree with this, let me know as i think i should be able to apply those changes pretty easily. Also, some entries (like quidel for instance) have "(by Delphi)" attached to every geography in the list; that suggests to me that we did aggregations to produce county-level data from finer-grained locations, but i didn't think that was the case.