cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License

incorporating other changes prior to new column importation #1450

Closed melange396 closed 1 month ago

melange396 commented 1 month ago

Based on #1449, with techniques similar to those used in #1442 ...

This adds some previously missing signal "Description" fields, replaces {}-style variable substitution in other "Description" fields with appropriate literal text, and adds previously missing "Available Geography" fields.
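For illustration only (the placeholder name and wording below are hypothetical, not taken from the actual metadata), the {}-style substitution being removed works roughly like this:

# hypothetical example; the real templates and wording live in db_signals.csv
template = "Estimated percentage of people reporting {indicator} in the past 7 days"
print(template.format(indicator="COVID-like illness"))
# after this change, the fully written-out sentence is stored directly in the
# "Description" column, so no placeholder expansion is needed downstream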

After this is merged, I will run the google-doc synchronization GH action once more, and then verify that it is only adding new fields.
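A quick post-merge check along those lines might look like the following sketch (assuming the next sync run again writes its output to the bot/update-docs branch):

import pandas as pd

base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'
merged = pd.read_csv(base_url.format('dev'), na_filter=False)
synced = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)

# the sync should only introduce new columns...
print("columns added by the sync:", sorted(set(synced.columns) - set(merged.columns)))
# ...while leaving every pre-existing cell untouched
assert synced[merged.columns].equals(merged)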

sonarcloud[bot] commented 1 month ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

melange396 commented 1 month ago

This code, similar to that in #1442, verifies and extracts the changes to existing fields:

import os

import pandas as pd

base_url = 'https://github.com/cmu-delphi/delphi-epidata/raw/{}/src/server/endpoints/covidcast_utils/db_signals.csv'

current = pd.read_csv(base_url.format('dev'), na_filter=False)
proposed = pd.read_csv(base_url.format('bot/update-docs'), na_filter=False)

# this code assumes columns have not been removed or renamed,
# and that no new rows have been added or had their ordering changed

new_cols = set(proposed.columns) - set(current.columns)
print("new cols:", sorted(new_cols))

# boolean frame marking every cell that differs between the two versions,
# plus a per-column count of those differences
non_matching = (proposed[current.columns] != current)
diffs_per_col = non_matching.sum()
print(diffs_per_col)

# list the (source, signal) pairs whose "Description" text changed
mismatched_descriptions = pd.concat([current[['Source Subdivision', 'Signal']], non_matching[['Description']]], axis=1)
print("Descriptions changed:")
print(mismatched_descriptions[mismatched_descriptions['Description']])

# alpha-sort each "Available Geography" list so the comparison between csvs is order-insensitive:
current_ag_norm = current['Available Geography'].apply(lambda g: ','.join(sorted(g.split(','))))
proposed_ag_norm = proposed['Available Geography'].apply(lambda g: ','.join(sorted(g.split(','))))
diff_geos = (current_ag_norm != proposed_ag_norm)
print("Number of rows with different geographies:", sum(diff_geos))
mismatched_geos = pd.concat([current[['Source Subdivision', 'Signal']], diff_geos], axis=1)
print(mismatched_geos[mismatched_geos['Available Geography']])
# This shows that the 44 rows with different geographies are exactly the rows
# whose "Available Geography" is currently empty:
print(
    (current.iloc[380:394]['Available Geography'] == '').all(),
    (current.iloc[419:449]['Available Geography'] == '').all(),
    (394 - 380) + (449 - 419) == 44,
)

# build the intermediate csv: start from `current`, then pull in the updated
# "Description" column and the newly filled-in geographies from `proposed`
intermediate = current.copy()
geo_rows = list(range(380, 394)) + list(range(419, 449))
intermediate.loc[geo_rows, 'Available Geography'] = proposed.loc[geo_rows, 'Available Geography']
intermediate['Description'] = proposed['Description']
intermediate.to_csv('intermediate.csv', index=False)
# restore the upper-case TRUE/FALSE spelling used in the checked-in csv;
# the second pass is needed because adjacent boolean columns share a comma,
# so a single sed pass can only rewrite every other occurrence
for _ in range(2):
    os.system("sed -i 's/,False,/,FALSE,/g' intermediate.csv")
    os.system("sed -i 's/,True,/,TRUE,/g' intermediate.csv")