Scalabull / get-tested-covid19

Open source code for community-driven, US-focused COVID-19 test locator database.
https://get-tested-covid19.org
MIT License
7 stars · 18 forks

Automating the processing and deployment of testSites.json #21

Closed · zboldyga closed this issue 4 years ago

zboldyga commented 4 years ago

The test site 'database' is just a JSON file that loads into the website user's browser via AJAX. We are storing this file at: https://storage.googleapis.com/covid19-resources/testSites.json .

This 'database' is stored in a Google Sheet. We would like to keep it in the Google Sheet, for several reasons:

To turn this spreadsheet data into JSON at the above link, we currently do the following steps:

It would be great to automate this process.

If you want to implement this, just comment below!

abshomali commented 4 years ago

I should be able to implement this. I'm thinking we add a column to the sheet that reviewers can update with a value indicating whether an entry is approved. I'll write a Python script that uses the Sheets API to check that column and retrieve only the approved entries. There's no need to export and parse the whole thing, so I don't expect the cron job to be a problem. The script would then format that data into JSON and upload it to GCS with the desired caching configs. I haven't done that last bit, but I'm sure I can figure it out if I read their docs. I'm a data analyst/scientist, not an engineer, so please let me know if I'm overlooking anything.
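The approval filter described above could look something like this. It's a sketch under assumptions from this thread: the approval flag lives in one column, "1" means approved, and rows arrive as lists of strings (as the Sheets API returns them). The helper name `approved_rows` is mine.

```python
def approved_rows(rows, flag_col=1):
    """Return only the rows whose approval column contains the string "1".

    `rows` is a list of lists of strings, as the Sheets API returns sheet
    values. Rows too short to have the flag column are treated as unreviewed.
    """
    return [row for row in rows if len(row) > flag_col and row[flag_col] == "1"]


rows = [
    ["2020-03-27 10:00:00", "1", "Site A"],   # approved
    ["2020-03-27 10:05:00", "", "Site B"],    # not yet reviewed
    ["2020-03-27 10:10:00", "0", "Site C"],   # rejected
]
print(approved_rows(rows))  # only Site A's row survives
```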

zboldyga commented 4 years ago

@abshomali Here's the JSON currently hosted at GCS: https://storage.googleapis.com/covid19-resources/testSites.json .

For clarity, here are the JSON fields and their corresponding column numbers (0-indexed):

const currSite = {
        name: col[2],
        address: col[3],
        city: col[4],
        state: col[5],
        zip: col[6],
        lat: parseFloat(col[7]),
        lng: parseFloat(col[8]),
        phone: col[9],
        hours: col[10],
        days: col[11],
        opPeriod: col[12],
        link: col[13],
        appReq: col[14],
        docScreen: col[15],
        driveThru: col[16],
        walkUp: col[17],
        description: col[18],
        estCap: col[19],
        comments: col[20]
    }

Column 0 is the timestamp at which the entry was inserted. Column 1 is the 'Accepted' column, which will contain '1' if the row is accepted.
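Since the plan is a Python script, the JS mapping above translates directly. The field names and column indices below come straight from the snippet; the helper name `row_to_site` and the padding of short rows with empty strings are my own choices.

```python
# One (field name, sheet column index) pair per field in the JS snippet above.
FIELDS = [
    ("name", 2), ("address", 3), ("city", 4), ("state", 5), ("zip", 6),
    ("lat", 7), ("lng", 8), ("phone", 9), ("hours", 10), ("days", 11),
    ("opPeriod", 12), ("link", 13), ("appReq", 14), ("docScreen", 15),
    ("driveThru", 16), ("walkUp", 17), ("description", 18),
    ("estCap", 19), ("comments", 20),
]

FLOAT_FIELDS = {"lat", "lng"}  # parseFloat() in the JS version


def row_to_site(col):
    """Convert one accepted sheet row (list of strings) into a site dict."""
    site = {}
    for field, idx in FIELDS:
        value = col[idx] if idx < len(col) else ""
        site[field] = float(value) if field in FLOAT_FIELDS and value else value
    return site
```

The full JSON file would then just be `json.dumps([row_to_site(r) for r in accepted_rows])`.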

If you write a Python script that can be run from the command line, I can write a small script to start up a cron job, and I can add to the package.json to deploy this in the CI/CD process.

zboldyga commented 4 years ago

Additionally, there is a bit of data cleanup that would be really handy!

zboldyga commented 4 years ago

Would you be OK with using your own credentials for GCS to start? I was thinking about this just now; I will have to set up two environments for staging and prod, add some configs in the bootup scripts to separate this, etc. Might take some tinkering on my end, don't want to hold you up!

If that works, then when you are ready to deploy I can plug the credentials into the project and configure the right settings for the staging and prod deployments.

abshomali commented 4 years ago

This is great, thanks for sharing the JSON details. Yup, I'll set up my own GCS bucket and test against it; that'll make testing easier too. I'll look into doing data validation as well. I'll comment here if I have other questions.
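The data validation mentioned here is unspecified in the thread, but for site records like these a few sanity checks suggest themselves. A hedged sketch; which checks matter (state code, ZIP format, lat/lng ranges) is my guess, and `validate_site` is a name I made up:

```python
import re

STATE_RE = re.compile(r"^[A-Z]{2}$")          # two-letter US state code
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")      # 5-digit ZIP, optional +4


def validate_site(site):
    """Return a list of problems found in one site dict (empty == clean)."""
    problems = []
    if not site.get("name"):
        problems.append("missing name")
    if not STATE_RE.match(site.get("state", "")):
        problems.append("state should be a 2-letter code")
    if not ZIP_RE.match(site.get("zip", "")):
        problems.append("zip should be 5 digits (or ZIP+4)")
    lat, lng = site.get("lat"), site.get("lng")
    if not (isinstance(lat, float) and -90 <= lat <= 90):
        problems.append("lat out of range")
    if not (isinstance(lng, float) and -180 <= lng <= 180):
        problems.append("lng out of range")
    return problems
```

Rows that fail could be skipped (and logged) rather than shipped to the site, so one bad sheet entry doesn't break the map.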

abshomali commented 4 years ago

Can you give some context on the step where you "Manually adjust the cache rules on the file so that the file expires in browser caches every 7200 seconds"? It seems like this should be a characteristic of each blob, but I can't find how to set it, so if you have any guesses as to the name of the attribute I need to set, that would be helpful. Or just link me to an article that explains it, if you know of one. Thanks!

zboldyga commented 4 years ago

@abshomali Actually, I was just reading about the cache-control metadata tag here:

https://cloud.google.com/storage/docs/metadata?hl=en_US&_ga=2.147992055.-1345538864.1585370861#cache-control

It looks like the default is "public, max-age=3600", which means we don't need to do anything! The objects will automatically expire from browser caches after 3600 seconds (1 hour), so we don't have to worry much about users having stale data in their browsers :)
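For reference, the cache lifetime follows directly from the max-age directive in that header. A tiny illustration of the relationship; the parsing helper is mine, not a GCS API:

```python
def max_age_seconds(cache_control):
    """Extract the max-age value (in seconds) from a Cache-Control header."""
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return int(directive[len("max-age="):])
    return None


print(max_age_seconds("public, max-age=3600"))  # 3600, i.e. one hour
```

So GCS's default of max-age=3600 already gives the 1-hour browser-cache expiry described above; the 7200-second step from the original process is unnecessary.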