First attempt:
I first tried to refactor the capital spending scraping process to use aiohttp to query the CheckbookNYC API concurrently. It still timed out, and since most requests return very few records, there was no real performance advantage. Several queries also failed outright and their results were never captured.
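For reference, a minimal sketch of the concurrent-request approach that was tried. The endpoint constant and the request payloads here are assumptions; the real CheckbookNYC request format is not reproduced.

import asyncio
import aiohttp

CHECKBOOK_API_URL = "https://www.checkbooknyc.com/api"  # assumed endpoint

async def fetch_one(session: aiohttp.ClientSession, payload: str) -> str:
    # POST one search request; most of these return only a handful of
    # records, so firing them concurrently saved very little time.
    async with session.post(CHECKBOOK_API_URL, data=payload) as resp:
        resp.raise_for_status()
        return await resp.text()

async def fetch_all(payloads: list[str]) -> list:
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [fetch_one(session, p) for p in payloads]
        # return_exceptions=True surfaces failed queries instead of
        # silently dropping their results.
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(fetch_all(["<request>...</request>"]))  # placeholder payload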
Second try:
Use an Airflow DAG to request and store every single day of CheckbookNYC capital spending data.
Store the data to GCS -> BigQuery.
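A rough sketch of what that daily DAG could look like, assuming a hypothetical bucket name and leaving the actual CheckbookNYC fetch logic stubbed out:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

def fetch_day(ds: str, **_) -> None:
    """Request one day of capital spending data and upload it to GCS."""
    ...  # call the CheckbookNYC API for date `ds`, write a CSV to the bucket

with DAG(
    dag_id="checkbooknyc_capital_spending",
    start_date=datetime(2010, 7, 1),
    schedule_interval="@daily",
    catchup=True,  # backfill every single day of history
) as dag:
    fetch = PythonOperator(task_id="fetch_day", python_callable=fetch_day)

    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="checkbooknyc-raw",  # assumed bucket name
        source_objects=["capital_spending/{{ ds }}.csv"],
        destination_project_dataset_table="checkbooknyc_capital_spending.{{ ds_nodash }}",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    fetch >> load

One table per day keeps the loads idempotent and matches the wildcard table pattern used in the query below.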
The current capital spending workflow pulls the latest fisa_capitalcommitments, stores it in BigQuery, and then uses a query to select all the relevant records from BigQuery.
Test query here:
SELECT DISTINCT *
FROM `checkbooknyc_capital_spending.*`
WHERE TRIM(LEFT(capital_project, 12)) IN (
  SELECT DISTINCT LPAD(CAST(managing_agcy_cd AS STRING), 3, '0') || REPLACE(project_id, ' ', '')
  FROM `fisa_capitalcommitments.20210501`
)
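One way to run the test query from Python, assuming it is saved to a local file test_query.sql (hypothetical path) and that GOOGLE_APPLICATION_CREDENTIALS points at the service account key; bq query works just as well from the shell.

from google.cloud import bigquery

client = bigquery.Client()
with open("test_query.sql") as f:
    sql = f.read()
rows = client.query(sql).result()  # blocks until the job finishes
print(f"matched {rows.total_rows} rows")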
Notes
bq and gsutil are installed when setting up the Google Cloud SDK.
The service account we are using has access to both Cloud Storage and BigQuery.
The default location for all our tables in BigQuery is US.
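For completeness, a small sketch of how those notes translate to the Python client; the key path is an assumption, and bq mk --location=US can do the same from the shell.

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("service-account.json")  # assumed key path
dataset = bigquery.Dataset(f"{client.project}.checkbooknyc_capital_spending")
dataset.location = "US"  # matches the default location noted above
client.create_dataset(dataset, exists_ok=True)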