NYCPlanning / db-cpdb

Capital Projects Database
https://nycplanning.github.io/db-cpdb

Refactor capital spending #63

Closed SPTKL closed 3 years ago

SPTKL commented 3 years ago

First attempt:

I first tried to refactor the capital spending scraping process by using aiohttp to request the Checkbook NYC API. However, it still timed out, because most of the requests return very few records, so there's no performance advantage. Also, multiple queries failed and their results were not captured.
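As a minimal sketch of the concurrency pattern described above (the actual code isn't shown in this issue, so `fetch` here is a stand-in for the real aiohttp call to the Checkbook NYC API, and the failure condition is simulated):

```python
import asyncio

# Stand-in for the real aiohttp request to the Checkbook NYC API; the actual
# endpoint, payload, and response parsing are not shown in the issue, so this
# just simulates a request that may fail.
async def fetch(project_id: str) -> dict:
    await asyncio.sleep(0)  # yield control, as a real network call would
    if project_id.endswith("X"):  # simulated failure
        raise RuntimeError(f"request failed for {project_id}")
    return {"project": project_id, "rows": []}

async def scrape(project_ids):
    # return_exceptions=True keeps failed queries in the result list instead
    # of silently dropping them, so they can be logged and retried later --
    # addressing the "results were not captured" problem above.
    results = await asyncio.gather(*(fetch(p) for p in project_ids),
                                   return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [p for p, r in zip(project_ids, results)
              if isinstance(r, Exception)]
    return ok, failed

ok, failed = asyncio.run(scrape(["850PW001", "850PWX"]))
print(len(ok), failed)  # → 1 ['850PWX']
```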

Second try:

  1. Use an Airflow DAG to request and store every single day of Checkbook NYC capital spending data
  2. Store the data in GCS -> load it into BigQuery
  3. The current capital spending workflow will pull the latest fisa_capitalcommitments -> store it in BigQuery -> then use a query to select all the relevant records from BigQuery
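Since the DAG requests one day at a time, the GCS objects and BigQuery tables naturally end up date-suffixed (the test query below relies on a `checkbooknyc_capital_spending.*` wildcard and a `fisa_capitalcommitments.20210501` table). A sketch of generating those per-day targets — the bucket name is hypothetical, not taken from the issue:

```python
from datetime import date, timedelta

def daily_partitions(start: date, end: date):
    """Yield one (gcs_path, bq_table) pair per day in [start, end].

    The bucket name is a placeholder; the date-suffixed table naming matches
    the wildcard pattern used in the test query.
    """
    d = start
    while d <= end:
        stamp = d.strftime("%Y%m%d")
        yield (f"gs://example-bucket/checkbooknyc/{stamp}.csv",
               f"checkbooknyc_capital_spending.{stamp}")
        d += timedelta(days=1)

pairs = list(daily_partitions(date(2021, 5, 1), date(2021, 5, 3)))
print(pairs[0][1])  # → checkbooknyc_capital_spending.20210501
```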

test query here:

SELECT DISTINCT * FROM `checkbooknyc_capital_spending.*`
WHERE CAST(TRIM(LEFT(capital_project, 12)) AS STRING) IN (
    SELECT DISTINCT LPAD(CAST(managing_agcy_cd AS STRING), 3, '0') || REPLACE(project_id, ' ', '')
    FROM `fisa_capitalcommitments.20210501`
)
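The join key here is the first 12 characters of `capital_project` on the spending side, matched against the 3-digit-padded `managing_agcy_cd` concatenated with the de-spaced `project_id` on the commitments side. The same key logic in Python (the sample values are illustrative, not from the data):

```python
def commitment_key(managing_agcy_cd, project_id: str) -> str:
    # LPAD(CAST(managing_agcy_cd AS STRING), 3, '0') || REPLACE(project_id, ' ', '')
    return str(managing_agcy_cd).rjust(3, "0") + project_id.replace(" ", "")

def spending_key(capital_project: str) -> str:
    # CAST(TRIM(LEFT(capital_project, 12)) AS STRING)
    return capital_project[:12].strip()

print(commitment_key(50, "PW 001"))  # → 050PW001
```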

Notes

  1. bq and gsutil are installed when setting up the Google Cloud SDK
  2. the service account we are using has access to both Cloud Storage and BigQuery
  3. the default location for all our tables in BigQuery is US
  4. bq: load data from CSV
  5. bq: mk datasets and tables
  6. bq: query data and store the result to a table
  7. bq: export a table to CSV
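Notes 4–7 map onto standard `bq` CLI subcommands. A sketch of the corresponding invocations as command strings — the dataset, table, and bucket names are placeholders, not values from this project:

```python
# Placeholder names throughout; only the subcommands and flags are standard
# bq CLI usage.
commands = [
    # note 5 -- mk: create a dataset (location defaults can also be set here)
    "bq mk --dataset --location=US mydataset",
    # note 4 -- load: load a CSV from GCS into a table, autodetecting the schema
    "bq load --source_format=CSV --autodetect mydataset.mytable gs://example-bucket/data.csv",
    # note 6 -- query: run standard SQL and store the result in a destination table
    "bq query --use_legacy_sql=false --destination_table=mydataset.result 'SELECT 1 AS x'",
    # note 7 -- extract: export a table back to CSV on GCS
    "bq extract --destination_format=CSV mydataset.result gs://example-bucket/result.csv",
]
print("\n".join(commands))
```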