First attempt:
I first tried to refactor the capital spending scraping process to use aiohttp to query the CheckbookNYC API concurrently. It still timed out, and since most requests return very few records, there was no real performance advantage. Several queries also failed outright and their results were never captured.
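For reference, a minimal sketch of the concurrent-request approach that was tried. The endpoint constant and the request payloads here are assumptions; the real CheckbookNYC request format is not reproduced.

import asyncio
import aiohttp

CHECKBOOK_API_URL = "https://www.checkbooknyc.com/api"  # assumed endpoint

async def fetch_one(session: aiohttp.ClientSession, payload: str) -> str:
    # POST one search request; most of these return only a handful of
    # records, so firing them concurrently saved very little time.
    async with session.post(CHECKBOOK_API_URL, data=payload) as resp:
        resp.raise_for_status()
        return await resp.text()

async def fetch_all(payloads: list[str]) -> list:
    timeout = aiohttp.ClientTimeout(total=60)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [fetch_one(session, p) for p in payloads]
        # return_exceptions=True surfaces failed queries instead of
        # silently dropping their results.
        return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(fetch_all(["<request>...</request>"]))  # placeholder payload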
Second try:
Use an Airflow DAG to request and store every single day of CheckbookNYC capital spending data.
Store the data to GCS -> BigQuery.
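A rough sketch of what that daily DAG could look like, assuming a hypothetical bucket name and leaving the actual CheckbookNYC fetch logic stubbed out:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import (
    GCSToBigQueryOperator,
)

def fetch_day(ds: str, **_) -> None:
    """Request one day of capital spending data and upload it to GCS."""
    ...  # call the CheckbookNYC API for date `ds`, write a CSV to the bucket

with DAG(
    dag_id="checkbooknyc_capital_spending",
    start_date=datetime(2010, 7, 1),
    schedule_interval="@daily",
    catchup=True,  # backfill every single day of history
) as dag:
    fetch = PythonOperator(task_id="fetch_day", python_callable=fetch_day)

    load = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="checkbooknyc-raw",  # assumed bucket name
        source_objects=["capital_spending/{{ ds }}.csv"],
        destination_project_dataset_table="checkbooknyc_capital_spending.{{ ds_nodash }}",
        source_format="CSV",
        write_disposition="WRITE_TRUNCATE",
    )

    fetch >> load

One table per day keeps the loads idempotent and matches the wildcard table pattern used in the query below.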
The current capital spending workflow pulls the latest fisa_capitalcommitments, stores it in BigQuery, and then uses a query to select all the relevant records from BigQuery.
Test query here:
SELECT DISTINCT *
FROM `checkbooknyc_capital_spending.*`
WHERE TRIM(LEFT(capital_project, 12)) IN (
  SELECT DISTINCT LPAD(CAST(managing_agcy_cd AS STRING), 3, '0') || REPLACE(project_id, ' ', '')
  FROM `fisa_capitalcommitments.20210501`
)
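One way to run the test query from Python, assuming it is saved to a local file test_query.sql (hypothetical path) and that GOOGLE_APPLICATION_CREDENTIALS points at the service account key; bq query works just as well from the shell.

from google.cloud import bigquery

client = bigquery.Client()
with open("test_query.sql") as f:
    sql = f.read()
rows = client.query(sql).result()  # blocks until the job finishes
print(f"matched {rows.total_rows} rows")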
Notes
bq and gsutil are installed when setting up the Google Cloud SDK.
The service account we are using has access to both Cloud Storage and BigQuery.
The default location for all our tables in BigQuery is US.
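For completeness, a small sketch of how those notes translate to the Python client; the key path is an assumption, and bq mk --location=US can do the same from the shell.

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("service-account.json")  # assumed key path
dataset = bigquery.Dataset(f"{client.project}.checkbooknyc_capital_spending")
dataset.location = "US"  # matches the default location noted above
client.create_dataset(dataset, exists_ok=True)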