GSA / data.gov

Main repository for the data.gov service
https://data.gov

Harden WAF ETL pipeline #4598

Open btylerburton opened 8 months ago

btylerburton commented 8 months ago

User Story

In order to harvest WAF sources effectively and at scale, datagovteam would like to harden the current WAF ETL pipeline.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

Background

[Any helpful contextual notes or links to artifacts/evidence, if needed]

Security Considerations (required)

[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]

Sketch

rshewitt commented 6 months ago

noaa waf

rshewitt commented 6 months ago

Processing reached 12 hours for the NOAA WAF, so I stopped it (the conclusion being that it's going to take a while). I duplicated our WAF test and added a new fixture with an updated URL; I didn't commit anything. Given how long it had been running, I didn't see the benefit of knowing exactly how much longer it would take. The bottleneck is requesting/downloading the documents. By comparison, requesting the initial page, parsing it with BeautifulSoup, and getting a list of all the anchors with a populated href attribute took 46 seconds (this is our WAF traversal function).
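For readers unfamiliar with the traversal step, here is a rough sketch of what "get all anchors with a populated href" can look like. It is not the project's actual function: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies, it parses an inline HTML string instead of requesting a page, and the names `AnchorCollector`, `traverse_waf`, and the `example.gov` URL are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag that has a non-empty href."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for a bare attribute such as <a href>
            if name == "href" and value and value.strip():
                self.hrefs.append(value.strip())
                break


def traverse_waf(html, base_url):
    """Return absolute URLs for all anchors with a populated href.

    Hypothetical helper: the real pipeline fetches the page first and
    parses it with BeautifulSoup.
    """
    collector = AnchorCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Stand-in for a WAF index page (assumed layout, not the NOAA page).
sample = """
<html><body>
<a href="record1.xml">record1.xml</a>
<a href="">empty</a>
<a name="top">no href</a>
<a href="sub/record2.xml">record2.xml</a>
</body></html>
"""

urls = traverse_waf(sample, "https://example.gov/waf/")
for url in urls:
    print(url)
```

Anchors with an empty or missing href are skipped, and relative links are resolved against the page URL, so the result is a flat list of document URLs ready for the download step.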

conclusion of test

rshewitt commented 6 months ago

JSON with a list of all WAF URLs: waf_sources.json

rshewitt commented 6 months ago

Pausing on this. More discussion on WAF is needed.