btylerburton opened 9 months ago
Processing reached 12 hours for the NOAA WAF, so I stopped it (the conclusion being that it's going to take a while). I duplicated our WAF test but added a new fixture with an updated URL; I didn't commit anything. Considering how long it had been running, I didn't see the benefit of knowing exactly how much longer it would take. The bottleneck is requesting/downloading the documents. Requesting the initial page, parsing it with BeautifulSoup, and getting a list of all the anchors with a populated href attribute took 46 seconds (this is our WAF traversal function).
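For context, the traversal step described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the function names (`extract_hrefs`, `list_waf_links`) are hypothetical, and the real implementation may differ in how it filters or normalizes links.

```python
# Hedged sketch of a WAF traversal step: fetch an index page and collect
# every anchor with a populated href attribute. Function names here are
# illustrative, not the pipeline's actual API.
import requests
from bs4 import BeautifulSoup


def extract_hrefs(html: str) -> list[str]:
    """Return all non-empty href values from anchor tags in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True) if a["href"].strip()]


def list_waf_links(url: str, timeout: float = 30.0) -> list[str]:
    """Fetch a WAF index page and list the links it exposes."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return extract_hrefs(resp.text)
```

Splitting the parse step out of the fetch step makes the anchor extraction testable without network access, which also makes it easier to profile where the 46 seconds actually goes.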
Conclusion of test
JSON file with a list of all WAF URLs: waf_sources.json
Pausing on this; more discussion on WAF is needed.
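Since the bottleneck is network I/O (requesting/downloading documents), one obvious hardening direction is overlapping the fetches. Below is a hedged sketch using a thread pool; it assumes `waf_sources.json` is a flat JSON list of URLs (the actual schema isn't specified here), and `fetch`/`fetch_all` are illustrative names, not pipeline functions.

```python
# Hedged sketch: overlap document downloads with a thread pool, since the
# observed bottleneck is the network requests. Assumes waf_sources.json
# is a JSON array of URL strings (schema not confirmed).
import json
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url: str) -> tuple[str, int]:
    """Request one document and report its URL and HTTP status."""
    resp = requests.get(url, timeout=30)
    return url, resp.status_code


def fetch_all(path: str = "waf_sources.json", workers: int = 8) -> list[tuple[str, int]]:
    """Load the URL list and fetch each entry concurrently."""
    with open(path) as f:
        urls = json.load(f)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

A thread pool is enough here because the work is I/O-bound; worker count would need tuning against the WAF servers' tolerance for concurrent requests.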
User Story
In order to harvest WAF sources effectively and at scale, datagovteam would like to harden the current WAF ETL pipeline.
Acceptance Criteria
[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]
Background
[Any helpful contextual notes or links to artifacts/evidence, if needed]
Security Considerations (required)
[Any security concerns that might be implicated in the change. "None" is OK, just be explicit here!]
Sketch