18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Run an experimental scan for sloppy results #66

Closed gbinal closed 5 years ago

gbinal commented 5 years ago

This isn't urgent, but we should do it at some point. A one-off scan of all of this will help us make some important calls about the project later on.

timothy-spencer commented 5 years ago

What data do you need in the CSV? If I cut out the pagedata scanner entirely, the 22k scan will probably complete without issues, leaving us with resultcode and USWDS data. Is that enough for you to make good calls? As of now, the pagedata scanner is running out of memory, probably because some folks have a big data.json file or something like that.

Ideally, you would say "oh, we only need the 200 scan resultcode data here", because that would run quickly and not run out of memory because it is not downloading the actual pages. Hope springs eternal. If you can't say that, then I can try to figure out a way. :-)
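For illustration, a status-only check along these lines could look like the sketch below. It is not the project's actual scanner: the function name `status_only` and the `opener` parameter are hypothetical, and it assumes a HEAD request is enough to get a result code. Note that `urllib` follows redirects, so the code returned is the one for the final URL.

```python
import urllib.error
import urllib.request


def status_only(url, timeout=10, opener=urllib.request.urlopen):
    """Return the HTTP status code for url without downloading the body.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; it defaults to urllib.request.urlopen.
    """
    # A HEAD request asks the server for headers only, so no page
    # content is pulled into memory.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the code is still the result we want.
        return err.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar errors.
        return None
```

Some servers answer HEAD differently from GET, so a real scan might need a fallback; that is left out here.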

timothy-spencer commented 5 years ago

https://drive.google.com/open?id=1449DsR7Gq7Eoppk2DiQAmkRcrFfUeTmv has a basic 200 scan. I am still working on mixing in the final_url data, and getting the USWDS big scan working.

This is slow because some of these pages are big, which can run us out of memory and crash the scan. I am working on reading them in chunks so that we use a fixed amount of memory.
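As a rough sketch of the chunked-read idea (not the code that actually went into the scanner), the function below looks for a byte marker in a stream while holding only one chunk plus a small overlap in memory. The name `stream_contains`, the 64 KiB default, and the `b"uswds"` marker are assumptions made for the example.

```python
import io


def stream_contains(stream, marker, chunk_size=64 * 1024):
    """Return True if the bytes `marker` occur anywhere in `stream`."""
    if not marker:
        raise ValueError("marker must be non-empty")
    # Keep the last len(marker) - 1 bytes of each window so a match
    # that straddles two chunks is still found.
    overlap = len(marker) - 1
    tail = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # End of stream without a match.
            return False
        window = tail + chunk
        if marker in window:
            return True
        tail = window[-overlap:] if overlap else b""


# Usage: a large in-memory "page" stands in for a real HTTP response body.
page = io.BytesIO(b"<html>" + b"x" * 200_000 + b"uswds" + b"</html>")
found = stream_contains(page, b"uswds", chunk_size=4096)
```

Memory use stays at roughly `chunk_size` plus the marker length no matter how big the page is, which is the point of reading in chunks.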

timothy-spencer commented 5 years ago

https://drive.google.com/open?id=1NzywwkChmPETQ_2Wrxe0qBstUBOwiE2h has the USWDS scan in it.

timothy-spencer commented 5 years ago

https://drive.google.com/open?id=1N_NY46ezw-c8d0C5Hq4-V4m1NrCLeQfL is the scan that has the sloppy pages and final_url in it. I believe this is the final deliverable here, so I am closing this!

gbinal commented 5 years ago

Alas, there's still something to do with the finalURL data file, so reopening for now.

timothy-spencer commented 5 years ago

I had a typo (forgot a comma!), so I re-ran the scan:

https://drive.google.com/open?id=1N_NY46ezw-c8d0C5Hq4-V4m1NrCLeQfL

Let's hope you don't have to reopen this again! :-)

gbinal commented 5 years ago

One more time! The last, I promise!*

We just need to add these:

And we can now leave off if helpful:

gbinal commented 5 years ago

We've gotten a ton of actionable data from this and can close this now.