GSA / site-scanning

The central repository for the Site Scanning program

[need to break into multiple issues] update index building process #986

Closed gbinal closed 1 month ago

gbinal commented 1 month ago

The main goal is to get the .mil sites in the omb_idea file into the .mil index, but the whole index-building process is worth a quick think-through.

gbinal commented 1 month ago

(bold are more urgent; not bold less)

1) ~We are applying the executive branch too broadly with the .gov websites. Instead of the current logic, base domains should be compared against this list and then applied. An example where this is an issue is websites from the uscourts source list.~
2) ~base domain agency and bureau should be pulled in for websites on the pulse list the same way as the rest (maybe also the .gov domain list if we want to just use the same method across everybody, but not a huge deal)~
3) ~we need to import multiple more .mil datasets - 2020_eot; dotmil_domains; gov_man_22; oira, omb_idea; dap, other websites~
4) ~use the ignore lists that are in use for the .gov domain for the .mil domain, too~
5) ~create snapshots for the .mil files just as much as the .gov ones~
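
For discussion, items 1 and 4 could look roughly like the sketch below. This is not the actual index-building code; the function names, the sample domains, and the `"Unknown"` fallback are all assumptions made up for illustration, and the base-domain helper is deliberately naive (last two labels only).

```python
# Hypothetical sample data; the real lists live in the source files
# referenced in the items above.
EXECUTIVE_BASE_DOMAINS = {"gsa.gov", "nasa.gov"}
IGNORED_HOSTS = {"staging.example.mil"}


def base_domain(host: str) -> str:
    """Naive base domain: the last two labels of the hostname."""
    return ".".join(host.lower().rstrip(".").split(".")[-2:])


def branch_for(host: str) -> str:
    """Item 1: apply 'Executive' only when the base domain is on the list."""
    if base_domain(host) in EXECUTIVE_BASE_DOMAINS:
        return "Executive"
    return "Unknown"  # assumed fallback, e.g. for uscourts.gov sites


def build_index(candidates, ignored=IGNORED_HOSTS):
    """Item 4: run one shared ignore list over .gov and .mil candidates."""
    index = []
    for host in candidates:
        host = host.lower()
        if host in ignored:
            continue  # dropped regardless of top-level domain
        index.append(
            {
                "host": host,
                "base_domain": base_domain(host),
                "branch": branch_for(host),
            }
        )
    return index
```

The point of the lookup in `branch_for` is that a `.gov` suffix alone no longer implies the executive branch; a site only gets that label when its base domain matches the list.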

Moved these to: