GSA / site-scanning

The central repository for the Site Scanning program
https://digital.gov/site-scanning

redress duplicates making it into the target URL list #966

Closed by gbinal 3 weeks ago

gbinal commented 1 month ago

From here.

Looks like 2 issues:

HEPIS.ed.gov hepis.ed.gov WWW.grants.gov www.grants.gov www.DEAdiversion.usdoj.gov www.deadiversion.usdoj.gov

www.dote.osd.mil www.airdomainintelligence.mil www.norad.mil www.dote.osd.mil www.jba.af.mil www.jbcharleston.jb.mil www.jble.af.mil www.jbmdl.jb.mil www.norad.mil www.jba.af.mil www.cemm.af.mil www.cemm.af.mil www.jba.af.mil www.jbcharleston.jb.mil www.jble.af.mil www.jbmdl.jb.mil www.airdomainintelligence.mil www.spaceforce.mil www.41fab.army.mil www.41fab.army.mil usapc.army.mil kirk.tricare.mil www.psmagazine.army.mil usapc.army.mil www.psmagazine.army.mil www.tf515.marines.mil www.cd.marines.mil www.marforres.marines.mil www.hqmc.marines.mil www.cd.marines.mil www.hqmc.marines.mil www.trngcmd.marines.mil www.trngcmd.marines.mil www.hqmc.marines.mil www.hqmc.marines.mil www.quantico.marines.mil www.29palms.marines.mil www.quantico.marines.mil www.trngcmd.marines.mil www.hqmc.marines.mil www.trngcmd.marines.mil www.marforres.marines.mil www.hqmc.marines.mil www.29palms.marines.mil www.marcorsyscom.marines.mil www.marcorsyscom.marines.mil www.tf515.marines.mil www.hqmc.marines.mil www.hqmc.marines.mil www.trngcmd.marines.mil www.trngcmd.marines.mil www.hqmc.marines.mil www.trngcmd.marines.mil www.trngcmd.marines.mil www.mynavyhr.navy.mil www.mynavyhr.navy.mil www.airpac.navy.mil www.netc.navy.mil www.netc.navy.mil www.navyreserve.navy.mil www.navyreserve.navy.mil www.airpac.navy.mil www.mycg.uscg.mil www.mycg.uscg.mil al.ng.mil www.spaceforce.mil guthrie.tricare.mil kirk.tricare.mil www.acq.osd.mil www.acq.osd.mil guthrie.tricare.mil al.ng.mil

gbinal commented 1 month ago

The first part is now addressed by forcing all URLs to lowercase when they are first ingested, regardless of the source.
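
For illustration only, here is a minimal Python sketch of lowercasing at ingest time. The function name `normalize_url` and the sample list are hypothetical and are not taken from the site-scanning codebase. It assumes the entries are bare hostnames like the ones listed above; hostnames are case-insensitive, so lowercasing them is safe, whereas a URL path may be case-sensitive and would need more care.

```python
def normalize_url(raw: str) -> str:
    """Trim surrounding whitespace and lowercase a bare-hostname URL.

    Hypothetical helper: assumes the input has no path component,
    since paths can be case-sensitive while hostnames are not.
    """
    return raw.strip().lower()


# Sample entries copied from the duplicate report above.
ingested = ["HEPIS.ed.gov", "hepis.ed.gov", "WWW.grants.gov", "www.grants.gov"]

# After normalization, case variants collapse to identical strings,
# which lets any later exact-match deduplication catch them.
normalized = [normalize_url(u) for u in ingested]
```

Applying this at the first point of ingestion means every later step only ever sees one spelling of each hostname.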

The latter chunk, all of those .mil sites, should now be addressed by ensuring that the .mil index generation process includes deduplication.
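
As a rough sketch of what a deduplication step could look like (not the actual .mil index generation code; `dedupe` and `mil_index` are hypothetical names), the Python below drops repeated entries while keeping the first occurrence of each in its original order:

```python
from typing import Iterable, List


def dedupe(urls: Iterable[str]) -> List[str]:
    """Return the URLs with exact duplicates removed, first occurrence kept.

    dict.fromkeys preserves insertion order, so the output order
    matches the order in which each URL first appeared.
    """
    return list(dict.fromkeys(urls))


# A few entries copied from the duplicate report above.
mil_index = [
    "www.dote.osd.mil",
    "www.airdomainintelligence.mil",
    "www.norad.mil",
    "www.dote.osd.mil",
    "www.norad.mil",
]

unique_mil_index = dedupe(mil_index)
```

This only removes exact string matches, so it assumes the entries have already been lowercased.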

gbinal commented 1 month ago

I'll keep this open until I confirm.