18F / site-scanning

The code base for the first Site Scanning engine
https://digital.gov/site-scanning

Reference one domain list instead of multiple? #424

Closed anjunainaustin closed 4 years ago

anjunainaustin commented 4 years ago

As a user or staff member, I would like to know that the domain list being scanned is the most comprehensive available in TTS.

If https://pulse.cio.gov/https/domains/ is that list, can we use that?

Acceptance Criteria

  1. Site Scanner pulls directly from a single domain-list CSV file so that we do not need to continuously update our list to match that domain list.
  2. If Pulse is that list, then we state that clearly on the website, along with an explanation of its limitations.

Also, here are a few points that need to be addressed:

  1. Some folks in TTS feel that TTS Solutions should all use the same list.
  2. Different Solutions are using different domain lists, and it is unclear why.
  3. We have been asked to note the discrepancies we've discovered through our work, and will still need to determine if this is possible to do in the scope of our project.
anjunainaustin commented 4 years ago

@timothy-spencer making an issue of the question I posed in Slack so I don't lose track of it, that's all.

timothy-spencer commented 4 years ago

We do incorporate the Pulse list: https://github.com/18F/site-scanning/pull/413. We also list it as one of our sources on https://site-scanning.app.cloud.gov/. Every time we do a scan, we pull down the latest version of the lists of domains documented on that page.

Our stance is that there is no canonical list. It always will be changing. So we are grabbing lists from as many sources as we can get and de-duping them and scanning them all. More data better! People can always filter out data that they don't need. I wouldn't be surprised if we had a better list than anybody else these days. Or maybe not "better", but we probably have more domains than anybody else.
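The approach described above (pull every source's list at scan time, union them, and de-dupe) can be sketched roughly as follows. This is an illustration only, not the repo's actual code: the URLs are placeholders and the `Domain Name` column header is an assumption about the CSV layout.

```python
import csv
import io
import urllib.request

# Placeholder source URLs -- the real scanner documents its sources
# on https://site-scanning.app.cloud.gov/
SOURCES = [
    "https://example.gov/current-federal.csv",
    "https://example.gov/pulse-https-domains.csv",
]


def load_domains(csv_text, column="Domain Name"):
    """Extract a set of lowercase domain names from one CSV's text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column].strip().lower() for row in reader if row.get(column)}


def merged_domain_list(fetch=lambda url: urllib.request.urlopen(url).read().decode()):
    """Fetch every source at scan time and union the domains.

    Using a set gives de-duplication for free; sorting makes the
    output stable from run to run.
    """
    domains = set()
    for url in SOURCES:
        domains |= load_domains(fetch(url))
    return sorted(domains)
```

Because the lists are re-fetched on every scan rather than vendored into the repo, the merged set tracks upstream changes automatically, which is the point being made above.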

anjunainaustin commented 4 years ago

@timothy-spencer I'm going to reopen this because I'm actually referring to only using Pulse. I'm hoping we can also think through the pros/cons of stitching together multiple lists in light of risks to Site Scanner's perceived accuracy.

timothy-spencer commented 4 years ago

Our accuracy is 100%. :-) For the domains we scan, we get results for every one of them. :-) For the domains that are live, we get back response codes. For the ones that are not, we get back -1.

I suspect that some people may say "whoa, I don't care about all of these subdomains! They make it hard for me to find the domains that I really care about!". That's fine. Those folks can talk with us about filtering. But it is better to have more domains, and have to filter them down than it is to not have domains that people want to see. If you recall, a few weeks ago, we had many folks clamoring for us to get subdomains working. :-)
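The filtering idea above can be sketched with the result semantics described in this thread: live domains carry an HTTP response code, and domains that are not live carry `-1`. The record shape and field names here are assumptions for illustration, not the scanner's actual schema.

```python
def live_only(results):
    """Drop domains whose scan returned the -1 'not live' sentinel."""
    return [r for r in results if r["responsecode"] != -1]


def second_level_only(results):
    """Keep second-level domains (e.g. gsa.gov) and drop subdomains,
    for users who don't want the full subdomain inventory."""
    return [r for r in results if r["domain"].count(".") == 1]
```

Keeping the full superset and filtering downstream, as above, preserves data (such as dead domains that later come back to life) that a pre-trimmed list would have discarded.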

What we are scanning is the most comprehensive list of sites that I know of. If we just use Pulse, we will reduce the usefulness of our system, because:

  1. Pulse's list seems not to have been updated for ~3 years.
  2. Pulse's list has had sites that were redirected or inactive removed from it. That is good data too: if those sites become live again, people will want to know about them.
  3. Pulse does not have all of the metadata that the current-federal.csv list has, which seems to be updated regularly and has been very valuable in interpolating metadata into other domains.