dreisman / WebCensusNotebook


Expose info about crawl errors in first party objects, if firstparty had failed during crawl. #24

Open dreisman opened 7 years ago

dreisman commented 7 years ago

The third party objects are not created if you just try to run a cell with fp.third_parties. You need to actually iterate through to get the objects. This isn't expected behavior.

dreisman commented 7 years ago

Never mind... a few first parties are missing third party data. Possibly from timeouts/errors during the crawl.

dreisman commented 7 years ago

Plan of action: Mark FPs as failures if there is no response data for a given site. Right now we only go off of requests.
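A rough sketch of that check, assuming crawl data per site is available as lists of requests and responses. SiteRecord and crawl_failed are hypothetical names for illustration, not existing code in this repo:

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Hypothetical stand-in for the crawl data recorded for one site."""
    url: str
    requests: list = field(default_factory=list)
    responses: list = field(default_factory=list)


def crawl_failed(site: SiteRecord) -> bool:
    # Assumption: a site with no response data counts as a failed visit,
    # even if requests were recorded for it.
    return not site.responses
```

Going off requests alone would count a site that timed out after its first request as a success, which is the case this check is meant to catch.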

dreisman commented 7 years ago

Not sure what the best option is here.

We could let the user access the FirstParty object for a site whose crawl failed, but raise an exception when third party data for that site is accessed ("Crawl failed to retrieve third party data for given FirstParty").

On one hand I like this because it still says "yes, we acknowledge that the site was in the crawl, but a crawl error prevents there from being complete data." The problem is that when iterating over cen.first_parties you will necessarily have to catch exceptions for data that we already know doesn't exist. That will complicate analysis code that does basic things like iterating over sites and getting their third parties.

Maybe we can have that, but when iterating you only iterate over sites that are known to have been successfully visited.
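Something like the following is what I have in mind. This is only a sketch: the FirstParty and Census classes here are simplified stand-ins, and CrawlFailedError, the failed flag, and the get() lookup are hypothetical names, not the current API.

```python
class CrawlFailedError(Exception):
    """Hypothetical exception raised when a failed site's data is accessed."""


class FirstParty:
    # Simplified stand-in; the real object is built from crawl data.
    def __init__(self, url, third_parties=None, failed=False):
        self.url = url
        self.failed = failed
        self._third_parties = list(third_parties or [])

    @property
    def third_parties(self):
        if self.failed:
            raise CrawlFailedError(
                "Crawl failed to retrieve third party data for given FirstParty"
            )
        return self._third_parties


class Census:
    def __init__(self, first_parties):
        self._by_url = {fp.url: fp for fp in first_parties}

    @property
    def first_parties(self):
        # Iteration only covers sites that were successfully visited.
        return [fp for fp in self._by_url.values() if not fp.failed]

    def get(self, url):
        # Direct lookup still returns failed sites, so the user can see
        # that the site was part of the crawl.
        return self._by_url[url]
```

With this split, a loop over cen.first_parties never needs a try/except, and the exception only shows up when someone explicitly looks up a failed site.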

englehardt commented 7 years ago

I think it's probably better to prevent these first parties from showing up in the results by default, and allow users to include them if they want to. You could have a parameter in the census constructor called something like include_failures=False. That way users can easily do things like compute the percentage of sites which have some property without worrying about filtering out all of the failed visits. You could also expose a list of these failed sites somewhere with the failure type.

If a user sets include_failures=True, then they can be included in the default set. In this case, since the user knows they are including bad data points, it's probably okay to just add a property to the FirstParty object that allows the user to check the failure type. No need to throw any errors.
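Roughly, as a sketch only: include_failures is the parameter suggested above, while failure_type, failed_sites, and percent_with_third_parties are hypothetical names, and the classes are simplified stand-ins rather than the notebook's actual implementation.

```python
class FirstParty:
    # Simplified stand-in for the real FirstParty object.
    def __init__(self, url, third_parties=(), failure_type=None):
        self.url = url
        self.third_parties = list(third_parties)
        # Hypothetical property: None means the visit succeeded,
        # otherwise a short label such as "timeout".
        self.failure_type = failure_type


class Census:
    def __init__(self, first_parties, include_failures=False):
        self._all = list(first_parties)
        self.include_failures = include_failures

    @property
    def first_parties(self):
        if self.include_failures:
            return list(self._all)
        # Default: failed visits are filtered out, no errors raised.
        return [fp for fp in self._all if fp.failure_type is None]

    @property
    def failed_sites(self):
        # Failed sites and their failure type, available either way.
        return {
            fp.url: fp.failure_type
            for fp in self._all
            if fp.failure_type is not None
        }


def percent_with_third_parties(census):
    """Percentage of iterated sites that have at least one third party."""
    sites = census.first_parties
    if not sites:
        return 0.0
    return 100.0 * sum(1 for fp in sites if fp.third_parties) / len(sites)
```

With the default, the percentage is computed over successful visits only. With include_failures=True the failed sites enter the denominator, and the user can check fp.failure_type to handle them.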