datatogether / archivertools

Python package for scraping websites into the Data Together pipeline via morph.io
GNU Affero General Public License v3.0

Archiver UUID and url - are they still relevant? #25

Open jeffreyliu opened 6 years ago

jeffreyliu commented 6 years ago

Currently the constructor to Archiver takes two arguments: UUID and url. I'm wondering whether they're still relevant. The UUID is a holdover from the archivers.space workflow, and I'm not sure there's an analog in DT, so I think we can just remove it. Also, in #5, we discussed that a custom crawl could span multiple urls. This raises a couple of questions:
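
For reference, the interface under discussion looks roughly like this (a minimal sketch based on the description above; the parameter names are assumptions, not the package's actual signature):

```python
class Archiver:
    """Sketch of the current interface: a UUID plus a single url."""

    def __init__(self, uuid, url):
        self.uuid = uuid  # holdover from the archivers.space workflow
        self.url = url    # a single url; #5 notes a crawl may span several
```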

ebenp commented 6 years ago

I'm not sure the UUID is as relevant here as it was in archivers.space, since we are using the archiver tool on either child urls or byte files rather than on individual url pages. Maybe the UUID should be replaced by the scraper's root url, or removed altogether.

Regarding custom crawls spanning multiple urls: it seems a scraper has to begin at some url. Maybe that's our root url, serving as a reference for future scraper runs, and it gets set when the archiver tool is initialized? To distinguish the type of url collected, maybe data collection urls should be passed in the add data function while child urls are maintained through the add url function, as sketched below?
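
A hypothetical sketch of that split (the names `root_url`, `add_url`, and `add_data` are illustrative, taken from this comment rather than from the package's actual API):

```python
class Archiver:
    """Illustrative sketch of the proposal above, not the real API:
    the root url replaces the UUID as the run's reference point."""

    def __init__(self, root_url):
        self.root_url = root_url  # reference for future scraper runs
        self.child_urls = []      # urls discovered during the crawl
        self.data = []            # (url, bytes) pairs of collected data

    def add_url(self, url):
        """Record a child url found under the root."""
        self.child_urls.append(url)

    def add_data(self, url, content):
        """Record data collected from a specific url."""
        self.data.append((url, content))
```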

b5 commented 6 years ago

So, you know, it's only been four months, but yes, UUIDs should be ignored whenever possible. I'd favor hashes for blob content and urls for anything that has a clear association to... a URL.
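
Content-addressing blobs along those lines could be as simple as keying on a digest of the bytes (a sketch using the standard library; the actual pipeline may use a different hash or scheme):

```python
import hashlib


def blob_key(content: bytes) -> str:
    """Derive a stable identifier from blob content instead of a UUID."""
    return hashlib.sha256(content).hexdigest()


# The same bytes always map to the same key, so re-crawls deduplicate
# naturally and nothing depends on a randomly assigned UUID.
assert blob_key(b"hello") == blob_key(b"hello")
```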