datatogether / archivertools

Python package for scraping websites into the Data Together pipeline via morph.io
GNU Affero General Public License v3.0

Porting `extract_href` tool into archivertools #6

Open jeffreyliu opened 7 years ago

jeffreyliu commented 7 years ago

@b5 mentioned that the `extract_href` tool would be a good fit within archivertools, and I agree. The tool automatically scans an HTML page for links and outputs them to a file; it makes sense for us to run it automatically in the constructor of `Archiver` and call `Archiver.addUrl()` on each of the URLs it finds.

It is currently implemented in Go, so we will need to port it to Python.
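For reference, a minimal sketch of what that integration could look like. Only `Archiver.addUrl()` is taken from the discussion above; the `extract_hrefs` helper and the constructor signature here are assumptions, not the actual archivertools API.

```python
# Hypothetical sketch: scan the page for links in the Archiver constructor
# and register each discovered URL via addUrl(). The extract_hrefs helper
# and the constructor arguments are assumptions for illustration only.
from html.parser import HTMLParser


class _HrefParser(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all href values found in the given HTML string."""
    parser = _HrefParser()
    parser.feed(html)
    return parser.hrefs


class Archiver:
    def __init__(self, html):
        self.urls = []
        # Automatically scan the page for links and queue each one.
        for href in extract_hrefs(html):
            self.addUrl(href)

    def addUrl(self, url):
        self.urls.append(url)
```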

ebenp commented 7 years ago

Here's my attempt at this: https://gist.github.com/ebenp/900cea9b3f3c3b1c747667e831303555#file-extract_href

ebenp commented 6 years ago

Updated the gist to extract href URLs. I don't get the same number of duplicates as the Go script, and I'm missing the URLs for https://www.epa.gov/, of which there are 3.
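One possible source of a count mismatch is how relative links are handled. A rough sketch of the port (assuming `requests` and `beautifulsoup4` are available; this is not the gist's exact implementation) that resolves each href against the page URL before de-duplicating, which changes both the duplicate count and which absolute URLs show up:

```python
# Hypothetical sketch of an extract_href port: fetch a page, resolve each
# href against the page URL, and de-duplicate while preserving order.
# Requires requests and beautifulsoup4.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def extract_href(page_url):
    """Return the unique, absolute href URLs found on page_url."""
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    urls = []
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])  # resolve relative links
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    for url in extract_href("https://www.epa.gov/"):
        print(url)
```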