Open jeffreyliu opened 7 years ago
Here's my attempt at this: https://gist.github.com/ebenp/900cea9b3f3c3b1c747667e831303555#file-extract_href
Updated the gist to extract href URLs. I don't get the same number of duplicates as the Go script, and I'm missing the URLs for https://www.epa.gov/, of which there are 3.
@b5 mentioned that the `extract_href` tool would be a good fit within archivertools and I agree. The tool automatically scans an HTML page for links and outputs them to a file - it makes sense for us to automatically run this in the constructor of Archiver, and call `Archiver.addUrl()` on each of the outputs of the function. It is currently implemented in Go, so we will need to port it to Python.
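For discussion, here is a rough sketch of what the Python port could look like, using only the standard library. This is not the Go tool's logic or the gist's code. `HrefCollector`, `extract_hrefs`, and `add_all` are hypothetical names, and I'm assuming `Archiver.addUrl()` takes a single URL string, which should be checked against archivertools. Duplicates are kept in the output so the counts can be compared with the Go script.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page URL.
                self.hrefs.append(urljoin(self.base_url, value))


def extract_hrefs(html, base_url=""):
    """Return all link targets found in html, in order, duplicates included."""
    parser = HrefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.hrefs


def add_all(archiver, html, base_url=""):
    """Call archiver.addUrl(url) for each link (assumed signature)."""
    for url in extract_hrefs(html, base_url):
        archiver.addUrl(url)
```

If this shape works, the Archiver constructor could call something like `add_all(self, page_html, page_url)` once it has the page contents.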