ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Extract all the URLs #74

Open chfoo opened 10 years ago

chfoo commented 10 years ago

Heritrix has some nice URL scraping routines that may be useful.

An option like --extract-all-links or --no-extract-all-links should be provided.
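
For context, the kind of aggressive, regex-based link scraping that Heritrix's speculative extractors do could be sketched roughly like this (an illustrative Python approximation only, not Heritrix's or wpull's actual code):

```python
import re

# Illustrative only: a crude pattern in the spirit of Heritrix's speculative
# extractors, which scan raw text (JavaScript, CSS, etc.) for string literals
# that look like URLs or paths.
URL_LIKE = re.compile(
    r"""["'](
        https?://[^\s"'<>]+      # absolute URLs
        |
        /[^\s"'<>]+              # root-relative paths
    )["']""",
    re.VERBOSE,
)

def extract_candidate_links(text):
    """Return URL-like strings found anywhere in a blob of text."""
    return [match.group(1) for match in URL_LIKE.finditer(text)]

print(extract_candidate_links(
    'var api = "https://example.com/api/v1"; load("/static/app.js");'))
# ['https://example.com/api/v1', '/static/app.js']
```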

chfoo commented 10 years ago

An option like --follow-links {html,javascript,xml} might work better.

Edit: the --link-extractors option was used instead.

TheTechRobo commented 3 years ago

PDF links could be pulled out with pdfplumber. It can produce a JSON output that contains all the hyperlinks.

[screenshot of pdfplumber's JSON output]

This is the PDF used: [screenshot of the test PDF]
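
The same result can be sketched with pdfplumber's Python API instead of the JSON dump (a minimal sketch, assuming a recent pdfplumber where `Page.hyperlinks` is available and link annotations carry a `uri` key; `example.pdf` is just a placeholder filename):

```python
import pdfplumber

def extract_pdf_links(path):
    """Collect the target URIs of every link annotation in a PDF."""
    links = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks lists the annotations that point at a URI.
            for annot in page.hyperlinks:
                uri = annot.get("uri")
                if uri:
                    links.append(uri)
    return links

# Example: print every hyperlink target found in the placeholder example.pdf.
for url in extract_pdf_links("example.pdf"):
    print(url)
```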