ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Extract all the URLs #74

Open chfoo opened 10 years ago

chfoo commented 10 years ago

Heritrix has some nice URL scraping routines that may be useful.

An option like --extract-all-links or --no-extract-all-links should be provided.
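
For context, the kind of aggressive, regex-based link scraping that Heritrix's speculative extractors do could be sketched roughly like this (an illustrative Python approximation only, not Heritrix's or wpull's actual code):

```python
import re

# Illustrative only: a crude pattern in the spirit of Heritrix's speculative
# extractors, which scan raw text (JavaScript, CSS, etc.) for string literals
# that look like URLs or paths.
URL_LIKE = re.compile(
    r"""["'](
        https?://[^\s"'<>]+      # absolute URLs
        |
        /[^\s"'<>]+              # root-relative paths
    )["']""",
    re.VERBOSE,
)

def extract_candidate_links(text):
    """Return URL-like strings found anywhere in a blob of text."""
    return [match.group(1) for match in URL_LIKE.finditer(text)]

print(extract_candidate_links(
    'var api = "https://example.com/api/v1"; load("/static/app.js");'))
# ['https://example.com/api/v1', '/static/app.js']
```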

chfoo commented 10 years ago

An option like --follow-links {html,javascript,xml} might work better.

Edit: the --link-extractors option was used instead.

TheTechRobo commented 3 years ago

PDF links could be pulled out with pdfplumber. It can produce a JSON output that contains all the hyperlinks.

[screenshot of pdfplumber's JSON output]

This is the PDF used: [screenshot of the test PDF]
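
The same result can be sketched with pdfplumber's Python API instead of the JSON dump (a minimal sketch, assuming a recent pdfplumber where `Page.hyperlinks` is available and link annotations carry a `uri` key; `example.pdf` is just a placeholder filename):

```python
import pdfplumber

def extract_pdf_links(path):
    """Collect the target URIs of every link annotation in a PDF."""
    links = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # page.hyperlinks lists the annotations that point at a URI.
            for annot in page.hyperlinks:
                uri = annot.get("uri")
                if uri:
                    links.append(uri)
    return links

# Example: print every hyperlink target found in the placeholder example.pdf.
for url in extract_pdf_links("example.pdf"):
    print(url)
```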