Open chfoo opened 10 years ago
An option like --follow-links {html,javascript,xml}
might work better.
Edit: --link-extractors
option was used instead.
PDF could be pulled off with "pdfplumber". It allows you to get a json (?) output that contains all the hyperlinks.
this is the pdf used:
Heritrix has some nice URL scraping routines that may be useful.
An option like
--extract-all-links
or either--no-extract-all-links
should be provided.<enclosure>
)