parsing HTML to grab links, resources

nwtn commented 11 years ago

It would be possible to use regex to try to find anchors, CSS, and JS, but this could end up being very messy. I'd suggest using an HTML-parsing library but, since Python is super new to me, I don't know which one might be most appropriate.

Googling led me to Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/). Any other suggestions?
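For illustration, here is a minimal sketch of parser-based extraction as opposed to regex. It uses the standard library's `html.parser` so it runs without extra installs; it is not Beautiful Soup, which offers a friendlier API for the same job. The names `ResourceCollector` and `collect_resources` are hypothetical.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect anchor, stylesheet, and script URLs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self.stylesheets = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; names are lowercased
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.anchors.append(attrs["href"])
        elif tag == "link" and attrs.get("href"):
            # rel may hold several space-separated tokens
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.stylesheets.append(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self.scripts.append(attrs["src"])


def collect_resources(html):
    """Parse an HTML string and return the populated collector."""
    parser = ResourceCollector()
    parser.feed(html)
    parser.close()
    return parser
```

The collected URLs are returned as written in the markup, so relative ones would still need resolving against the page URL (for example with `urllib.parse.urljoin`).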

yoavweiss commented 11 years ago

:+1: for Beautiful Soup. Feel free to bug me on IRC/mail with questions, etc.

mfaure commented 11 years ago

We have been using Heritrix (https://github.com/internetarchive/heritrix3) for 4 years, and we're really happy with it. Heritrix is the crawler used by (and developed by?) archive.org. It is industrial-grade, deals easily with server-side issues (throttling and so on), and is highly configurable. You can even create configuration templates.

As for features, it grabs HTML, CSS, and JS (tested) and Flash (not tested yet).

yoavweiss commented 11 years ago

Very interesting! It's true that moving from "fetcher" to "crawler" requires a lot of logic and it might be better to use existing tools. I'll look into it.

marcoscaceres commented 11 years ago

Agree… this could be good too.

nwtn commented 11 years ago

Heritrix is Java, which is not something I'm entirely comfortable with. Does anybody else want to take this on? Or, shall I continue w/ Python (BeautifulSoup) for now?

marcoscaceres commented 11 years ago

It does seem kinda silly to build our own, tbh. I'd be happy if we could integrate Heritrix... though I need to test it on my Mac. There are a lot of Java issues with Macs these days.

mfaure commented 11 years ago

Perhaps we could help. First, install and configure Heritrix (on a Linux box):

https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+Installation
https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+Configuration

Next, run Heritrix with the suitable "profile". One profile = one website, and one job = one crawl of a given profile. The idea is to have a profile template that a shell/Python script fills in with the entries from the .csv.

We could even have different templates: say, one to fetch a single page, a second to fetch a page and its resources (CSS, JS, Flash...), and a third to fetch N pages. (We are preparing the 3 profiles from the ones we already have.)
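A rough sketch of the template-filling idea, assuming the .csv has `name` and `url` columns (the real column layout is not stated here). The template text is a hypothetical placeholder, not an actual Heritrix profile, and `profiles_from_csv` is an illustrative name.

```python
import csv
import io
from string import Template

# Hypothetical placeholder; a real Heritrix profile is a much larger
# configuration file with its own format.
PROFILE_TEMPLATE = Template(
    "# profile for $name\n"
    "seed=$url\n"
    "max_pages=$max_pages\n"
)


def profiles_from_csv(csv_file, max_pages=1):
    """Yield (name, profile_text) for each row of the CSV.

    Assumes a header row with 'name' and 'url' columns.
    """
    for row in csv.DictReader(csv_file):
        text = PROFILE_TEMPLATE.substitute(
            name=row["name"], url=row["url"], max_pages=max_pages
        )
        yield row["name"], text


if __name__ == "__main__":
    sample = io.StringIO("name,url\nexample,http://example.com/\n")
    for name, profile in profiles_from_csv(sample):
        print(name)
        print(profile)
```

Having one template per crawl type (single page, page plus resources, N pages) would then just mean swapping `PROFILE_TEMPLATE`.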

Heritrix runs as a service and has an API.

The shell/python script could call the suitable function (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix+3.x+API+Guide#Heritrix3.xAPIGuide-SubmittingaCXMLJobConfigurationFile) to run a fetch for each of the URLs in the csv file.
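As a sketch of what the script side might look like, the helper below only builds the HTTP request and does not send it. The endpoint shape (`/engine/job/<name>`), the default port, and the `action` form parameter are assumptions that should be checked against the API guide linked above; authentication and TLS certificate handling are left out entirely. `build_job_action_request` is a hypothetical name.

```python
import urllib.parse
import urllib.request


def build_job_action_request(job_name, action,
                             base_url="https://localhost:8443/engine"):
    """Build (but do not send) a POST request for a job action.

    ASSUMPTION: the URL layout and the 'action' parameter are not confirmed
    in this thread; verify them against the Heritrix 3.x API guide.
    """
    url = "{}/job/{}".format(
        base_url.rstrip("/"), urllib.parse.quote(job_name, safe="")
    )
    data = urllib.parse.urlencode({"action": action}).encode("ascii")
    return urllib.request.Request(
        url, data=data, method="POST", headers={"Accept": "application/xml"}
    )


if __name__ == "__main__":
    # One job per site, as described above; job names here are made up.
    for job in ["site-1", "site-2"]:
        req = build_job_action_request(job, "launch")
        print(req.get_method(), req.full_url, req.data)
```

Keeping request construction separate from sending makes it easy to loop over the csv entries and inspect what would be sent before touching a live Heritrix instance.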

What do you think?

oli commented 10 years ago

It’d be great if the Fetcher followed meta redirects too, something I assume Heritrix already does.
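For reference, a minimal sketch of detecting a meta redirect, assuming the common `<meta http-equiv="refresh" content="0; url=...">` form. It uses only the standard library; `MetaRefreshFinder` and `meta_redirect_target` are hypothetical names, and the content parsing is deliberately simple rather than a full implementation of browser behavior.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches "<delay>; url=<target>" (the "url=" prefix is optional).
_REFRESH_RE = re.compile(
    r"^\s*\d+(?:\.\d+)?\s*[;,]\s*(?:url\s*=\s*)?(.+?)\s*$", re.IGNORECASE
)


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first meta refresh that names a URL."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        match = _REFRESH_RE.match(attrs.get("content") or "")
        if match:
            # Some pages wrap the URL in quotes inside the content value
            self.target = match.group(1).strip("'\"")


def meta_redirect_target(html, base_url):
    """Return the absolute redirect URL, or None if the page has none."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    finder.close()
    if finder.target is None:
        return None
    return urljoin(base_url, finder.target)
```

A fetcher could call this on each downloaded page and re-fetch when it returns a URL, with a cap on the number of hops to avoid redirect loops.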
