dakrone / itsy

A threaded web-spider written in Clojure

Domain limiting #3

Open clojens opened 11 years ago

clojens commented 11 years ago

Hey dakrone, you mention Itsy's domain-limiting capabilities; can you elaborate? In my case, I'd like to extract text only from pages whose domain matches a certain pattern. Of course I can hack this in somewhere, but I was wondering whether Itsy already has something like that. Perhaps you already know frak, but in case you don't, it might be of interest. Thanks for all your work.

Cheers, Rob (supersym)

dakrone commented 11 years ago

Sure, the host limiter allows you to limit the URLs that Itsy fetches based on a hostname.

If you set the :host-limit option to true, Itsy only fetches URLs on the same host as the original seed URL (so if you seed it with http://example.com/foo, Itsy limits itself to example.com). If you set :host-limit to a string instead, Itsy only fetches URLs whose host contains that string.
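To make the two modes concrete, here is a rough sketch of the matching rule in Python. It is not Itsy's actual Clojure code: `host_allowed` and its parameters are hypothetical names, and treating the `true` case as an exact host comparison is an assumption on my part.

```python
from urllib.parse import urlparse


def host_allowed(url, seed_url, host_limit):
    """Illustrative check: should `url` be fetched under `host_limit`?

    host_limit is True  -> only the seed URL's host is allowed
                           (exact match is an assumption of this sketch)
    host_limit is a str -> the URL's host must contain that string
    anything else       -> no host limiting
    """
    host = urlparse(url).hostname or ""  # hostname is lowercased by urlparse
    if host_limit is True:
        seed_host = urlparse(seed_url).hostname or ""
        return bool(host) and host == seed_host
    if isinstance(host_limit, str):
        return host_limit.lower() in host
    return True


seed = "http://example.com/foo"
print(host_allowed("http://example.com/bar", seed, True))
print(host_allowed("http://other.org/", seed, True))
print(host_allowed("http://blog.example.com/x", seed, "example"))
```

Note that the string form only looks at the host, so a URL like http://other.org/example would not match a limit of "example".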

Hopefully that helps explain it a bit more. It would be neat to use frak and parse the URL; I'll have to keep that in mind for the future. Thanks!