Ability to add metadata to crawl queue

GoogleCodeExporter commented 9 years ago

I am currently using Abot to crawl a CMS site. The data from the crawls is 
used to monitor site status and to generate a report of failing links so 
they can be fixed by the owner of the site. 

However, in order to generate this report, I need to know more about the pages 
to be crawled (such as the link text of the anchor pointing to the page, and 
whether or not the link points to an image). I have modified my own code to 
crawl ILinkInfo objects instead of Urls. Is this something that could be 
included in the main source? I can implement it and send a pull request on 
GitHub if this would be nice to have in the main branch.

Original issue reported on code.google.com by d.st...@gmail.com on 18 Dec 2013 at 12:17

GoogleCodeExporter commented 9 years ago

I love the idea of having more information about the links, but I am hesitant 
to add any more parsing than needs to happen, since most people wouldn't need 
the link text or need to know whether the link was an image. Can you first 
attach the impl that actually fills/returns the list of ILinkInfo objects so I 
can take a quick look? 

Thank you for offering your code!!!!

Original comment by sjdir...@gmail.com on 18 Dec 2013 at 9:40

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Here is my code. I have changed the HyperLinkParser to return a list of 
ILinkInfo instead of the Uri list that is returned now. In addition, I have 
changed the interface for the PageRequester to crawl PageToCrawl objects 
directly instead of the Uri object it currently accepts. When I want to crawl 
extra metadata, I can then extend ILinkInfo and update my own HyperLinkParser 
accordingly. The only remaining work would be to implement something 
like a PageToCrawl.Bag for storing the metadata. I have done this the ugly way 
locally (by just modifying the PageToCrawl class), so I am not sharing that 
code. Also, I have not updated the CsQueryHyperLinkParser, as I am using the 
HAP parser :)
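
For readers without the attachment, the idea above can be sketched roughly as follows. This is not the attached code: the members of ILinkInfo, the AnchorLinkInfo class, the LinkInfoSketch helper, and the extension-based image check are all assumptions made for illustration, and the regex stands in for a real HTML parser such as HAP.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical shape; the ILinkInfo in the attachment may look different.
public interface ILinkInfo
{
    Uri Uri { get; }
}

// Hypothetical extension carrying the extra metadata needed for the report.
public class AnchorLinkInfo : ILinkInfo
{
    public Uri Uri { get; set; }
    public string LinkText { get; set; }
    public bool IsImage { get; set; }
}

public static class LinkInfoSketch
{
    // Deliberately naive pattern; a real parser would walk the DOM instead.
    private static readonly Regex Anchor = new Regex(
        "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns link metadata rather than bare Uris.
    public static List<ILinkInfo> GetLinks(Uri pageUri, string html)
    {
        var links = new List<ILinkInfo>();
        foreach (Match m in Anchor.Matches(html))
        {
            // Resolve relative hrefs against the page that contained them.
            var target = new Uri(pageUri, m.Groups[1].Value);
            links.Add(new AnchorLinkInfo
            {
                Uri = target,
                // Strip nested tags so only the visible anchor text remains.
                LinkText = Regex.Replace(m.Groups[2].Value, "<[^>]+>", "").Trim(),
                // Assumption: "points to an image" is judged by file extension.
                IsImage = Regex.IsMatch(target.AbsolutePath,
                    "\\.(png|jpe?g|gif)$", RegexOptions.IgnoreCase)
            });
        }
        return links;
    }
}
```

A crawler that only needs the Uri can keep reading ILinkInfo.Uri, while a report generator can cast to the richer type to get the link text and image flag.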

I don't know if this is the best way of implementing the described 
functionality, but I have made an attempt at least, so just let me know if you 
like it :) I haven't tested it, but I assume it will work just fine :)

Modified files are attached.

Original comment by d.st...@gmail.com on 19 Dec 2013 at 7:53

Attachments:

LinkInfo.zip

GoogleCodeExporter commented 9 years ago

FYI, v1.2.3 already has a PageToCrawl.PageBag of dynamic expando type.
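
As a rough illustration of how a dynamic expando bag can carry per-link metadata, here is a minimal sketch. PageToCrawlSketch is a stand-in written for this example, not Abot's actual PageToCrawl class, and the LinkText and IsImage keys are hypothetical names.

```csharp
using System;
using System.Dynamic;

// Stand-in for a page queued for crawling; only the members used here.
public class PageToCrawlSketch
{
    public PageToCrawlSketch(Uri uri)
    {
        Uri = uri;
        PageBag = new ExpandoObject();
    }

    public Uri Uri { get; private set; }

    // A dynamic ExpandoObject accepts arbitrary members at runtime.
    public dynamic PageBag { get; private set; }
}

public static class PageBagSketch
{
    // Attaches link metadata to the page before it is queued.
    public static PageToCrawlSketch Schedule(Uri uri, string linkText, bool isImage)
    {
        var page = new PageToCrawlSketch(uri);
        page.PageBag.LinkText = linkText; // hypothetical key
        page.PageBag.IsImage = isImage;   // hypothetical key
        return page;
    }
}
```

The values can be read back later with the same member names, for example `(string)page.PageBag.LinkText`. Reading a member that was never set throws a runtime binder exception, so consumers need to agree on the keys.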

I'll take a look at your impl and get back to you. Thanks again.

Original comment by sjdir...@gmail.com on 19 Dec 2013 at 6:04

GoogleCodeExporter commented 9 years ago

As of right now, I don't think I will pull your changes into the product, for 
the reasons I stated above. However, I may change my position in the future. 
Thanks for offering your implementation. Your time is appreciated.

Original comment by sjdir...@gmail.com on 30 Dec 2013 at 3:12

Changed state: WontFix

divyang4481 / abot

Ability to add metadata to crawl queue #122