Letractively / flaxcrawler

Automatically exported from code.google.com/p/flaxcrawler

Consider using CrawlerTasks instead of raw URL objects on the Page object #5

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Consider using CrawlerTasks instead of raw URL objects for tasks. This will
give the crawler the ability to:

1) Download tasks that are POST requests to web forms with specific POST
params.
2) Send requests with custom headers to some websites.

And the last and most important:
3) At the moment, if the Parser wants to add custom information about a URL,
there is no way to do it, because it sets a standard java.net.URL on the Page
object.

Here is a patch that adds support for tasks to the Page object. It also keeps
track of the seed each task came from.
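
To make the idea concrete, here is a minimal sketch of what such a task object could look like. It is not the code from the attached patch: the class layout, the factory methods `get` and `post`, and the accessors `header` and `attribute` are all hypothetical names chosen for illustration. It only shows how a single object can carry the URL, the request method, POST params, custom headers, parser-supplied information, and the originating seed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a task object. Every name here is an assumption
// made for illustration, not the API introduced by the attached patch.
final class CrawlerTask {
    private final URL url;
    private final URL seed;       // seed URL this task was discovered from
    private final String method;  // "GET" or "POST"
    private final Map<String, String> postParams;
    private final Map<String, String> headers = new LinkedHashMap<>();
    private final Map<String, Object> attributes = new LinkedHashMap<>();

    private CrawlerTask(String url, String seed, String method,
                        Map<String, String> postParams) {
        this.url = parse(url);
        this.seed = parse(seed);
        this.method = method;
        this.postParams = new LinkedHashMap<>(postParams); // defensive copy
    }

    // Plain download of a URL, as the crawler does today.
    static CrawlerTask get(String url, String seed) {
        return new CrawlerTask(url, seed, "GET", Collections.emptyMap());
    }

    // Form submission with specific POST params (point 1).
    static CrawlerTask post(String url, String seed, Map<String, String> params) {
        return new CrawlerTask(url, seed, "POST", params);
    }

    // Custom request header for sites that need one (point 2).
    CrawlerTask header(String name, String value) {
        headers.put(name, value);
        return this;
    }

    // Custom information a Parser wants to attach to the URL (point 3).
    CrawlerTask attribute(String key, Object value) {
        attributes.put(key, value);
        return this;
    }

    URL getUrl() { return url; }
    URL getSeed() { return seed; }
    String getMethod() { return method; }
    Map<String, String> getPostParams() { return Collections.unmodifiableMap(postParams); }
    Map<String, String> getHeaders() { return Collections.unmodifiableMap(headers); }
    Map<String, Object> getAttributes() { return Collections.unmodifiableMap(attributes); }

    // Wraps the checked exception so callers can build tasks inline.
    private static URL parse(String spec) {
        try {
            return URI.create(spec).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + spec, e);
        }
    }
}
```

With something along these lines, a Parser could put CrawlerTask instances on the Page instead of java.net.URL instances, and the downloader could read the method, params, and headers from the task when building the request.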

Original issue reported on code.google.com by nikolavp@gmail.com on 22 Jun 2011 at 10:58

Attachments: