ericvana / abot

Automatically exported from code.google.com/p/abot
Apache License 2.0

It's great, but it can't save the session (it's very important) #114

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Add the ability to save a session and load it again later.
2. The session could be stored in a SQL Compact DB.

If I run the crawler and it crawls 2000 pages, when I close the program the crawled
pages are not saved and I have to recrawl them later.

It's great, but if you add this ability the crawler will be perfect!

Original issue reported on code.google.com by smsghase...@gmail.com on 29 Jul 2013 at 5:53

GoogleCodeExporter commented 9 years ago
Hi,

Firstly, thanks for the great project!

I've been customising your code to do just that. I want to share my experience with
you and the community.

My approach is to modify the SchedulePageLinks(CrawledPage crawledPage) method in
WebCrawler so that it fires an async event when a link is added to the
scheduler. The async event handler then keeps track of the scheduled links in the
database.

Also, in my WebCrawler constructor, I initialise the scheduler so that it reads
from the db the links that were scheduled previously.

I also added a CrawlDecisionMaker to decide whether a link should be scheduled.
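
A rough sketch of the idea, in case it helps. To keep the snippet self-contained I've
shown it as a scheduler decorator rather than my actual SchedulePageLinks override, and
the ISchedulerLike interface and PendingLinkRepository are stand-ins of my own (back the
repository with SQL Compact, SQLite, or whatever you prefer) rather than abot's API:

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for abot's IScheduler so the snippet stands alone;
// in the real code you would wrap abot's own scheduler interface instead.
public interface ISchedulerLike
{
    int Count { get; }
    void Add(Uri pageUri);
    Uri GetNext();
}

// Illustrative persistence layer - back it with SQL Compact, SQLite, etc.
public class PendingLinkRepository
{
    private readonly HashSet<Uri> _pending = new HashSet<Uri>();
    public void Save(Uri uri)   { _pending.Add(uri);    /* INSERT into db */ }
    public void Remove(Uri uri) { _pending.Remove(uri); /* DELETE from db */ }
    public IEnumerable<Uri> LoadPending() { return new List<Uri>(_pending); /* SELECT from db */ }
}

// Decorator that mirrors every scheduled link into the repository and raises
// an event, so an interrupted crawl can be resumed after a restart.
public class PersistentScheduler : ISchedulerLike
{
    private readonly ISchedulerLike _inner;
    private readonly PendingLinkRepository _repo;

    public event Action<Uri> LinkScheduled;

    public PersistentScheduler(ISchedulerLike inner, PendingLinkRepository repo)
    {
        _inner = inner;
        _repo = repo;

        // Re-queue whatever was still pending when the previous run stopped.
        foreach (var uri in _repo.LoadPending())
            _inner.Add(uri);
    }

    public int Count { get { return _inner.Count; } }

    public void Add(Uri pageUri)
    {
        _inner.Add(pageUri);
        _repo.Save(pageUri);               // persist the newly scheduled link
        var handler = LinkScheduled;       // raised asynchronously in my real code
        if (handler != null) handler(pageUri);
    }

    public Uri GetNext()
    {
        var next = _inner.GetNext();
        if (next != null) _repo.Remove(next);   // no longer pending once handed out
        return next;
    }
}
```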

Happy to share my code with you guys if you like.

Cheers,
Andrew

Original comment by andrewyo...@gmail.com on 17 Aug 2013 at 2:10

GoogleCodeExporter commented 9 years ago
Hello,
Thanks for your effort.
Can you add a pause and resume option?
These abilities are very important.
Additionally, a history check is a good idea (don't crawl pages again).
I'm waiting for your great work...
Thanks a lot.

Original comment by smsghase...@gmail.com on 18 Aug 2013 at 2:35

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 3 Sep 2013 at 2:54

GoogleCodeExporter commented 9 years ago
This task is partially complete since issue 92 was completed. Issue 92 gives
the ability to store crawled urls to disk; that's part 1. Part 2 is to store all
the context (counts, depths, etc.) along with the url db.
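
For anyone who can't wait for 2.0, part 2 can be approximated by snapshotting that extra
state yourself. A minimal sketch, assuming a simple JSON file saved next to the url db
(the property names here are illustrative, not abot's CrawlContext members):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Web.Script.Serialization;   // System.Web.Extensions; Json.NET works too

// Illustrative snapshot of the crawl state that the issue-92 url store doesn't cover.
public class CrawlStateSnapshot
{
    public int CrawledCount { get; set; }
    public Dictionary<string, int> DepthPerUrl { get; set; }
    public DateTime SavedAtUtc { get; set; }
}

public static class CrawlStateStore
{
    public static void Save(string path, CrawlStateSnapshot snapshot)
    {
        snapshot.SavedAtUtc = DateTime.UtcNow;
        File.WriteAllText(path, new JavaScriptSerializer().Serialize(snapshot));
    }

    public static CrawlStateSnapshot Load(string path)
    {
        if (!File.Exists(path))
            return null;   // first run: nothing to restore
        return new JavaScriptSerializer()
            .Deserialize<CrawlStateSnapshot>(File.ReadAllText(path));
    }
}
```

The idea would be to update the snapshot from the crawler's page-completed hook and call
Save periodically, and on startup have Load seed the counts and depths back in alongside
the urls restored from the url db.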

Original comment by sjdir...@gmail.com on 20 Jan 2014 at 4:02

GoogleCodeExporter commented 9 years ago
Moving this to release 2.0 since I don't want to hold up this release any longer.

Original comment by sjdir...@gmail.com on 20 Jan 2014 at 4:03

GoogleCodeExporter commented 9 years ago
Thanks for your effort.
I have some ideas that are not hard to implement but are important.
Unfortunately, I'm not a good enough programmer to write this code.

Original comment by smsghase...@gmail.com on 20 Jan 2014 at 2:40

GoogleCodeExporter commented 9 years ago

Original comment by sjdir...@gmail.com on 20 Jan 2014 at 3:50