ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Allow pausing crawl with DIR/pause #28

Open ivan opened 8 years ago

ivan commented 8 years ago

This is better than kill -STOP pid because

1) it allows grab-site to keep receiving control messages, once we implement those

2) it doesn't require looking up the pid or using pgrep

ivan commented 8 years ago

I don't even know if this is possible to implement nicely (i.e. without breaking any responses that are already being downloaded) with the wpull hooks that exist now.

12As commented 8 years ago

What about the wait_time hook? I believe it runs at the end, after the file has been added to the WARC. Here is a basic sketch. (I don't know the proper terminology, so I will use CG to refer to the event loop's equivalent of a thread.)

In the wait_time hook, a CG checks whether the pause file exists. If it does, the CG takes an appropriate lock, sets concurrency to 1, and spins in a while loop that calls an appropriate sleep function for as long as the file exists. Other CGs also check whether the file is there, but skip the locked section and return. When the pause file is deleted, the CG in the locked section sets the concurrency back to the value in the concurrency file, then releases the lock before returning the delay time.
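
To make the idea concrete, here is a rough Python sketch of that logic. It is not wired into wpull: the function name wait_time_with_pause, the set_concurrency callback, the read_concurrency helper, and the use of threading.Lock are all assumptions made for illustration. It also assumes that blocking inside the hook is acceptable, which may not hold for the real wpull hooks.

```python
import os
import threading
import time

# Hypothetical lock: ensures only one worker ("CG") waits on the pause file.
pause_lock = threading.Lock()


def read_concurrency(working_dir, default=2):
    """Read the desired concurrency from DIR/concurrency, or fall back to a default."""
    try:
        with open(os.path.join(working_dir, "concurrency")) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return default


def wait_time_with_pause(working_dir, delay, set_concurrency,
                         poll_interval=1.0, sleep=time.sleep):
    """Return the delay, but first block while DIR/pause exists.

    set_concurrency is a hypothetical callback that changes the number of
    active workers; it stands in for whatever the crawler actually exposes.
    """
    pause_path = os.path.join(working_dir, "pause")
    # Non-blocking acquire: workers that lose the race skip the locked section.
    if os.path.exists(pause_path) and pause_lock.acquire(blocking=False):
        try:
            set_concurrency(1)
            while os.path.exists(pause_path):
                sleep(poll_interval)
            # Pause file is gone: restore the value from the concurrency file.
            set_concurrency(read_concurrency(working_dir))
        finally:
            pause_lock.release()
    return delay
```

With something like this, creating DIR/pause would pause the crawl and deleting the file would resume it. Downloads already in flight on other workers would finish normally, since only the worker holding the lock blocks.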