NAMD / mediacloud_backend

MediaCloud backend repository

Delete directories created by httrack after scanning a URL #23

Closed: turicas closed this issue 10 years ago

turicas commented 11 years ago

I think the function url_scanner (in capture/urlscanner.py) is not properly deleting the files httrack creates: I've been running capture/extract_feeds.py for hours and there are now more than 100 directories here (named after website addresses).

The call that removes url with rm -rf is probably incorrect (maybe httrack is creating another directory, or the call should receive the absolute path rather than the URL as given).

Note: do we really need to use httrack? Are there options we can pass to it so that it prints the names of the files it creates (so we know the filenames for sure before deleting)?

fccoelho commented 11 years ago

Yes, this is quite annoying, but I couldn't find a way to get the URLs directly via stdout, so I am resorting to parsing the file it generates. After it runs we need to run rm -rf .br and rm -rf hts; a sketch of this cleanup follows below.
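
For reference, a minimal cleanup sketch in Python. The glob patterns below are my assumption about what the rm -rf commands above refer to (site-named directories and httrack's hts-* cache/log artifacts); the thread does not confirm the exact patterns:

    import glob
    import os
    import shutil

    # Hypothetical patterns: directories named after the crawled sites
    # (*.br) plus httrack's hts-* artifacts (hts-cache/, hts-log.txt).
    for path in glob.glob("*.br") + glob.glob("hts*"):
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
        elif os.path.isfile(path):
            os.remove(path)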

To replace httrack we need to find another crawler, which:

  1. Scans URLs recursively to a specified depth (resolving JavaScript-generated links too).
  2. Returns only the URLs visited instead of downloading them, to save bandwidth.
  3. Does that using multiple threads, for speed.
  4. Can handle timeouts and do retries gracefully.
  5. Returns the URLs directly on stdout, without generating files on disk.

Maybe we can code our own in Python, but until then... (a rough sketch of what that could look like is below).
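
To make the idea concrete, here is a minimal, hypothetical stdlib-only sketch covering points 1 (without JavaScript resolution), 2, and 5; threading (point 3) and retries (point 4) are left out for brevity:

    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(url, depth, seen=None):
        """Visit url and its links down to the given depth, printing
        each URL to stdout instead of saving anything to disk."""
        seen = seen if seen is not None else set()
        if url in seen:
            return
        seen.add(url)
        print(url)
        if depth <= 0:
            return
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            return  # a real version would retry here (point 4)
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            crawl(urljoin(url, link), depth - 1, seen)

    # Usage: crawl("http://example.com", depth=2)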

turicas commented 11 years ago

Is there any way to specify where httrack will save the files? If so, we can create a temporary directory, store everything there, and delete the whole directory at the end. We can use the tempfile module and shutil.rmtree for this job, as sketched below.
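
A minimal sketch of that idea, assuming httrack's -O option (output path) and -rN (mirror depth); the scan_url name and the parsing step are placeholders, not the actual code in capture/urlscanner.py:

    import shutil
    import subprocess
    import tempfile

    def scan_url(url, depth=2):
        # Create a throwaway directory, point httrack at it, and remove
        # the whole tree when done (the tempfile/shutil idea above).
        tmpdir = tempfile.mkdtemp(prefix="urlscanner-")
        try:
            subprocess.call(
                ["httrack", url, "-O", tmpdir, "-r%d" % depth, "-q"]
            )
            # ... parse the files httrack wrote under tmpdir ...
        finally:
            shutil.rmtree(tmpdir, ignore_errors=True)

This way the cleanup no longer depends on guessing which directories httrack created in the working directory.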

fccoelho commented 11 years ago

I think so. I'll check the man page and fix this.

fccoelho commented 10 years ago

done