Closed: turicas closed this issue 10 years ago.
Yes, this is quite annoying, but I couldn't find a way to get the URLs directly via stdout, so I am resorting to parsing the file it generates. After it runs we need to run `rm -rf .br` and `rm -rf hts`.
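For what it's worth, a minimal cleanup sketch for that step; the `.br` and `hts` names are taken from the commands above, and `cleanup_httrack_leftovers` is a hypothetical helper, so adjust the patterns if httrack leaves different entries on your setup:

```python
import glob
import os
import shutil

def cleanup_httrack_leftovers(workdir="."):
    """Remove the .br directory and any hts* entries left in workdir.

    The names (.br, hts*) come from the manual `rm -rf` commands above;
    they are not taken from httrack's documentation.
    """
    candidates = [os.path.join(workdir, ".br")] + glob.glob(os.path.join(workdir, "hts*"))
    for path in candidates:
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
        elif os.path.isfile(path):
            os.remove(path)
```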
To replace httrack we need to find another crawler, which:
Maybe we can code our own in Python, but until then...
Is there any way to specify where httrack will save the files? If so, we can create a temporary directory, store everything there and delete the whole directory at the end. We can use the `tempfile` and `shutil` modules (function `rmtree`) for this job.
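A minimal sketch of that approach, assuming httrack's `-O` option can point the mirror at a temporary directory (verify the exact flag against the man page); `mirror_to_tempdir` is a hypothetical helper, not code from this repository:

```python
import shutil
import subprocess
import tempfile

def mirror_to_tempdir(url):
    """Run httrack inside a throwaway directory and remove it afterwards."""
    tmpdir = tempfile.mkdtemp(prefix="httrack-")
    try:
        # -O points httrack's mirror/cache/log output at tmpdir instead of
        # the current working directory (assumption to confirm in the man page).
        subprocess.check_call(["httrack", url, "-O", tmpdir])
        # ... read whatever is needed from tmpdir here ...
    finally:
        shutil.rmtree(tmpdir, ignore_errors=True)
```

Running the mirror inside `mkdtemp()` would also make the cleanup question moot: whatever httrack names its subdirectories, `rmtree()` on the temporary directory removes them all.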
I think so. I'll check the man page and fix this.
done
I think the function `url_scanner` (in `capture/urlscanner.py`) is not properly deleting the files `httrack` created, since I've been running `capture/extract_feeds.py` for hours and there are more than 100 directories here (named after Web site addresses). Probably the call that removes `url` with `rm -rf` is incorrect (maybe `httrack` is creating another directory, or you should pass the absolute URL instead of the complete one).
Note: is it really necessary to use `httrack`? Are there options we can pass to it so that it reports on its output the names of the files it created (so we can know the filenames for sure before deleting)?
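Purely as a guess about the leftover directories: if httrack names the mirror directory after the host rather than after the literal URL string, removing `url` verbatim would miss it. A hedged sketch of a more defensive cleanup (the directory-naming behaviour is an assumption, not confirmed from httrack's docs, and `remove_mirror_dir` is a hypothetical helper):

```python
import os
import shutil
from urllib.parse import urlparse  # on Python 2: from urlparse import urlparse

def remove_mirror_dir(url, workdir="."):
    """Remove the directory httrack may have created for `url`.

    Tries both the raw URL string and the bare hostname, since which one
    httrack uses as the directory name is assumed here, not verified.
    """
    host = urlparse(url).netloc
    for candidate in filter(None, (url, host)):
        path = os.path.join(workdir, candidate)
        if os.path.isdir(path):
            shutil.rmtree(path, ignore_errors=True)
```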