DAMPEEU / DmpTools

Miscellaneous tools written for the DAMPE experiment

Crawler ability to resume a task #20

Open · david-droz opened 7 years ago

david-droz commented 7 years ago

Two issues:

  1. The crawler spends a significant amount of time writing to the output dictionary (JSON file). This means that, should the process be killed, there is a significant chance that the output file is still open at that moment, resulting in a corrupted file. (See the atomic-write sketch after this list.)

  2. The crawler method does not know whether a file has already been crawled. If, for example, a task has 35'000 files and 20'000 have already been analysed, running the crawler again will make it go through all 35'000; ideally it should only process the remaining 15'000. In the last pull request I added a script to work around that, but it may not be the optimal solution. (A resume sketch follows below.)
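
A common way to avoid the corruption in point 1 is to write the JSON atomically: dump to a temporary file in the same directory, then rename it over the old output, so a kill mid-write can only ever corrupt the temp file. This is only a sketch, assuming Python 3; `dump_json_atomic` is a hypothetical helper, since the crawler's actual write code isn't shown here:

```python
import json
import os
import tempfile

def dump_json_atomic(data, path):
    """Hypothetical helper: write `data` as JSON to `path` safely.

    The data is first written to a temporary file in the same directory,
    then moved into place with os.replace(), which is atomic on POSIX.
    If the process is killed mid-write, only the temp file is lost;
    the previous output file stays intact.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix='.tmp')
    try:
        with os.fdopen(fd, 'w') as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except Exception:
        os.remove(tmp_path)
        raise
```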
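
For point 2, one minimal approach is to filter the task list against whatever is already in the output file before crawling. This assumes the output dictionary is keyed by file name, which may not match the crawler's actual format; `remaining_files` is likewise hypothetical:

```python
import json
import os

def remaining_files(all_files, out_path):
    """Hypothetical helper: return only the files not yet crawled.

    Assumes the output JSON maps file names to crawl results, so its
    keys tell us what has already been analysed.
    """
    done = set()
    if os.path.exists(out_path):
        with open(out_path) as f:
            done = set(json.load(f).keys())
    return [name for name in all_files if name not in done]
```

With the example above, this would cut the 35'000-file task down to the remaining 15'000, and combined with the atomic write the output file can safely be reloaded after an interrupted run.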