thomasw21 closed 2 years ago
I've modified a few things:
The reason I check whether the save path exists is that I don't want to relaunch unnecessary jobs on JZ when I launch my huge array. I use that heuristic to determine whether a previous job has already done the work.
I've added a `--from-scratch` argument. Please use it when you develop so that you don't have to bother regenerating it all the time.
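For reference, here is a minimal sketch of how the save-path check and the `--from-scratch` flag could fit together. It is not the actual code from this PR: the `--save-path` argument and the `should_run`, `parse_args` and `main` names are hypothetical, and the sketch assumes `--from-scratch` simply bypasses the existence check.

```python
import argparse
from pathlib import Path


def should_run(save_path: Path, from_scratch: bool) -> bool:
    """Return True when the job has to run.

    Heuristic: if the save path already exists, assume a previous job
    produced it and skip, unless from_scratch bypasses the check.
    """
    return from_scratch or not save_path.exists()


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # Hypothetical argument name, used only for this illustration.
    parser.add_argument("--save-path", type=Path, required=True)
    # Flag named in the comment above; its exact behavior is assumed here.
    parser.add_argument("--from-scratch", action="store_true")
    return parser.parse_args(argv)


def main(argv=None) -> bool:
    """Return True if the job would run, False if it is skipped."""
    args = parse_args(argv)
    if not should_run(args.save_path, args.from_scratch):
        print(f"{args.save_path} already exists, skipping.")
        return False
    # The real processing (for example deduplication) would go here.
    print(f"Running job, output goes to {args.save_path}.")
    return True
```

With this shape, every task in a job array can call `main()` unconditionally: tasks whose output already exists return immediately, so relaunching the whole array only redoes the missing pieces.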
Deduplication works. Now it's time to fine-tune the parameters. Tested on `bigscience-catalogue-lm-data/lm_fr_pseudocrawl-filtered_530_www_mediapart_fr`.