Open amrragab8080 opened 3 years ago
I tried multiple wiki dumps; all of them result in the same error (just after a different number of processed docs).
I was able to diagnose the problem: this statement in your instructions is not thread-safe on certain file systems. Parallel `cat`s appending the JSON to the same file were leaving invalid JSON in the merged file.
find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 70 "cat {} >> mergedfile.json"
You can replace it with
find /lustre/data/wiki/text/ -name wiki* | parallel -m -j 1 "cat {} >> mergedfile.json"
without impacting performance.
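As a minimal sketch of the same single-writer idea without `parallel`: the function below (the name `merge_serial` is hypothetical, not part of the repo's instructions) lets one `xargs` pipeline write the merged file through a single redirect, so no two processes ever append at once. It assumes GNU `find`, `sort`, and `xargs` (for `-print0`, `-z`, and `-0 -r`), and it also quotes the `wiki*` glob so the shell cannot expand it before `find` sees it.

```shell
# merge_serial SRC_DIR OUT_FILE   (hypothetical helper, for illustration only)
# Concatenate every file named wiki* under SRC_DIR into OUT_FILE.
# All data goes through one redirect, so there is only ever one writer.
merge_serial() {
  local src_dir=$1 out_file=$2
  # -print0 / -0 keep unusual file names intact; sort -z makes the
  # concatenation order deterministic between runs.
  find "$src_dir" -type f -name 'wiki*' -print0 \
    | sort -z \
    | xargs -0 -r cat > "$out_file"
}
```

It would be called as `merge_serial /lustre/data/wiki/text mergedfile.json`. Note that it truncates the output file on each run instead of appending, so a rerun does not duplicate data.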
@amrragab8080 thanks for reporting. Could you submit a PR with the fix?
Running through the data prep stages, I am getting an error during the binary file creation step. Command line:
error: