pad output names of segment()

PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

I would like to propose a small change to the following line of the segment() function on the javamultithreading branch. Without it, the order of the CoNLL files can easily get mixed up when they are read back in later on:

      f <- file.path(file.path(dir, i, sprintf("%d.txt", chunks[[i]][["id"]][j])))

This line in segment() could be changed to pad the number in the filename with zeros, using the number of digits in chunksize as the width:

      f <- file.path(file.path(dir, i, sprintf("%0*d.txt", nchar(chunksize),  chunks[[i]][["id"]][j])))

I think this should make it easier to sort by filename and so ensure the correct order. Without padding, an alphabetical sort places 10.txt before 2.txt.
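To illustrate the ordering problem, here is a minimal sketch. The IDs and the width of 3 are made up for the example and are not taken from bignlp; `method = "radix"` is used only so that the sort does not depend on the locale.

```r
# Hypothetical chunk IDs, in the order they should be read back in.
ids <- c(1L, 2L, 10L, 11L, 100L)

# File names as currently generated, and zero-padded to a width of 3.
unpadded <- sprintf("%d.txt", ids)
padded   <- sprintf("%0*d.txt", 3L, ids)

# Sort both sets by name (radix sorting uses C-locale collation).
sorted_unpadded <- sort(unpadded, method = "radix")
sorted_padded   <- sort(padded, method = "radix")

# Only the padded names keep the numeric order after sorting.
identical(sorted_unpadded, unpadded)
identical(sorted_padded, padded)
```

The first comparison is FALSE and the second is TRUE, since only the padded names have the same order alphabetically and numerically.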

Edit: I chose the number of characters in chunksize because the scenario I described seemed most likely to occur in the very first chunk, which the approach above prevents. That need not hold in general, though. A more universal solution would be to pad all file names according to the nchar of the final ID in the input table.
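A rough sketch of that more universal variant might look as follows. The `ids` vector stands in for the ID column of the input table, and `padded_name()` is a hypothetical helper, not an existing bignlp function; it also assumes the final ID is the largest one.

```r
# Hypothetical stand-in for the IDs in the input table.
ids <- 1:1200

# Number of digits of the largest ID. sprintf("%d", ...) is used instead of
# as.character(), which can yield scientific notation for doubles such as 1e5.
width <- nchar(sprintf("%d", max(ids)))

# Hypothetical helper: zero-pad an ID to `width` digits and append ".txt".
padded_name <- function(id, width) sprintf("%0*d.txt", width, id)

files <- padded_name(ids, width)
head(files, 3)
tail(files, 1)
```

Because every name then has the same length, sorting by filename should reproduce the ID order regardless of which chunk a file belongs to.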

pad output names of segment() #18