PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

pad output names of segment() #18

Closed ChristophLeonhardt closed 3 years ago

ChristophLeonhardt commented 3 years ago

I propose a small change to the following part of the segment() function on the javamultithreading branch. Without it, the order of the CoNLL files can easily get mixed up when they are read back in later on:

      f <- file.path(file.path(dir, i, sprintf("%d.txt", chunks[[i]][["id"]][j])))

This line in segment() could be changed to pad the number in the filename with leading zeros, up to the number of digits in chunksize, for example:

      f <- file.path(file.path(dir, i, sprintf("%0*d.txt", nchar(chunksize), chunks[[i]][["id"]][j])))

I think this should make it easier to sort by filename to ensure the correct order.
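
The ordering problem can be reproduced with a few made-up ids. This is only a minimal sketch: the id values are hypothetical, and `sort(..., method = "radix")` stands in for whatever name-based ordering is applied when the files are read back in.

```r
# Hypothetical chunk ids (not taken from bignlp)
ids <- c(1L, 2L, 9L, 10L, 11L)

# Unpadded names, as produced by the current sprintf("%d.txt", ...) call
unpadded <- sprintf("%d.txt", ids)

# Zero-padded names; the width is passed through the "*" in the format
width <- 2L
padded <- sprintf("%0*d.txt", width, ids)

# Sort by file name; method = "radix" gives a locale-independent byte order
unpadded_sorted <- sort(unpadded, method = "radix")
padded_sorted <- sort(padded, method = "radix")

# Does sorting by name preserve the numeric order of the ids?
unpadded_ok <- identical(unpadded_sorted, unpadded)
padded_ok <- identical(padded_sorted, padded)
```

With unpadded names, "10.txt" sorts before "2.txt", so `unpadded_ok` is `FALSE`, while the padded names keep their numeric order.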

Edit: I chose the number of characters in chunksize because the scenario I described seemed most likely to occur in the very first chunk, which the approach above prevents. That need not hold, however. A more general solution would be to pad all file names to the nchar of the final ID in the input table.
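
To illustrate the difference between the two widths, here is a small sketch with invented values. The `chunksize` and `ids` below are assumptions chosen so that the ids have more digits than chunksize. This is not the code in segment().

```r
# Hypothetical values: ids grow beyond the number of digits in chunksize
chunksize <- 10L
ids <- c(5L, 99L, 100L, 150L)

# Width taken from chunksize (2 digits here)
by_chunksize <- sprintf("%0*d.txt", nchar(chunksize), ids)

# Width taken from the largest id (3 digits here)
by_max_id <- sprintf("%0*d.txt", nchar(max(ids)), ids)

# Check whether a name-based sort keeps the ids in numeric order
chunksize_ok <- identical(sort(by_chunksize, method = "radix"), by_chunksize)
max_id_ok <- identical(sort(by_max_id, method = "radix"), by_max_id)
```

Here `chunksize_ok` is `FALSE` because "100.txt" sorts before "99.txt", whereas padding to the width of the largest id keeps the order intact.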

ablaette commented 3 years ago

After the proliferation of branches, we are back on the dev branch for recent activity.

Good point, your suggestion is now included. Regarding your afterthought on nchar(chunksize), I agree that the nchar of the maximum id is the correct solution, and this is what I have implemented.

This approach requires the id to be an integer value. I have added a corresponding check on the column and amended the documentation accordingly. I can imagine scenarios in which it is more practical to work with a character vector of unique ids. It would be possible to implement this with minor modifications. It would be interesting to hear what you think.
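
For discussion, a rough sketch of what such a check and a character-id variant might look like. Both `check_ids()` and the lookup-table idea are hypothetical illustrations, not the check that was added to the package.

```r
# Hypothetical helper (not the bignlp implementation): validate the id column
check_ids <- function(id) {
  if (!is.integer(id)) stop("column 'id' is required to be an integer vector")
  if (anyNA(id)) stop("column 'id' must not contain NA values")
  if (anyDuplicated(id) > 0L) stop("column 'id' must contain unique values")
  invisible(TRUE)
}

# One possible way to support character ids (an assumption, for discussion):
# name the files by zero-padded position and keep a lookup table for the ids
char_ids <- c("doc_b", "doc_a", "doc_c")
pos <- seq_along(char_ids)
lookup <- data.frame(
  id = char_ids,
  file = sprintf("%0*d.txt", nchar(length(char_ids)), pos),
  stringsAsFactors = FALSE
)
```

With a lookup table of this kind, the file names would still sort in input order, and the original character ids could be restored after the files are read back in.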