Closed ChristophLeonhardt closed 3 years ago
After the proliferation of branches, we are back on the dev branch for recent activity.
Good point, your suggestion is included. Thinking about your afterthought on nchar(chunksize), I think the nchar of the maximum id is indeed the correct solution, and this is what I have implemented now.
This approach requires that the id is an integer value, so I included a corresponding check on the column and amended the documentation accordingly. I can imagine scenarios where it is more practical to work with a character vector of unique ids. It would be possible to implement this with minor modifications. It would be interesting to hear what you think.
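A minimal sketch of the idea described above, not the code that was actually committed: the helper name `padded_filenames()`, the `chunk_` prefix and the `.txt` extension are hypothetical and only serve the illustration. It checks that the ids are whole numbers and pads the first id of each chunk to the width of the largest id.

```r
# Hypothetical helper, not part of the package: builds one file name per chunk,
# zero-padded to the number of characters of the largest id.
padded_filenames <- function(ids, chunksize, prefix = "chunk_", ext = ".txt") {
  # The padding width is derived from the largest id, so ids must be whole numbers.
  if (!is.numeric(ids) || anyNA(ids) || any(ids %% 1 != 0)) {
    stop("ids must be integer values")
  }
  ids <- as.integer(ids)
  width <- nchar(as.character(max(ids)))
  # First id of every chunk (assumed naming scheme)
  starts <- ids[seq(1L, length(ids), by = chunksize)]
  paste0(prefix, formatC(starts, width = width, flag = "0", format = "d"), ext)
}

padded_filenames(1:25, chunksize = 10L)
```

A character vector of ids would fail the check in this sketch; supporting it would need a different rule for the padding width.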
I would propose a small change to the following part of the segment() function of the javamultithreading branch, because otherwise the order of the CoNLL files might easily get mixed up when they are read back in later on:
This part of segment() could be changed to pad the number in the filename with zeros according to the length of chunksize, such as:
I think this should make it easier to sort by filename to ensure the correct order.
Edit: The number of characters in chunksize was chosen because it seemed that the scenario I was describing would occur mostly in the very first chunk (which the approach above prevents), but that need not be true. A more universal solution would be to pad all file names according to the nchar of the final ID in the input table.
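The difference between the two padding widths can be illustrated with a small sketch. The ids, the chunk size and the file-name pattern below are made up for the example and are not taken from segment().

```r
# Hypothetical first ids of four chunks and a hypothetical chunk size
first_ids <- c(1L, 11L, 21L, 101L)
chunksize <- 10L

# Hypothetical file-name pattern, zero-padded to a given minimum width
name_files <- function(ids, width) {
  paste0("chunk_", formatC(ids, width = width, flag = "0", format = "d"), ".txt")
}

by_chunksize <- name_files(first_ids, width = nchar(chunksize))
by_final_id <- name_files(first_ids, width = nchar(max(first_ids)))

# method = "radix" compares bytewise (C locale), independent of the session locale
sort(by_chunksize, method = "radix")  # chunk_101.txt still sorts before chunk_11.txt
sort(by_final_id, method = "radix")   # same order as first_ids
```

With a width of nchar(chunksize), only ids shorter than the chunk size are padded, so longer ids can still break the order. With the width of the final id, all names have the same length and sort in id order.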