PolMine / cwbtools

Tools to create and manage CWB-indexed corpora

2^31 - 1 Limitation of writeBin() to write long integer vector #28

Closed ablaette closed 3 years ago

ablaette commented 3 years ago

There is a limitation of writeBin() I had not anticipated when writing a large vector to disk:

Error in writeBin(object = ids, size = 4L, endian = "big", con = corpus_file) : only 2^31 - 1 bytes can be written in a single writeBin() call
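One possible workaround on older R versions would be to split the vector and call writeBin() several times, so that no single call exceeds 2^31 - 1 bytes. The following is only a minimal sketch: `write_int_chunked()` and its `chunk_len` argument are hypothetical names, not part of cwbtools, and the demonstration uses a deliberately tiny chunk length.

```r
# Hypothetical helper (not part of cwbtools): write an integer vector in
# chunks so that no single writeBin() call exceeds 2^31 - 1 bytes.
write_int_chunked <- function(ids, con, size = 4L, endian = "big",
                              chunk_len = (2^31 - 1) %/% size) {
  n <- length(ids)
  from <- 1
  while (from <= n) {
    to <- min(from + chunk_len - 1, n)
    writeBin(object = ids[from:to], con = con, size = size, endian = endian)
    from <- to + 1
  }
  invisible(n)
}

# Small demonstration with an artificially tiny chunk length.
corpus_path <- tempfile()
out <- file(corpus_path, open = "wb")
write_int_chunked(1:10, con = out, chunk_len = 3)
close(out)

# Read the file back to confirm that the chunks form one contiguous stream.
inp <- file(corpus_path, open = "rb")
restored <- readBin(inp, what = integer(), n = 10L, size = 4L, endian = "big")
close(inp)
```

With the default `chunk_len`, each call writes at most 536870911 four-byte integers, which stays just below the limit reported in the error message.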

ablaette commented 3 years ago

Here is a small caveat to consider when working out the solution.

x_max <- 798424101
x <- 1:x_max

max_id <- ((2^31 - 1) / 4) - 1

tail(x[536870911:length(x)])
tail(x[(max_id + 1L):length(x)])
tail(x[seq.int(from = max_id + 1L, to = length(x), by = 1L)])

You'd expect the last three lines to give the same result. But they do not.
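The likely reason is that max_id is not a whole number: (2^31 - 1) / 4 - 1 evaluates to 536870910.75. A sequence that starts at a fractional value keeps the fraction at every step, and R truncates fractional indices, so the sequence stops one element short of length(x). A scaled-down sketch of the same effect, with made-up small numbers:

```r
x <- 1:10
from <- 7.75  # fractional start, analogous to max_id + 1L above

# Both sequences are 7.75, 8.75, 9.75; as indices they are truncated to 7, 8, 9.
via_colon <- x[from:length(x)]
via_seq <- x[seq.int(from = from, to = length(x), by = 1L)]

# Rounding the start down first gives the full range 7, 8, 9, 10.
via_floor <- x[floor(from):length(x)]

print(via_colon)  # 7 8 9
print(via_floor)  # 7 8 9 10
```

So any threshold derived from 2^31 - 1 should be made a whole number (for example with floor() or %/%) before it is used to build indices.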

ablaette commented 3 years ago

See this: https://github.com/paws-r/paws/issues/242
And this: https://github.com/HenrikBengtsson/Wishlist-for-R/issues/97

ablaette commented 3 years ago

Starting with R version 4.0.0, writeBin() supports long vectors, see https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

Something like this could be inserted in the code:

if (getRversion() < R_system_version("4.0.0")) {
  # largest number of 4-byte integers a single writeBin() call can write
  max_len <- floor((2^31 - 1) / 4)
  if (length(x) > max_len) warning("writing will fail, update to R 4.0.0 or higher")
}

ablaette commented 3 years ago

The p_attribute_encode() function now checks the R version and stops with an informative message if R is older than 4.0.0 and the token stream of a large corpus is too long to be written to disk.
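Roughly, such a check could look like the sketch below. This is not the actual code in p_attribute_encode(): the function name `check_writebin_limit()` and its arguments are hypothetical, and the R version is passed in as an argument only so that the behaviour can be tried out on any installation.

```r
# Hypothetical check (not the actual cwbtools implementation): stop early if
# n values of `size` bytes cannot be written by one writeBin() call on R < 4.0.0.
check_writebin_limit <- function(n, size = 4L, r_version = getRversion()) {
  max_len <- (2^31 - 1) %/% size  # whole number of values that fit in 2^31 - 1 bytes
  if (r_version < numeric_version("4.0.0") && n > max_len) {
    stop(sprintf(
      "Cannot write %.0f values with R %s: writeBin() is limited to 2^31 - 1 bytes per call. Please update to R 4.0.0 or higher.",
      n, as.character(r_version)
    ))
  }
  invisible(TRUE)
}

# The corpus size from the example above passes on R 4.0.0 but not on R 3.6.3.
ok_new <- check_writebin_limit(798424101, r_version = numeric_version("4.0.0"))
res_old <- try(
  check_writebin_limit(798424101, r_version = numeric_version("3.6.3")),
  silent = TRUE
)
```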