ahmetaa / zemberek-nlp

NLP tools for Turkish.

Exceeded 32 bits when ngram count * blockSize is bigger than Integer.MAX_VALUE #176

Closed · bojie closed this 6 years ago

bojie commented 6 years ago

In the lm module, when reading a big ngram model whose ngram count is larger than Integer.MAX_VALUE / blockSize, the int product count * blockSize overflows and the load crashes.
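As a quick illustration of the overflow (a standalone sketch; the values are hypothetical, not taken from a real model):

    public class OverflowDemo {
      public static void main(String[] args) {
        int count = 300_000_000;  // hypothetical ngram count
        int blockSize = 8;        // hypothetical block size in bytes
        System.out.println(count * blockSize);        // prints -1894967296 (wrapped)
        System.out.println((long) count * blockSize); // prints 2400000000 (correct)
      }
    }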

From the code, in the constructor of GramDataArray:

    while (l < count * blockSize) {   // int overflow: count * blockSize is evaluated in 32 bits
      pageCounter++;
      l += (pageLength * blockSize);
    }
    data = new byte[pageCounter][];
    int total = 0;   // also overflows once more than Integer.MAX_VALUE bytes have been counted
    for (int i = 0; i < pageCounter; i++) {
      if (i < pageCounter - 1) {
        data[i] = new byte[pageLength * blockSize];
        total += pageLength * blockSize;
      } else {
        data[i] = new byte[count * blockSize - total];   // same 32-bit overflow here
      }
      dis.readFully(data[i]);
    }

It could be corrected to:

    while (l < (long) count * blockSize) {   // promote to long before multiplying; l (not shown) should also be a long
      pageCounter++;
      l += (pageLength * blockSize);
    }
    data = new byte[pageCounter][];
    long total = 0;   // long, since the total byte count can pass Integer.MAX_VALUE
    for (int i = 0; i < pageCounter; i++) {
      if (i < pageCounter - 1) {
        data[i] = new byte[pageLength * blockSize];
        total += pageLength * blockSize;
      } else {
        // remaining bytes are at most pageLength * blockSize, so the cast back to int is safe
        data[i] = new byte[(int) ((long) count * blockSize - total)];
      }
      dis.readFully(data[i]);
    }
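
Equivalently, the counting loop can be replaced by ceiling division done entirely in long arithmetic; a minimal sketch (not zemberek's API, the method name is made up):

    // Sketch: number of pages needed for count blocks of blockSize bytes,
    // split into pages of pageLength blocks each. All arithmetic in long.
    static int pageCount(int count, int blockSize, int pageLength) {
      long totalBytes = (long) count * blockSize;
      long fullPageBytes = (long) pageLength * blockSize;  // size of one full page in bytes
      return (int) ((totalBytes + fullPageBytes - 1) / fullPageBytes);  // ceiling division
    }

The last page then holds (long) count * blockSize - (long) (pageCount - 1) * pageLength * blockSize bytes, which is at most pageLength * blockSize and therefore always fits in an int; that is why the cast in the else branch above is safe.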

Please check.

Br Bojie

ahmetaa commented 6 years ago

Thanks Bojie, nicely spotted. Also, if applicable, try entropy pruning (with SRILM etc.) to reduce the gram counts.
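
For reference, with SRILM's ngram tool such a pruning run might look like the line below (the -prune threshold and file names are placeholders to adjust per model):

    ngram -order 3 -lm big.lm -prune 1e-8 -write-lm pruned.lm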