PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

Potential integer column type of parsed CoNLL output #34

Closed: ablaette closed this issue 2 years ago

ablaette commented 3 years ago

This documents an imminent bug fix. While annotating a corpus, I saw the following error:

Error in `[.data.table`(x, , { : Column 2 of result for group 4421 is type 'integer' but expecting type 'character'. Column types must be consistent for each group.

The error is triggered by a piece of text that consists only of the character string "-9458". It yields the following CoNLL output:

"1\t-9458\t\tNUM\tO\t\t_\n\n"

The CoNLL output is parsed with the read.table() function, which guesses the type of each column. In this case, the second column (the token) is guessed to be an integer vector, whereas it is a character vector for every other group. This mismatch causes the error later on.

This is easy to fix with the argument colClasses:
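The type guessing can be reproduced in isolation with base R. This is a minimal sketch that uses the CoNLL string quoted above and the same read.table() arguments as the parser, only without colClasses; the variable names `guessed` and `col_types` are chosen here for illustration.

```r
# The one-token CoNLL output quoted above
x <- "1\t-9458\t\tNUM\tO\t\t_\n\n"

# Parse without colClasses, so read.table() has to guess each column type
guessed <- read.table(
  text = x,
  blank.lines.skip = TRUE,
  header = FALSE,
  sep = "\t", quote = "", comment.char = ""
)

# Inspect the guessed type of every column; V2 holds the token "-9458"
col_types <- vapply(guessed, class, character(1))
print(col_types)

# The token column was converted to a number, not kept as text
is.integer(guessed$V2)
```

Because the token looks like a number, read.table() converts it, and the column no longer matches the character columns produced for other groups.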

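dt <- as.data.table(
  read.table(
    text = x,
    blank.lines.skip = TRUE,
    header = FALSE,
    sep = "\t", quote = "", comment.char = "",
    # Declare the column types explicitly so that a numeric-looking token
    # such as "-9458" is not guessed to be an integer
    colClasses = c("integer", "character", "character", "character", "character", "character", "character")
  )
)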
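To check that fixed column types make the per-chunk results consistent, here is a small base R sketch. The helper `parse_conll()` and the first input chunk ("Hello") are hypothetical and only serve as illustration; the second chunk is the CoNLL string from the report.

```r
# Same column types as in the fix: token id as integer, the rest as text
col_classes <- c("integer", rep("character", 6))

# Hypothetical helper wrapping the read.table() call with colClasses
parse_conll <- function(x) {
  read.table(
    text = x,
    blank.lines.skip = TRUE,
    header = FALSE,
    sep = "\t", quote = "", comment.char = "",
    colClasses = col_classes
  )
}

chunks <- c(
  "1\tHello\t\tINTJ\tO\t\t_\n\n",  # made-up chunk with an ordinary token
  "1\t-9458\t\tNUM\tO\t\t_\n\n"    # chunk from the report
)

parsed <- lapply(chunks, parse_conll)

# The token column has the same type in every chunk
types <- vapply(parsed, function(df) class(df$V2), character(1))
print(types)

# With consistent types, the chunks combine without type conflicts
combined <- do.call(rbind, parsed)
print(combined$V2)
```

The token "-9458" stays a string, so every chunk has the same column types regardless of what the text contains.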
ablaette commented 2 years ago

I also encountered this bug when inmemory = FALSE. The error message is not very informative:

Error in rbindlist(dts) : 
  Internal error: column 3 of result is determined to be integer64 but maxType=='character' != REALSXP

Specifying the expected colClasses in the data.table::fread() call of the worker helps!
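The worker code itself is not shown in this thread, so the following is only a sketch of the idea. It assumes the worker reads tab-separated CoNLL output with seven columns. The helper `read_chunk()` and the sample lines are hypothetical.

```r
library(data.table)

# Assumed layout: token id as integer, the remaining six columns as text
col_classes <- c("integer", rep("character", 6))

# Hypothetical worker helper: read one chunk of CoNLL lines with fixed types
read_chunk <- function(lines) {
  fread(
    text = lines,
    sep = "\t",
    header = FALSE,
    quote = "",
    colClasses = col_classes
  )
}

# Made-up chunks; the second one contains numeric-looking tokens only
chunk_a <- c("1\tHello\t\tINTJ\tO\t\t_", "2\tworld\t\tNOUN\tO\t\t_")
chunk_b <- c("1\t-9458\t\tNUM\tO\t\t_", "2\t12\t\tNUM\tO\t\t_")

dts <- lapply(list(chunk_a, chunk_b), read_chunk)

# Every chunk has a character token column, so rbindlist() sees one type
dt <- rbindlist(dts)
print(vapply(dts, function(d) class(d$V2), character(1)))
```

With colClasses set, fread() no longer has to detect the type of each column per chunk, so numeric-looking tokens cannot end up as integer or integer64 in one chunk and character in another.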

ablaette commented 2 years ago

This has been fixed. As a side effect, the fix also brings an unexpected performance improvement when processing the data!