PolMine / bignlp

Tools to process large corpora line-by-line and in parallel mode

corenlp_parse_conll | Hashtags in input cause trouble in read.table #27

Open ChristophLeonhardt opened 3 years ago

ChristophLeonhardt commented 3 years ago

Problem

If there is a literal hashtag ("#") in the input of corenlp_parse_conll(), read.table() treats it as the start of a comment and discards everything from the "#" to the end of the line.

https://github.com/PolMine/bignlp/blob/872ff58c489c994c28395d4347deef6376915245/R/output.R#L117
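For illustration only (plain base R, not bignlp code): with the default comment.char = "#", read.table() silently drops the rest of the line after a "#", so a line that contains a hashtag token ends up with too few fields.

# Minimal sketch of the default comment.char behaviour, independent of bignlp.
# Line 2 loses everything after "#", so it no longer has 3 fields.
read.table(text = "1\tfoo\tbar\n2\t#foo\tbar", sep = "\t", quote = "")
# fails with an error like: line 2 did not have 3 elements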

Example

Let the sentence "Some unannotated example text with a hashtag #HashtagExample" be the input for corenlp_parse_conll(). If it were actually generated by bignlp, the intermediate CoNLL output would look like this:

x <- "1\tSome\t_\t_\t_\t_\t_\n2\tunannotated\t_\t_\t_\t_\t_\n3\texample\t_\t_\t_\t_\t_\n4\ttext\t_\t_\t_\t_\t_\n5\twith\t_\t_\t_\t_\t_\n6\ta\t_\t_\t_\t_\t_\n7\thashtag\t_\t_\t_\t_\t_\n8\t#HashtagExample\t_\t_\t_\t_\t_\n\n"

This causes trouble: "#HashtagExample" is treated as a comment instead of an ordinary token, so the affected line no longer has the expected number of columns.

library(data.table)

dt <- as.data.table(
  read.table(text = x, blank.lines.skip = TRUE, header = FALSE, sep = "\t", quote = "")
)

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 8 did not have 7 elements

Potential Solution

I guess the solution is to turn off comment handling altogether by setting comment.char = "" in read.table():

dt <- as.data.table(
  read.table(text = x, blank.lines.skip = TRUE, header = FALSE, sep = "\t", quote = "", comment.char = "")
)
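As a quick sanity check (assuming data.table is loaded and x is defined as above), the hashtag token should now survive the parse:

# Verify the fix: all eight tokens are kept and the literal hashtag stays
# in the token column.
nrow(dt)    # 8 rows: line 8 is no longer swallowed as a comment
dt[[2]][8]  # "#HashtagExample" (character in R >= 4.0, a factor level before)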