PolMine / GermaParl

GermaParl R Data Package

Result is not equivalent when using UTF-8 instead of latin1 #11

Closed KevinGlock closed 3 years ago

KevinGlock commented 5 years ago

Hi,

I created a partition from GermaParl:

library(polmineR)

coi <- partition("GERMAPARL",
                 interjection = FALSE,
                 encoding = "UTF-8",
                 p_attribute = c("word", "lemma"),
                 role = c("mp", "government"))

When I used kwic():

kwic(coi, query = '".*[Aa]us.*bürger.*"')

R returns a warning message:

... getting corpus positions
... no matches for query (or no matches left after applying stoplist/positivelist)
NULL
Warning message:
In .local(.Object, ...) : No hits for query ".*[Aa]us.*bürger.*" (returning NULL)

When I used the latin1 encoding instead of UTF-8, the query returned 73 hits:

... getting corpus positions
... number of hits: 73
... checking that all p-attributes are available
... getting token id for p-attribute: word
... generating contexts

Because of the encoding, this is a problem for further workflows such as highlighting text, and also for reading the text.

ablaette commented 3 years ago

I think this is based on a misunderstanding of the argument encoding. You can use it to state the encoding of the corpus (latin1 in the case of GermaParl). Stating an encoding that differs from the one the corpus "really" has necessarily leads to broken output.
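
To illustrate why a mismatched encoding can turn 73 hits into none, here is a minimal base R sketch. It does not use polmineR and is only an analogy for what presumably happens when the query is encoded differently from the corpus data. The example word "Ausbürgerung" and all variable names are made up for illustration.

```r
# "ü" is written as a Unicode escape so the script does not depend on the file encoding
u_umlaut <- "\u00fc"

# The same character is stored as different bytes in the two encodings
utf8_bytes   <- charToRaw(enc2utf8(u_umlaut))                              # two bytes: c3 bc
latin1_bytes <- charToRaw(iconv(u_umlaut, from = "UTF-8", to = "latin1"))  # one byte: fc
print(utf8_bytes)
print(latin1_bytes)

# A token as it would be stored in a latin1-encoded corpus (hypothetical example word)
token_latin1 <- iconv("Ausb\u00fcrgerung", from = "UTF-8", to = "latin1")

# Search string encoded as UTF-8 versus latin1
needle_utf8   <- enc2utf8("b\u00fcrger")
needle_latin1 <- iconv(needle_utf8, from = "UTF-8", to = "latin1")

# Byte-wise comparison: only the needle in the corpus encoding matches
hit_utf8   <- grepl(needle_utf8,   token_latin1, fixed = TRUE, useBytes = TRUE)
hit_latin1 <- grepl(needle_latin1, token_latin1, fixed = TRUE, useBytes = TRUE)
print(hit_utf8)    # FALSE
print(hit_latin1)  # TRUE
```

In practice this suggests either leaving the encoding argument out or setting it to the encoding the corpus actually has (latin1 for GermaParl), and converting the results to UTF-8 afterwards if a later step needs it.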