PolMine / polmineR

R-package for text mining with the Corpus Workbench (CWB) as backend

Warning: unknown encoding #21

Closed wiertz closed 7 years ago

wiertz commented 7 years ago

Hi, I managed to install the package, point the package to my local cwb registry directory and access the corpora. Yet when doing queries (e.g. kwic or count) I get the following warning message:

In getEncoding(.Object) : Please check encoding in the registry file (charset="..." provides unknown encoding) or provide encoding explicitly

The corpus is utf8 and has been encoded with the cwb-encode option -c utf8. The corpus registry file does not contain an active charset="utf8" line, only a commented-out line that is obviously meant to provide it (##:: charset = "utf8"). However, when I uncomment this line to "charset = 'utf8'", the corpus already fails to load when rcqp is activated, giving the following error message:

REGISTRY ERROR (/usr/local/share/cwb/registry/spiegel_online): Illegal corpus declaration -- no attributes defined REGISTRY ERROR (/usr/local/share/cwb/registry/spiegel_online): Parse Error.

Seems as if this could be an issue with either rcqp or cwb itself. Is it possible to actually "provide encoding explicitly" to polmineR, as the initial warning suggests, and if so how?

Thanks for your help - I am very much looking forward to seeing this package develop!

Best, Thilo

PolMine commented 7 years ago

Hi Thilo,

thanks for raising the issue. It would be best if you could post the registry file of your corpus, so that I can try to reproduce the error.

The getEncoding-method reads in the registry file, identifies the line that defines the encoding / charset, and checks whether it is an available encoding:

encoding <- sub('^.*charset\\s*=\\s*"(.+?)".*$', "\\1", encodingLine)
encoding <- toupper(encoding)
if (!encoding %in% iconvlist()){
    warning('Please check encoding in the registry file (charset="..." provides unknown encoding) or provide encoding explicitly')
}

To understand the issue, you can call the procedure used to get the encoding of the corpus directly. What do you see when you run the following commands:

foo <- RegistryFile$new(corpus = "YOURCORPUS")$getEncoding()
print(foo)

and / or

getEncoding("YOURCORPUS")

Finally, please make sure that you have installed the latest version of polmineR from the development branch:

devtools::install_github("PolMine/polmineR", ref = "dev")

Andreas

wiertz commented 7 years ago

Hi Andreas,

thanks for your response. I think I found the problem: RegistryFile$new(corpus = "YOURCORPUS")$getEncoding() returns "UTF8", but iconvlist() only contains "UTF-8". I tried to manually change the registry file from utf8 to utf-8, but then cqp throws an error.

I can't update through devtools as the package fails to install (devtools requires git2r, git2r install fails because of "Unsupported architecture"...). R shows version 0.7.3.

Thilo

PolMine commented 7 years ago

Hi Thilo, it's still odd. "UTF8" %in% iconvlist() is TRUE, so that's not the problem. I just wondered whether you might be working on Windows? Line endings may be messed up (from the perspective of CQP) when you edit the registry file. One potential solution is to read the registry file and save it again:

library(polmineR)
R <- RegistryFile$new("YOURCORPUS")
R$write()

Let me know whether that works. Andreas

wiertz commented 7 years ago

Actually I checked: "UTF8" %in% iconvlist() in my case is FALSE. The help page of iconvlist has the explanation:

The names of encodings and which ones are available are platform-dependent.

I'm on Mac, so probably that makes the difference.
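A quick way to see which spellings a given machine accepts is to test a few candidates against iconvlist(). This is only a minimal sketch in base R; the list of candidate spellings is chosen for illustration, and the result will differ between platforms:

```r
# Check which spellings of the UTF-8 charset name this platform's iconv knows.
# The candidate list is illustrative, not exhaustive.
candidates <- c("UTF8", "UTF-8", "utf8", "utf-8")

# iconvlist() returns the encoding names known on this system.
available <- candidates %in% iconvlist()
names(available) <- candidates

print(available)
```

Any spelling reported as FALSE will trigger the warning from getEncoding if it is what the registry file declares.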

PolMine commented 7 years ago

Ok, but that implies that line endings ("\n" on Mac/Linux, and "\r\n" on Windows) are not the problem. Would you mind sharing your registry file? When parsing the registry file, the corpus library can be fairly picky.
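If in doubt, line endings can be checked directly by looking at the raw bytes of a file. The following is a minimal sketch in base R; has_crlf is a made-up helper name, not part of polmineR, and the demo writes temporary files instead of touching a real registry:

```r
# Hypothetical helper: TRUE if the file contains Windows-style CRLF line endings.
has_crlf <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  n <- length(bytes)
  if (n < 2L) return(FALSE)
  # A carriage return (0x0d) directly followed by a line feed (0x0a) is a CRLF.
  any(bytes[-n] == as.raw(0x0d) & bytes[-1L] == as.raw(0x0a))
}

# Demo on two temporary files, one with CRLF and one with LF endings.
crlf_file <- tempfile()
lf_file <- tempfile()
writeBin(charToRaw("ID afd\r\nATTRIBUTE word\r\n"), crlf_file)
writeBin(charToRaw("ID afd\nATTRIBUTE word\n"), lf_file)

print(has_crlf(crlf_file))
print(has_crlf(lf_file))
```

To check an actual registry file, pass its path to has_crlf instead of the temporary files.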

wiertz commented 7 years ago

Below is the registry file. But as far as I can see, the different UTF namings explain the warning: the registry file is parsed correctly and getEncoding extracts "utf8", but since "UTF8" %in% iconvlist() is FALSE on my computer, it throws a warning. CQP seems to be happy with "utf8", but not with "utf-8", so if I manually change the registry file from "utf8" to "utf-8", I get an error from CQP. I guess that ideally getEncoding would handle the different UTF naming conventions.

My original message may be a bit confusing as I first tried something else which leads to a parsing error. Sorry if I wasn't quite clear on this. The original encoding line in the registry file reads

##::charset = "utf8" # character encoding of corpus data

And I assumed that "##" means the encoding info is not read. My first attempt was thus to uncomment the line (with or without also removing the "::"). However, CQP seems to expect the info to be in comments. It gives a parsing error if the comment marker is removed, and the corpus is not made available at all. So I changed the line back to its original state and tried RegistryFile$new(corpus = "YOURCORPUS")$getEncoding() as you suggested. This correctly returns "UTF8" (as the regular expression in getEncoding does not care about comments), but this is not recognised, as "UTF8" is not in my iconvlist().

##
## registry entry for corpus AFD
##

# long descriptive name for the corpus
NAME ""
# corpus ID (must be lowercase in registry!)
ID   afd
# path to binary data files
HOME /usr/local/share/cwb/data/afd
# optional info file (displayed by "info;" command in CQP)
INFO /usr/local/share/cwb/data/afd/.info

# corpus properties provide additional information about the corpus:
##::charset  = "utf8" # character encoding of corpus data
##:: language = "??"     # insert ISO code for language (de, en, fr, ...)

##
## p-attributes (token annotations)
##

ATTRIBUTE word
ATTRIBUTE pos
ATTRIBUTE lemma

##
## s-attributes (structural markup)
##

# <s> ... </s>
STRUCTURE s

# <p> ... </p>
STRUCTURE p

# <text id=".." source=".." file=".." url=".." date=".." year=".."> ... </text>
# (no recursive embedding allowed)
STRUCTURE text
STRUCTURE text_id              # [annotations]
STRUCTURE text_source          # [annotations]
STRUCTURE text_file            # [annotations]
STRUCTURE text_url             # [annotations]
STRUCTURE text_date            # [annotations]
STRUCTURE text_year            # [annotations]

# Yours sincerely, the Encode tool.

PolMine commented 7 years ago

As it seems, stating 'utf8' is not necessarily portable to macOS (see the documentation for iconvlist, ?iconvlist). To make the getEncoding method more robust, I added a line that converts 'utf8' to 'utf-8'. I hope it works now. Please get the slightly updated polmineR version (devtools::install_github("PolMine/polmineR", ref = "dev")) to check the bug fix.
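For illustration, such a conversion could look roughly like the sketch below. This is an assumption about its general shape, not the actual polmineR code, and normalize_charset is a made-up name:

```r
# Hypothetical helper: map the "utf8" spelling to the hyphenated "UTF-8" name.
normalize_charset <- function(x) {
  x <- toupper(x)
  # Only the exact UTF8 spelling is rewritten; other names pass through unchanged.
  sub("^UTF8$", "UTF-8", x)
}

print(normalize_charset(c("utf8", "utf-8", "latin1")))
# [1] "UTF-8"  "UTF-8"  "LATIN1"
```

The normalized name would only be used on the R side for the iconvlist() check; the registry file itself stays untouched.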

I still think that an alternative would have been to state the encoding in the corpus properties as follows:

:: charset = "utf-8"

When parsing your file with the RegistryFile class (a helper class in the polmineR package for extracting information from registry files), I realized that the whitespace characters in the sample registry file you provided deviate somewhat from the standard format. I changed the regular expression so that it handles irregular whitespace more robustly, but it's something you might consider as a potential source of errors.
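To illustrate what a whitespace-tolerant match can look like, here is a minimal sketch in base R. It is not the regular expression used in polmineR; extract_charset and its pattern are made up for this example:

```r
# Hypothetical parser for the charset corpus property of a registry file.
extract_charset <- function(lines) {
  # Allow any amount of whitespace (spaces, tabs) around "##::", "charset" and "=".
  pattern <- '^[[:space:]]*##::[[:space:]]*charset[[:space:]]*=[[:space:]]*"([^"]+)".*$'
  hit <- grep(pattern, lines, value = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  # Return the quoted value from the first matching line.
  sub(pattern, "\\1", hit[1L])
}

registry_lines <- c(
  "ID   afd",
  '##::charset  = "utf8" # character encoding of corpus data',
  '##:: language = "??"     # insert ISO code for language (de, en, fr, ...)'
)
print(extract_charset(registry_lines))
# [1] "utf8"
```

The two property lines in the example are copied from the registry file posted above; lines without a charset declaration yield NA.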

Please let me know whether the encoding error keeps occurring! (I hope it doesn't.)

wiertz commented 7 years ago

Hi Andreas,

with the update to polmineR it now works without warnings:

> count("AFD", '".+(K|k)rise"', cqp=T)
           query count         freq
1: ".+(K|k)rise"   170 0.0002121431

Just to test, I also tried again to change the registry file to ##:: charset = "utf-8". While the polmineR package successfully reads the info, CQP seems to be troubled. The same query from above yields:

> count("AFD", '".+(K|k)rise"', cqp=T)
CL: Error, unrecognised CorpusCharset in cl_string_validate_encoding.
CQP Error:
    Query includes a character or character sequence that is invalid
in the encoding specified for this corpus
Error in rcqp::cqi_query(corpus, "Hits", query) : 
  cqp returned error code #1281

After changing the line back to the cqp original ##:: charset = "utf8" everything works fine.

Thanks a lot for your help!

Thilo