PolMine / cwbtools

Tools to create and manage CWB-indexed corpora

Non breaking space in as.vrt() #39

Closed ablaette closed 2 years ago

ablaette commented 3 years ago

I encountered a somewhat strange behavior of as.vrt() throwing this warning:

input string ' ' cannot be translated to UTF-8, is it valid in 'ANSI_X3.4-1968'?

The snippet I used to reconstruct the issue was as follows:

cwbtools::as.vrt(xml2::read_xml("<xml>\nHello\u00A0bug!\n</xml>"))

It took me a while to find this out: the warning emerges when cwbtools is installed from GitHub in a Docker container before the locale is set. The warning is not issued if cwbtools is installed after the locale is set.
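For anyone trying to reproduce this: 'ANSI_X3.4-1968' in the warning is the codeset name for plain ASCII, which is typically what a session reports under the C/POSIX locale, i.e. when no locale has been set. A minimal sketch for checking which locale the R session is running under (base R only, not part of cwbtools; the variable names are just for illustration):

```r
# Inspect the character-type locale of the current R session.
ctype <- Sys.getlocale("LC_CTYPE")   # e.g. "C" or "en_US.UTF-8"
info  <- l10n_info()                 # list with MBCS, UTF-8 and Latin-1 flags
is_utf8 <- isTRUE(info[["UTF-8"]])   # TRUE only in a UTF-8 session

message("LC_CTYPE: ", ctype)
message("UTF-8 session: ", is_utf8)
```

Running this in the container before installing the package should show whether the session is already in a UTF-8 locale.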

Then, it is apparently this entry, which defines a character to remove, that causes the problem:

 c("\xC2\xA0", ""), # incompatible with XML

It targets the same character (the non-breaking space U+00A0, whose UTF-8 encoding is the bytes C2 A0) as this statement:

c("\u00A0", " "), # incompatible with XML

And this is why I removed \xC2\xA0 ...
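The equivalence of the two literals can be checked in base R. This is only a sketch with illustrative variable names. The remark about encoding marks in the comments is my understanding of how R parses escapes, and that the missing mark is what triggers the warning under an ASCII locale is an assumption, not something verified here:

```r
# Both literals denote the non-breaking space (U+00A0).
nbsp_bytes   <- "\xC2\xA0"  # raw bytes given via hex escapes
nbsp_unicode <- "\u00A0"    # Unicode escape

bytes_a <- charToRaw(nbsp_bytes)
bytes_b <- charToRaw(nbsp_unicode)
same_bytes <- identical(bytes_a, bytes_b)  # the byte sequences match

# The encoding marks may differ: the hex-escaped literal is taken to be in
# the native encoding, whereas the Unicode escape is marked as UTF-8.
enc_a <- Encoding(nbsp_bytes)
enc_b <- Encoding(nbsp_unicode)

print(bytes_a)
print(bytes_b)
print(same_bytes)
```

Since the bytes are the same, keeping only the "\u00A0" form loses nothing.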

ablaette commented 2 years ago

The issue report above explains how it was fixed, so I am closing it.