BurntSushi / erd

Translates a plain text description of a relational database schema to a graphical entity-relationship diagram.
The Unlicense

Problem with non US-ASCII characters #99

Status: Closed (johanstrand closed this 3 years ago)

johanstrand commented 3 years ago

I get the error message "hGetContents: invalid argument (invalid byte sequence)" when I try an ERD specification containing the Swedish characters Å, Ä or Ö. I do this using Kroki (https://kroki.io/), but I suspect the issue is with erd, since other diagram types, for instance Graphviz, work. A quick search suggests using hSetEncoding to avoid this problem.
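For readers unfamiliar with hSetEncoding, here is a minimal sketch of the idea, using only the base library. It is not erd's actual code: readFileUtf8 is a hypothetical helper, and the sketch assumes the input file is always UTF-8. The point is that the handle's encoding is set explicitly instead of being taken from the locale (LANG, LC_ALL).

```haskell
import System.Environment (getArgs)
import System.IO

-- Hypothetical helper (not part of erd): read a whole file as UTF-8,
-- regardless of the encoding implied by the current locale.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8            -- override the locale-derived encoding
  contents <- hGetContents h
  -- hGetContents is lazy: force the whole string before closing the handle.
  length contents `seq` hClose h
  return contents

main :: IO ()
main = do
  hSetEncoding stdout utf8       -- so non-ASCII output also works under LANG=C
  args <- getArgs
  case args of
    [path] -> readFileUtf8 path >>= putStr
    _      -> hPutStrLn stderr "usage: pass exactly one file path"
```

Without the two hSetEncoding calls, the same program would pick up whatever encoding the locale dictates, which is the situation in which the reported error can appear.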

kukimik commented 3 years ago

I've managed to reproduce the bug by building the Dockerfile used by Kroki and then passing a file containing non-ASCII characters:

docker run -v ~/file.er:/file.er cb9853b485bc /root/.local/bin/erd -i /file.er

mmzx commented 3 years ago

I built an image from the Dockerfile mentioned above by @kukimik and was able to reproduce the issue. However, using the following erd file:

[Person]
*nameÄÖ
height
weightÖ_Ξξ
`birth date ÄÖÄÖÄÖÄÖÄÖ`
+birth_place_id

[`Birth Place`]
*id
`birth city`
'birth state'
"birth country"
"lambda:  λ λ λ λ λ λ"

Person *--1 `Birth Place`

which contains additional Unicode characters (lambda and some other Greek ones ;) ) produces the expected output when erd is freshly compiled from source and executed on the same system where it was compiled.

The container will be very helpful for checking whether the recommended fix works.

kukimik commented 3 years ago

https://serokell.io/blog/haskell-with-utf8 is an interesting read.

I may try to fix this, maybe this week.

kukimik commented 3 years ago

Also, I've found that the problem does not come from the build environment. I can reproduce it using erd compiled on my machine using stack. I just need to change the current locale (I'm on Linux and using @mmzx's example file):

$ LANG=C erd -i file.er

erd: file.er: hGetContents: invalid argument (invalid byte sequence)

I've tried the simplest solution using with-utf8 (i.e. main = withUtf8 $ do ...) and it seems to work ok with -i file.er. The output (both written to files and to stdout) looks ok. However the following:

$ LANG=C erd < examples/simple.er 

fails with:

"<stdin>" (line 2, column 6):
unexpected '\65533'
expecting attribute

I was never strong with encodings; I need to understand what is going on here and what is the expected behaviour.
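A note on the '\65533' in that parse error: 65533 is the decimal code point of U+FFFD, the Unicode replacement character, which decoders typically emit in place of bytes they cannot decode. The sketch below only illustrates how such a character can arise; it is not a diagnosis of what erd or with-utf8 does with stdin. It uses the text and bytestring packages, and decodeLenient is a hypothetical name.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper: decode bytes as UTF-8, substituting U+FFFD
-- for every byte that is not part of a valid UTF-8 sequence.
decodeLenient :: BS.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

main :: IO ()
main = do
  -- 0xC4 is "Ä" in Latin-1, but on its own it is not valid UTF-8.
  let bad = BS.pack [0x2A, 0xC4, 0x0A]   -- '*', a stray byte, newline
  -- show renders non-ASCII characters as decimal escapes,
  -- which is why the parser message reads '\65533'.
  print (T.unpack (decodeLenient bad))
  print (fromEnum '\xFFFD')              -- prints 65533
```

If the parser sees U+FFFD, the bytes were most likely decoded with an encoding that does not match the input, so the encoding applied to stdin is a reasonable place to look.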

mmzx commented 3 years ago

I've also started looking into it. Right now, this place comes to mind, where the encoding is set for the very same purpose.

Later tonight I will give it a try.

mmzx commented 3 years ago

So far... I've just tried these experimentally.

When the LANG environment variable is unset, erd fails. Indeed, the LANG variable is not set in the above-mentioned Docker image when using the bash shell.

Perhaps I should first read the article about the with-utf8 package. :)

There is an alternative way, as I recall, but it involves using Data.Text.IO from the text package, which itself has a UTF-8 encoding function...
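A rough sketch of what a text-based alternative could look like, under the assumption that input is always UTF-8: read the raw bytes with Data.ByteString and decode them explicitly with decodeUtf8' from Data.Text.Encoding, so the locale never gets a say. readUtf8 is a hypothetical helper, not something erd defines, and the exact module used may differ from what was meant above.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import System.Environment (getArgs)
import System.IO (hPutStrLn, stderr)

-- Hypothetical helper: read raw bytes and decode them strictly as UTF-8.
-- Invalid input yields Left with a description instead of an exception.
readUtf8 :: FilePath -> IO (Either String T.Text)
readUtf8 path = do
  bytes <- BS.readFile path          -- no text decoding happens here
  return $ case decodeUtf8' bytes of
    Left err -> Left (show err)
    Right t  -> Right t

main :: IO ()
main = do
  args <- getArgs
  case args of
    [path] -> do
      result <- readUtf8 path
      case result of
        Left err -> hPutStrLn stderr ("not valid UTF-8: " ++ err)
        -- Report the length rather than echoing the text, so the output
        -- itself does not depend on the terminal's encoding.
        Right t  -> putStrLn ("decoded " ++ show (T.length t) ++ " characters")
    _ -> hPutStrLn stderr "usage: pass exactly one file path"
```

The trade-off is that the parser would then need to consume Text (or the result of T.unpack) instead of a String read through a locale-dependent handle.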