asciidoctor / asciidoctor-diagram

:left_right_arrow: Asciidoctor diagram extension, with support for AsciiToSVG, BlockDiag (BlockDiag, SeqDiag, ActDiag, NwDiag), Ditaa, Erd, GraphViz, Mermaid, Msc, PlantUML, Shaape, SvgBob, Syntrax, UMLet, Vega, Vega-Lite and WaveDrom.
http://asciidoctor.org
MIT License
436 stars 106 forks source link

ditaa and plantuml error #124

Closed thorstenkampe closed 7 years ago

thorstenkampe commented 8 years ago

asciidoctor-diagram 1.5.1 Asciidoctor 1.5.4 Ruby 2.2.5 on 64-bit Cygwin

ditaa and plantuml diagram generate an error: asciidoctor-diagram: ERROR: Failed to generate image: invalid byte sequence in UTF-8.

For instance

[ditaa]
----
                   +-------------+
                   | Asciidoctor |-------+
                   |  Diagram    |       |
                   +-------------+       | PNG out
                       ^                 |
                       | ditaa in        |
                       |                 v
 +--------+   +--------+----+    /----------------\
 |        | --+ Asciidoctor +--> |                |
 |  Text  |   +-------------+    |Beautiful output|
 |Document|   |   !magic!   |    |                |
 |     {d}|   |             |    |                |
 +---+----+   +-------------+    \----------------/
     :                                   ^
     |          Lots of work             |
     +-----------------------------------+
----

or

[plantuml]
--
class BlockProcessor
class PlantUmlBlock
BlockProcessor <|-- PlantUmlBlock
--
pepijnve commented 8 years ago

I just tried reproducing this on my machine, but I'm not seeing this error. Could you run locale in your cygwin terminal and add the output of that in a comment here?

thorstenkampe commented 8 years ago
LANG=de_DE.utf8
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_ALL=
pepijnve commented 8 years ago

I'm still not able to reproduce this. Could you attach an example file that triggers this problem to this issue? You should be able to do that by changing the extension of your asciidoc file to .txt and then dragging it into the comment box. Perhaps the file is not encoded in UTF-8 and that's what's causing this problem...

thorstenkampe commented 8 years ago

Please find attached file test.txt

mojavelinux commented 8 years ago

Attaching an example file to GitHub to test encodings doesn't work since the browser will automatically reencode. You have to zip it.

pepijnve commented 8 years ago

I was afraid that might happen. Thanks for the tip @mojavelinux

pepijnve commented 8 years ago

Looking at the example file, I don't expect encoding of the source file to be the issue. As far as I can tell it doesn't contain any non-ASCII (<= 127) characters. @thorstenkampe could you zip the file and attach that instead just to be 100% sure that the source file is not the issues?

mojavelinux commented 8 years ago

Note that Asciidoctor does not process (aka normalize) lines of an include for a file extension it does not recognize. This means that an extension that works with literal blocks, like Asciidoctor Diagram, may receive lines that are not encoded in UTF-8. If the extension is reading in the file itself, then obvious it will also be responsible for encoding those lines as well.

I recommend using the normalize_lines helper in Asciidoctor whenever you are working with input that you cannot verify is UTF-8. That helper forces the String to be UTF-8 and avoids all downstream encoding issues.

It's a requirement when working with content in Asciidoctor that it is encoded in UTF-8. It's way easier for everyone to follow that rule than to try to figure out how to manage mixed encodings.

mojavelinux commented 8 years ago

Btw, this is easy to test. We just need to put an extension source file in the test suite that is encoded in something other than UTF-8. Then make sure that it does not cause Asciidoctor Diagram to fail. I'm trying to put together a simple scenario that you can use.

pepijnve commented 8 years ago

For external files normalize_lines is already called. Not easy to provide a hyperlink on my phone. See FileSource at the bottom of extensions.rb.

mojavelinux commented 8 years ago

I want to clarify too that you won't see this on a Linux box that is configured to use en_US.utf8 or other locales that are deeply rooted in UTF8. The problem is that regardless of the encoding of the file, certain versions of Ruby may read the file in not as UTF8. Then Ruby (and other applications) freak out when the String does not claim to be in UTF8. That's why I use the strategy of forcing all the lines to UTF8 (simply marking them as UTF8, not actually doing any reencoding).

And we always seem to have this problem with German because it seems German operating systems don't use UTF-8 by default (or at least not all the applications are configured this way).

mojavelinux commented 8 years ago

I see it now.

The hard part is setting up a test case that fails. The easy part is fixing it ;)

thorstenkampe commented 8 years ago

Zipped test file attached test.zip

mojavelinux commented 8 years ago

I'm convinced this has nothing to do with file and everything to do with the locale settings in the operation system or Ruby installation.

Can you print the output from the following?

asciidoctor -v
thorstenkampe commented 8 years ago
Asciidoctor 1.5.4 [http://asciidoctor.org]
Runtime Environment (ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-cygwin]) (lc:UTF-8 fs:Windows-1252 in:- ex:UTF-8)
mojavelinux commented 8 years ago

This is likely the culprit:

fs:Windows-1252

Asciidoctor core works around this by force encoding all input to UTF-8, but other programs may not. It has to be addressed all the way down the stack to support this scenario.

mojavelinux commented 8 years ago

Usually if the filesystem encoding is Windows-1252, it means the input encoding is as well (listed only as - in the version output).

mojavelinux commented 8 years ago

You can kind of test this using ruby -EUTF-8:Windows-1252, but there's no real way to test the exact scenario on a Windows or OSX box that I can tell. You have to debug on Cygwin running on Windows with a regional (German) setting.

@thorstenkampe It would be helpful if you could modify the following line and print some more information about the source string.

https://github.com/asciidoctor/asciidoctor-diagram/blob/master/lib/asciidoctor-diagram/extensions.rb#L120

In particular:

warn source.code.encoding

and

warn source.code.inspect

But the real fix from the Windows side is to set the operating system to use UTF-8 encoding. This is mentioned in the README.

https://github.com/asciidoctor/asciidoctor#requirements

Based on my understanding of encodings, I consider it to be fundamentally wrong if an operating system is not using UTF-8 for the locale and the filesystem.

mojavelinux commented 8 years ago

Oh, and one more thing:

file test.txt
thorstenkampe commented 8 years ago
thorstenkampe commented 8 years ago

If I insert those two warnings before line 120 the output becomes for the plantuml diagram:

asciidoctor-diagram: ERROR: Failed to generate image: invalid byte sequence in UTF-8
UTF-8
"class BlockProcessor\nclass PlantUmlBlock\nBlockProcessor <|-- PlantUmlBlock"

Same for the ditaa diagram...

pepijnve commented 8 years ago

@thorstenkampe If I remember correctly this error is commit from the Java sub process when it tries to decode the text of the diagram code. I'll have a look at adding diagnostic logging code around this so we can figure out where exactly things are going wrong. Once that's done I'll make a prerelease build of asciidoctor-diagram that you can test with.

mojavelinux commented 8 years ago

Good job modifying the code to add the warnings.

I think there is a deeper issue here with regard to how the file is being written to disk. The next step is to get the line number of the error. Could you also add the following line:

warn e.backtrace

chcp 65001 sets the terminal character set to UTF-8 in cmd. I'm not using cmd but zsh.

I meant to say that you need the same effective change. How to set it globally in Windows I don't actually know.

mojavelinux commented 8 years ago

The most efficient way to debug this issue is to reproduce the environment (Cygwin on a Windows image with Germany set as the region). I tried my hardest to reproduce the problem on Linux, but it's so hard to do because Linux tries as hard as it can to get encodings right. So it's like swimming against the tide ;)

thorstenkampe commented 8 years ago
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:106:in `match'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:106:in `block in registry_lookup'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:105:in `each_slice'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:105:in `each'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:105:in `map'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:105:in `registry_lookup'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:82:in `find_java'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:56:in `instance'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/util/java_socket.rb:51:in `load'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/plantuml/extension.rb:19:in `plantuml'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/plantuml/extension.rb:82:in `block in included'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/extensions.rb:164:in `instance_exec'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/extensions.rb:164:in `create_image_block'
/home/thorsten/.gem/ruby/gems/asciidoctor-diagram-1.5.1/lib/asciidoctor-diagram/extensions.rb:110:in `process'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/parser.rb:1088:in `[]'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/parser.rb:1088:in `build_block'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/parser.rb:883:in `next_block'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/parser.rb:315:in `next_section'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/parser.rb:67:in `parse'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/document.rb:471:in `parse'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor.rb:1344:in `load'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor.rb:1462:in `convert'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/cli/invoker.rb:94:in `block in invoke!'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/cli/invoker.rb:86:in `each'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/lib/asciidoctor/cli/invoker.rb:86:in `invoke!'
/home/thorsten/.gem/ruby/gems/asciidoctor-1.5.4/bin/asciidoctor:14:in `<top (required)>'
/home/thorsten/bin/asciidoctor:23:in `load'
/home/thorsten/bin/asciidoctor:23:in `<main>'
pepijnve commented 8 years ago

Ok that's unexpected; based on that stack trace I was looking entirely in the wrong direction. The code is failing when parsing values coming from the Windows registry. It does that on Windows in order to locate your JRE installation. Could you run the command

 reg query "HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft" /s /v JavaHome

in Cygwin and copy the output.

My Windows VM shows similar output to yours

$ asciidoctor -v
Asciidoctor 1.5.4 [http://asciidoctor.org]
Runtime Environment (ruby 2.2.5p319 (2016-04-26 revision 54774) [x86_64-cygwin]) (lc:UTF-8 fs:Windows-1252 in:- ex:UTF-8)

so I'm hoping that if I install a JVM in a path similar to yours that I'll get the same problem.

mojavelinux commented 8 years ago

At least we're getting somewhere!

This issue highlights that when we rescue an exception, we should probably provide a way to log the full backtrace (or print when verbose mode is on). We got bitten by this in core when we rescued the main load method. All kinds of errors were deceiving us.

Look for the $VERBOSE global variable to be true to check if verbose mode is on. (Unfortunately, we don't expose the value of the trace option, which would be ideal).

  warn e.backtrace if $VERBOSE
pepijnve commented 8 years ago

Good idea. I'll add that right here https://github.com/asciidoctor/asciidoctor-diagram/blob/master/lib/asciidoctor-diagram/extensions.rb#L118

pepijnve commented 8 years ago

After some further experimentation I think I can work around these issues by using open3 instead of backticks. I'll put that in place and then we can try again.

Seems to be an issue in how MRI is spawning the child process. You can see in https://github.com/ruby/ruby/blob/trunk/win32/win32.c#L1439 that there are variants that force UTF8 on the child process, but it's not clear to me how you can guarantee this particular function gets called. The calling code has a bunch of ifdef and special cases for cygwin which makes it difficult to trace.

I'm thinking the simpler alternative is going to be to use DL and/or fiddle to call the Win32 API to do the registry lookups. It's a bit of a pain to have to do this, but that will probably be simpler than trying to get the encoding problem fixed.

thorstenkampe commented 8 years ago

The output of reg query:

HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Development Kit\1.8
    JavaHome    REG_SZ    C:\Program Files\Java\jdk1.8.0_92

HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Development Kit\1.8.0_92
    JavaHome    REG_SZ    C:\Program Files\Java\jdk1.8.0_92

HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Plug-in\11.92.2
    JavaHome    REG_SZ    C:\Program Files\Java\jre1.8.0_92

HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\1.8
    JavaHome    REG_SZ    C:\Program Files\Java\jre1.8.0_92

HKEY_LOCAL_MACHINE\SOFTWARE\JavaSoft\Java Runtime Environment\1.8.0_92
    JavaHome    REG_SZ    C:\Program Files\Java\jre1.8.0_92

Suchvorgang abgeschlossen: 5 bereinstimmende Zeichenfolge(n) gefunden.
thorstenkampe commented 8 years ago

Two remarks:

pepijnve commented 8 years ago

RJB isn't used in asciidoctor diagram. If you're running under JRuby the java code is executed directly. Otherwise a Java subprocess is started that communicates with the main process over a TCP socket (it's a tiny http daemon actually).

The reg query output looks fine to me; no idea why you would get invalid UTF-8 sequence errors on those. I think more debug output is going to be the way forward here. Without that this is starting to feel like looking for a needle in a haystack.

I should have some time tomorrow evening to continue working on this.

pepijnve commented 7 years ago

@thorstenkampe I just released 1.5.2 which already contains some robustness improvements which may resolve the issues you've been seeing. If you have a moment could you try out that version?

thorstenkampe commented 7 years ago

This fixes the invalid byte sequence in UTF-8. error. Although another error appeared:

asciidoctor-diagram: ERROR: Failed to generate image: PlantUML image generation failed: GraphViz 'dot' tool at '/usr/bin/dot.EXE' is not executable
> /usr/bin/dot -V
dot - graphviz version 2.38.0 (20140413.2041)

> file =dot
/usr/bin/dot: PE32+ executable (console) x86-64, for MS Windows

> ls -l =dot
-rwx------ 1 thorsten None 11795 30. Jun 2015  /usr/bin/dot*
pepijnve commented 7 years ago

Ok making progress... The extension is finding the dot executable with the cygwin path. That style of path only works for cygwin aware binaries though. Other Windows tools don't support these and somewhere along the line something is failing on this. I'll run some more tests using Cygwin graphviz to see where this might be failing.

pepijnve commented 7 years ago

@thorstenkampe I found some time to run tests with Cygwin Ruby and Graphviz. Commit 615da2a contains the necessary changes to make this work. Root cause of this issue is Cygwin vs Windows paths. Asciidoctor-diagram determines the path to dot in the Cygwin environment and finds /usr/bin/dot. That was then being passed to Java which operates under the regular Windows environment which can't interpret /usr/bin/dot correctly. Solution consist of using cygpath -w on the path so that a Windows style path can be passed on to Java.