asciidoctor / asciidoctor-maven-plugin

A Maven plugin that uses Asciidoctor via JRuby to process AsciiDoc source files within the project.
http://asciidoctor.org
Apache License 2.0

problem file encoding (Umlaute) for external PlantUML diagrams #586

Open wumpz opened 2 years ago

wumpz commented 2 years ago

I am not sure whether this is the right place or whether this belongs in the asciidoctor-diagram project. Hopefully this is the right one.

My Maven project's source code is (or should be) completely UTF-8. Now I want to build a Maven site whose pages are Asciidoctor files, and include a PlantUML diagram that comes from an external file. The diagram is generated, but it always seems to have the wrong encoding, whereas internal diagrams are rendered correctly.

So how do I tell Asciidoctor that these diagram files are UTF-8?

What I did / tried so far:

  1. changed file.encoding when starting Maven (-Dfile.encoding=UTF-8)
  2. defined the project source encoding in Maven
  3. defined the project reporting encoding in Maven
  4. tried different Java versions
  5. tried to configure the default_external parameter, which had no effect
  6. changed the defined project encodings to see whether anything changes

BTW, my environment is Windows 11; Java 8, 11, and 17; Maven 3.6 and 3.8.

I attached a minimal Maven project (asciidoctor1.zip). Just run site:site or look into the target directory I included.

Look into target/site directory:

So it seems that Asciidoctor (diagram) always tries to use Cp1252 for external PlantUML files, which is strange, since I already set the file encoding to UTF-8.

So what did I do wrong?
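
The symptom described above is what you would see if UTF-8 bytes were decoded with a single-byte charset. The following is only a minimal sketch of that effect, not a reproduction of what asciidoctor-diagram does internally; the class and method names are hypothetical. ISO-8859-1 stands in for Cp1252 because it is guaranteed to exist on every JVM and maps the bytes used here identically. Unicode escapes keep the example independent of the encoding of the Java source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode the text as UTF-8, then decode those bytes with another charset,
    // as a reader using the wrong encoding would.
    public static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // "\u00fc" is the umlaut u-with-diaeresis; in UTF-8 it is the two bytes C3 BC.
        String garbled = misread("f\u00fcr", StandardCharsets.ISO_8859_1);
        // Each of the two bytes becomes its own character, so the result is
        // one character longer than the input.
        System.out.println(garbled.length());
    }
}
```

If the rendered diagram shows two odd characters wherever one umlaut was expected, that pattern points to this kind of mismatch rather than to a corrupt file.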

abelsromero commented 2 years ago

There's something here, but I need to set up a Windows VM, so it may take some extra time to answer.

Files should already be UTF-8; Asciidoctor does not understand other encodings, and on non-Windows OSs the example just crashes when processing the cp1252 file. Why cp1252 works on Windows and UTF-8 does not is what I need to research. We only use project.build.sourceEncoding to copy resources, which you don't do in the example.

I understand that the end goal is to have all files in UTF-8, right? Mixing encodings is never going to work.

wumpz commented 2 years ago

Right. Everything should be UTF-8. I only included the cp1252 file as a test and got lucky. However, using ISO-8859-1 works as well; it is the same encoding, at least for those characters.

If you remove the cp1252 file, does a non-Windows machine render the UTF-8 .puml files correctly?

abelsromero commented 2 years ago

> If you remove the cp1252 file, does a non-Windows machine render the UTF-8 .puml files correctly?

Yes. In fact, non-Windows systems (testing on macOS now) crash outright with org.jruby.exceptions.ArgumentError: (ArgumentError) asciidoctor: FAILED: <stdin>: Failed to load AsciiDoc document - invalid byte sequence in UTF-8. That's a common question about Asciidoctor; you can find several reports by searching for it.

That's why I am puzzled that you get the opposite effect, and I need to do some research. I know Windows does not crash, but using cp1252 as the default 🤔
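
The "invalid byte sequence in UTF-8" error above comes from JRuby, but the underlying condition can be checked from plain Java before the build runs. The sketch below uses a strict UTF-8 decoder to test whether a byte array is well-formed UTF-8; the class and method names are hypothetical and it is not part of the plugin. ISO-8859-1 again stands in for cp1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8Check {

    // Returns true only if the bytes form well-formed UTF-8.
    // REPORT makes the decoder throw instead of silently substituting U+FFFD.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The same text encoded two ways; "\u00fc" is u-with-diaeresis.
        byte[] singleByte = "f\u00fcr".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = "f\u00fcr".getBytes(StandardCharsets.UTF_8);
        System.out.println("single-byte encoded valid as UTF-8: " + isValidUtf8(singleByte));
        System.out.println("UTF-8 encoded valid as UTF-8: " + isValidUtf8(utf8));
    }
}
```

To check a real file, the bytes could be read with java.nio.file.Files.readAllBytes and passed to the same method.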

wumpz commented 2 years ago

Strange. This should be the same as starting Java with -Dfile.encoding=UTF-8. Is another JVM instance started somehow during the rendering process? At the moment, Cp1252 is Java's default encoding on Windows, whereas on Linux and macOS it's UTF-8.
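
One way to test the assumption about the default encoding is to print what the JVM actually reports. The sketch below is a hypothetical helper, not something the plugin provides. It shows the file.encoding system property next to Charset.defaultCharset(), since the two can be inspected independently. Note that since JDK 18 (JEP 400) the default charset is UTF-8 on all platforms, so the Windows Cp1252 default applies to the Java 8, 11, and 17 versions mentioned in this thread.

```java
import java.nio.charset.Charset;

public class DefaultEncodingProbe {

    // Summarize the encoding-related settings of the running JVM.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
            + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this once with and once without -Dfile.encoding=UTF-8 shows whether the flag reaches the JVM you are inspecting. It says nothing about a separately started process or about JRuby's own default_external setting, which would need to be checked separately.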