Closed bedoro closed 9 months ago
Please provide a sample project so that we have something to test. Then, I'd be glad to look into it and think about where a change would need to go, if any.
Here is a minimal example: https://github.com/bedoro/asciidoctor-encoding-include-demo. It's just a basic pom with asciidoctor-maven-plugin and a few test adoc and java files.
Thanks so much for the help!
I demonstrated a safe workaround for now in Gitter. Basically if you add an includeProcessor extension you can do the conversion from CP1252 to UTF-8 in there.
There is an example of how to do this in https://github.com/asciidoctor/asciidoctorj/issues/815. The code itself is relatively simple (shown here in Groovy, but it should be easy to translate into a Java extension):
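// Handle only include targets that end in .java
include_processor(filter: { it.endsWith('.java') }) { doc, reader, target, attrs ->
    // Open the file as CP1252; the text read from it is already Unicode
    new File(target).withReader('CP1252') { r ->
        // Push the decoded content back to the reader, starting at line 1
        reader.push_include(r.text, target, target, 1, attrs)
    }
}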
Basically you need to filter on only targets that end in .java, then open the file specifying CP1252 explicitly. Java will do the rest to get it to Unicode. Finally you need to push this converted content back to the reader.
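For anyone translating this to Java, the decoding step on its own might look roughly like the sketch below. It uses only the JDK; the class name Cp1252Decoder and its methods are hypothetical helpers, not part of AsciidoctorJ, and the resulting String would still have to be pushed back to the reader from inside your own include processor.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration; not part of AsciidoctorJ. */
public final class Cp1252Decoder {
    // "windows-1252" is the java.nio name for the CP1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    private Cp1252Decoder() {}

    /** Decodes CP1252 bytes into a Java String (Unicode internally). */
    public static String decode(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, CP1252);
    }

    /** Reads a CP1252-encoded file and returns its decoded text. */
    public static String read(Path file) {
        try {
            return decode(Files.readAllBytes(file));
        } catch (IOException e) {
            // Wrapped so callers are not forced to handle a checked exception.
            throw new UncheckedIOException(e);
        }
    }

    /** Re-encodes decoded text as UTF-8 bytes, e.g. to write a converted copy. */
    public static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }
}
```

Specifying the charset explicitly is the important part: it keeps the result independent of the platform default and of -Dfile.encoding.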
If a custom include processor is the only way to solve this, then a change is definitely needed. I want to do a little bit of experimentation first to see if there really is no other option.
I was able to solve the problem thanks to @ysb33r's suggestions and help.
I've implemented an IncludeProcessor that handles includes ending with .java and forces opening them with the correct encoding.
I'd like to further understand our options before closing.
@mojavelinux I think in AsciidoctorJ we are fortunate because encoding functionality is already built into the JDK. The problem might arise if you want to do this in Ruby core, as I am not sure how much direct support Ruby has for this, except maybe through ICU.
Asciidoctor now permits the encoding of an include file to be specified on the include directive using the encoding attribute. The value has to be a recognized encoding name in Ruby (e.g., windows-1252). See https://docs.asciidoctor.org/asciidoc/latest/directives/include/#include-syntax
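For the case in this issue, the directive would look something like the following. Sample.java is a hypothetical file name used only for illustration; the attribute name and value are the ones mentioned above.

```asciidoc
// Sample.java is a hypothetical file name
[source,java]
----
include::Sample.java[encoding=windows-1252]
----
```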
We are currently evaluating Asciidoctor (through asciidoctor-maven-plugin) for the documentation of a very large codebase. We are trying to include .java source files directly into the documents. All our Java files are encoded using CP1252 and can contain special characters (e.g., ä/ö/ü). Now, I know that it would be much better to encode them in UTF-8, but there are historical and internal reasons why this is not an option.
When including the source file, we too get the error (ArgumentError) invalid byte sequence in UTF-8. Including source files directly into the document is one of the main reasons for using Asciidoctor.
Dan Allen already suggested using -Dfile.encoding=UTF-8 here: https://github.com/asciidoctor/asciidoctor/issues/2884#issuecomment-490830197. As you can see from the stacktrace I have posted in that thread, the flag is not changing anything for me.
Is there any way at all to work around this issue? Would it be an option to extend the include directive to convert the strings to UTF-8 before converting them to binary? Are there any other options we could try or are we just SOL?