Including files that are not encoded in UTF-8

bedoro commented 5 years ago

We are currently evaluating Asciidoctor (through asciidoctor-maven-plugin) for the documentation of a very large codebase. We are trying to include .java source files directly into the documents. All our Java files are encoded using CP1252 and can contain special characters (e.g. ä/ö/ü). Now, I know that it would be much better to encode them in UTF-8, but there are historical and internal reasons why this is not an option.

When including the source file, we too get the error (ArgumentError) invalid byte sequence in UTF-8. Including source files directly into the document is one of the main reasons for using Asciidoctor.

Dan Allen already suggested using -Dfile.encoding=UTF-8 here: https://github.com/asciidoctor/asciidoctor/issues/2884#issuecomment-490830197. As you can see from the stacktrace I have posted in that thread, the flag is not changing anything for me.

Is there any way at all to work around this issue? Would it be an option to extend the include-directive to convert the strings to UTF-8 before converting them to binary? Are there any other options we could try or are we just SOL?

mojavelinux commented 5 years ago

Please provide a sample project so that we have something to test. Then, I'd be glad to look into it and think about where a change would need to go, if any.

bedoro commented 5 years ago

> Please provide a sample project so that we have something to test. Then, I'd be glad to look into it and think about where a change would need to go, if any.

Here is a minimal example: https://github.com/bedoro/asciidoctor-encoding-include-demo. It's just a basic pom with asciidoctor-maven-plugin and a few test adoc and java files.

Thanks so much for the help!

ysb33r commented 5 years ago

I demonstrated a safe workaround for now in Gitter. Basically if you add an includeProcessor extension you can do the conversion from CP1252 to UTF-8 in there.

There is an example of how to do this in https://github.com/asciidoctor/asciidoctorj/issues/815. The code itself is relatively simple (shown here in Groovy, but it should be easy to translate into a Java extension):

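// Handle only include targets that end in .java
include_processor(filter: { it.endsWith('.java') }) { doc, reader, target, attrs ->
    // Open the file as CP1252; the JVM decodes it to a Unicode string
    new File(target).withReader('CP1252') { r ->
        // Push the decoded text back to the reader in place of the raw include
        reader.push_include(r.text, target, target, 1, attrs)
    }
}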

Basically you need to filter on only targets that end in .java, then open the file specifying CP1252 explicitly - Java will do the rest to get it to Unicode. Finally you need to push this converted content back to the reader.
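For anyone translating this to Java, the decoding step on its own might look like the sketch below. It uses only the JDK, not the AsciidoctorJ API; the class and method names (`Cp1252IncludeReader`, `handles`, `readAsUnicode`) are hypothetical and chosen for illustration. Inside an actual include processor, the returned string would be pushed back to the reader in the same way the Groovy snippet does.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: not part of AsciidoctorJ, names are illustrative only.
public class Cp1252IncludeReader {

    // "windows-1252" is the charset name the JDK uses for CP1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Mirrors the filter in the Groovy snippet: only .java targets are handled.
    static boolean handles(String target) {
        return target.endsWith(".java");
    }

    // Reads the raw bytes and decodes them as CP1252 into a Java (Unicode) string.
    static String readAsUnicode(Path file) throws IOException {
        return new String(Files.readAllBytes(file), CP1252);
    }

    public static void main(String[] args) throws IOException {
        // Write a small CP1252-encoded file: "// " followed by a-umlaut, o-umlaut, u-umlaut.
        Path tmp = Files.createTempFile("Umlaut", ".java");
        Files.write(tmp, new byte[] {'/', '/', ' ', (byte) 0xE4, (byte) 0xF6, (byte) 0xFC});
        try {
            if (handles(tmp.toString())) {
                System.out.println(readAsUnicode(tmp));
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Naming the charset explicitly is the important part: the result then does not depend on the platform default or on -Dfile.encoding.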

mojavelinux commented 5 years ago

If a custom include processor is the only way to solve this, then a change is definitely needed. I want to do a little bit of experimentation first to see if there really is no other option.

bedoro commented 5 years ago

I was able to solve the problem thanks to @ysb33r's suggestions and help.

I've implemented an IncludeProcessor that handles includes ending with .java and forces them to be opened with the correct encoding.

mojavelinux commented 5 years ago

I'd like to further understand our options before closing.

ysb33r commented 5 years ago

@mojavelinux I think in AsciidoctorJ we are fortunate because we already have encoding functionality built in with the JDK. The problem might arise if you want to do this in Ruby-core, as I am not sure how much direct support you have for this in Ruby except maybe through using ICU.

mojavelinux commented 9 months ago

Asciidoctor now permits the encoding of the included file to be specified on the include directive using the encoding attribute. The value has to be an encoding name recognized by Ruby (e.g., windows-1252). See https://docs.asciidoctor.org/asciidoc/latest/directives/include/#include-syntax
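As a rough sketch of what that looks like in a document (the file path is hypothetical, and this assumes an Asciidoctor version recent enough to support the attribute):

```asciidoc
[source,java]
----
include::src/main/java/com/example/Umlaute.java[encoding=windows-1252]
----
```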
