Special characters are converted to ? when writing to an output

ggamiranda commented 9 months ago

We have observed an issue on Linux with JDK 17.0.10+7-LTS and the system property file.encoding=ANSI_X3.4-1968: special characters are converted to ? in the output.

To Reproduce

The code snippet below strips the root element (the envelope) and outputs the body, i.e. the inner elements of the XML.

String encoding = "UTF-8";
PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);
DefaultHandler handler = new RemoveEnvelopeHandler(pw, bodyXpath, manifest, cat);
XMLReader reader = SAXParserPool.getParser();
reader.setContentHandler(handler);
reader.setFeature("http://xml.org/sax/features/namespaces", true);
reader.parse(new InputSource(ins));
pw.flush();

Expected behavior

The expected behavior is that the special characters are converted correctly and written to the output unchanged.

Platform information

OS: Amazon Linux 2
Version OpenJDK 64-Bit Server VM [Amazon.com Inc. 17.0.10+7-LTS]

Additional context

Works fine on Windows with the same JDK version and encoding. It also works on Linux with 17.0.9+8-LTS and encoding=ANSI_X3.4-1968.

Our workaround was to set the system property file.encoding to UTF-8 in the JVM.
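For context, ANSI_X3.4-1968 is an alias of US-ASCII, and Java replaces any character a charset cannot represent with that charset's replacement byte when encoding, which is ? for US-ASCII. The sketch below illustrates that substitution; the class name `DefaultCharsetDemo` and its `encode` helper are hypothetical and not part of the reporter's application.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the reporter's application.
final class DefaultCharsetDemo {

    // Encodes text with the given charset. Characters the charset cannot
    // represent are replaced with its replacement byte ('?' for US-ASCII).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        // Shows what the JVM picked up from the environment.
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset = " + Charset.defaultCharset());

        String text = "alpha_\u00e5\u00e4\u00f6";

        // The same text encoded with US-ASCII and with UTF-8.
        byte[] ascii = encode(text, StandardCharsets.US_ASCII);
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);

        // Decode each byte array with the charset it was encoded in.
        System.out.println(new String(ascii, StandardCharsets.US_ASCII));
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Any code path that falls back to the default charset (for example a writer or `getBytes()` call without an explicit charset) would show the same substitution when file.encoding resolves to US-ASCII, which is why setting file.encoding to UTF-8 hides the symptom.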

earthling-amzn commented 9 months ago

Thank you for bringing this to our attention. Do you know if this issue only affects XML processing, or are you able to reproduce it in other contexts?

ggamiranda commented 9 months ago

We have also observed it when we get values from one of our internal Java APIs, e.g.: String output = api.getValue("key");

earthling-amzn commented 9 months ago

Can you write a small program that demonstrates the problem?

ggamiranda commented 9 months ago

I've also created a separate case with AWS Support, where the conclusion was to use the workaround, as defaults may vary between setups and environments.

But just to continue: the program strips the XML header elements from the body, and during the transformation the special characters are converted to ?.

package test;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class TestRemoveEnvelope {

    public static void main(String[] args) throws SAXException, IOException {

        FileInputStream ins = new FileInputStream("C:/test/input.xml");
        FileOutputStream out = new FileOutputStream("C:/test/output.xml");

        String bodyXpath = "/Envelope/Body";
        String encoding = "UTF-8";
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);
        DefaultHandler handler = new RemoveEnvelopeHandler(pw, bodyXpath);
        XMLReader reader = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
        reader.setContentHandler(handler);
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        reader.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        reader.parse(new InputSource(ins));

        pw.flush();
    }
}

package test;

import java.io.IOException;
import java.io.PrintWriter;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RemoveEnvelopeHandler extends DefaultHandler {

    private PrintWriter pw;
    private StringBuilder contextPath = new StringBuilder("");
    private String bodyXpath;
    private StringBuilder currentData = new StringBuilder();
    private boolean isBodyFound;
    private boolean inBody;

    public RemoveEnvelopeHandler(PrintWriter pw, String bodyXpath) throws SAXException, IOException {
        this.bodyXpath = bodyXpath;
        this.pw = pw;

    }

    @Override
    public void startDocument() throws SAXException {
        isBodyFound = inBody = bodyXpath == null || bodyXpath.length() == 0 || bodyXpath.equals("/");
    }

    @Override
    public void endDocument() throws SAXException {
        if (!isBodyFound) {
            throw new SAXException("No body was found for body xpath: " + bodyXpath);
        }
    }

    @Override
    public void startElement(String uri, String localName, 
        String qName, Attributes attributes) throws SAXException {
        contextPath.append("/").append(qName);
        if (!inBody) {
            // When the contextPath matches the target bodyXpath toggle inBody
            inBody = contextPath.toString().equalsIgnoreCase(bodyXpath);
            if(inBody) {
                isBodyFound = inBody;
            }
            debug("Skipping start element for path: " + contextPath);
        } else {
            // Parsing elements inside the body (inBody)
            currentData.delete(0, currentData.length());
            StringBuilder attribBuffer = null;
            String attribString = "";

            for (int i = 0; i < attributes.getLength(); i++) {
                if (attribBuffer == null) {
                    attribBuffer = new StringBuilder(" ");
                }
                attribBuffer.append(attributes.getQName(i)).append("=\"");
                attribBuffer.append(attributes.getValue(i));
                attribBuffer.append("\" ");
            }
            if (attribBuffer != null) {
                debug("Found attributes: " + attribBuffer);
                attribString = attribBuffer.toString();
            }

            // Print the element to the output stream
            pw.print("<" + qName  + attribString + ">");
            debug("Wrote start element: " + contextPath);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inBody) {
            String data = currentData.toString();
            if (data.length() > 0) {
                pw.print(data);
                currentData.delete(0, currentData.length());
            }
            if (bodyXpath != null && bodyXpath.equalsIgnoreCase(contextPath.toString())) {
                inBody = false;
            }
        }

        contextPath.delete(contextPath.length() - qName.length() - 1, contextPath.length());
        if (inBody) {
            String element = "</" + qName + ">";
            pw.print(element);

            debug("Wrote end element: " + contextPath);
        }
        else {
            debug("Skipping end element for path: " + contextPath);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {

        if (!inBody) {
            return;
        }

        String data = new String(ch, start, length);
        if (data.isEmpty() || data.trim().isEmpty()) {
            return;
        }
        if (data.indexOf('&') > -1){
            data = data.replace("&", "&amp;");
        }
        if (data.indexOf('<') > -1){
            data = data.replace("<", "&lt;");
        }
        if (data.indexOf('>') > -1){
            data = data.replace(">", "&gt;");
        }
        currentData.append(data);
    }

    private void debug(String s) {
    }
}

earthling-amzn commented 9 months ago

Thank you for the reproducer code. I don't suppose you could attach the input XML file to the ticket? Since this is a question of encoding, it would be useful to have the input file. I agree with support that Linux and Windows may configure the default encoding differently. My main concern with this ticket is your statement that the behavior has changed between one release and another on the same platform (Linux). Can you confirm that the encoding works on Linux with 17.0.9+8-LTS, but not with 17.0.10?

ggamiranda commented 9 months ago

Input: <?xml version="1.0" encoding="ISO-8859-1"?>

alpha_ååå_äää_ööö bravo_ååå_äää_ööö charlie_ååå_äää_ööö

Yes, we have seen it work with 17.0.9+8-LTS on os.version=5.15.0-1051-azure. We recently updated the JDK to 17.0.10+7-LTS in this 5.15.0-1051-azure environment and it is working fine. Could there be a difference between 4.14.334-252.552.amzn2.x86_64 and 5.15.0-1051-azure?

eastig commented 9 months ago

Hi @ggamiranda, there are a couple of things I'd like you to check:

Check OutputStreamWriter encoding. Instead of

PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);

write

OutputStreamWriter outWriter = new OutputStreamWriter(out, encoding);
System.out.println(outWriter.getEncoding());
PrintWriter pw = new PrintWriter(outWriter, true);

In characters(), are you sure you are getting the correct data rather than ? characters? Your input XML document goes through character decoding when it is parsed. As you don't specify its encoding, some default one is used. The SAXParser decodes the input document content into UTF-16, which your PrintWriter then encodes into UTF-8. You need to check that the content is correctly decoded into UTF-16.
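One way to run that check without depending on the console or log encoding is to print the code points that characters() receives. This is only a sketch: `CodePointDump` and `dump` are hypothetical names, and the call would have to be added to the handler's characters() method by hand.

```java
// Hypothetical helper for debugging; not part of the reproducer.
final class CodePointDump {

    // Renders each code point of the given range as U+XXXX so that a real
    // 'å' (U+00E5) can be told apart from a substituted '?' (U+003F),
    // regardless of how the console itself encodes output.
    static String dump(char[] ch, int start, int length) {
        StringBuilder sb = new StringBuilder();
        new String(ch, start, length)
                .codePoints()
                .forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        char[] sample = "a\u00e5?".toCharArray();
        System.out.println(dump(sample, 0, sample.length));
    }
}
```

If the dump already shows U+003F where å was expected, the data was lost before or during parsing; if it shows U+00E5, the loss happens later, on the output side.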

eastig commented 9 months ago

If you know the encoding of your input xml document, you can use:

InputSource inSource = new InputSource(ins);
inSource.setEncoding(...)
reader.parse(inSource);

or

reader.parse(new InputSource(new InputStreamReader(ins, doc_encoding)));

instead of

reader.parse(new InputSource(ins));
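Put together, the first variant might look like the sketch below. It assumes the input is ISO-8859-1, as the declaration in the posted input file suggests, and uses an in-memory byte array instead of a file. The class `ExplicitEncodingDemo`, the method `parseText`, and the sample document are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the sample document is made up for illustration.
final class ExplicitEncodingDemo {

    // Parses the XML bytes using the given encoding and returns all
    // character data the parser reports.
    static String parseText(byte[] xml, String encoding) throws Exception {
        StringBuilder text = new StringBuilder();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });

        InputSource source = new InputSource(new ByteArrayInputStream(xml));
        source.setEncoding(encoding); // tell the parser how to decode the bytes
        reader.parse(source);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document encoded as ISO-8859-1, without an XML declaration.
        byte[] xml = "<Body>\u00e5\u00e4\u00f6</Body>".getBytes(StandardCharsets.ISO_8859_1);
        String text = parseText(xml, "ISO-8859-1");

        // Re-encode the decoded text as UTF-8, as the reproducer's writer does.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text.length() + " chars decoded, " + utf8.length + " bytes as UTF-8");
    }
}
```

Passing the encoding explicitly keeps the decoding step independent of file.encoding and of whatever the parser would otherwise detect from the byte stream.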

ggamiranda commented 9 months ago

Thanks for the response. I tried to replicate the issue using the code above in the same environment, but I couldn't, so there might be other factors affecting it. I will add some debug output and see what I can find.

ggamiranda commented 9 months ago

The issue turned out to be in our application. We had received some incorrect information that led us to believe this was a JDK/AWS-specific issue. Apologies and thanks!
