Thank you for bringing this to our attention. Do you know if this issue only affects XML processing, or are you able to reproduce it in other contexts?
We have also observed it when we get values from one of our internal Java APIs, e.g. String output = api.getValue("key");
Can you write a small program that demonstrates the problem?
I've also created a separate case with AWS support, where the conclusion was to use the workaround, as defaults may vary between setups/environments.
But just to continue: the program strips the outer envelope elements from the XML and keeps only the body, and during that transformation the special characters are converted to ?.
package test;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class TestRemoveEnvelope {
    public static void main(String[] args) throws SAXException, IOException {
        FileInputStream ins = new FileInputStream("C:/test/input.xml");
        FileOutputStream out = new FileOutputStream("C:/test/output.xml");
        String bodyXpath = "/Envelope/Body";
        String encoding = "UTF-8";
        PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);
        DefaultHandler handler = new RemoveEnvelopeHandler(pw, bodyXpath);
        XMLReader reader = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
        reader.setContentHandler(handler);
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        reader.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        reader.parse(new InputSource(ins));
        pw.flush();
    }
}
package test;
import java.io.IOException;
import java.io.PrintWriter;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
public class RemoveEnvelopeHandler extends DefaultHandler {
    private PrintWriter pw;
    private StringBuilder contextPath = new StringBuilder("");
    private String bodyXpath;
    private StringBuilder currentData = new StringBuilder();
    private boolean isBodyFound;
    private boolean inBody;

    public RemoveEnvelopeHandler(PrintWriter pw, String bodyXpath) throws SAXException, IOException {
        this.bodyXpath = bodyXpath;
        this.pw = pw;
    }

    @Override
    public void startDocument() throws SAXException {
        isBodyFound = inBody = bodyXpath == null || bodyXpath.length() == 0 || bodyXpath.equals("/");
    }

    @Override
    public void endDocument() throws SAXException {
        if (!isBodyFound) {
            throw new SAXException("No body was found for body xpath: " + bodyXpath);
        }
    }
    @Override
    public void startElement(String uri, String localName,
            String qName, Attributes attributes) throws SAXException {
        contextPath.append("/").append(qName);
        if (!inBody) {
            // When the contextPath matches the target bodyXpath, toggle inBody
            inBody = contextPath.toString().equalsIgnoreCase(bodyXpath);
            if (inBody) {
                isBodyFound = inBody;
            }
            debug("Skipping start element for path: " + contextPath);
        } else {
            // Parsing elements inside the body (inBody)
            currentData.delete(0, currentData.length());
            StringBuilder attribBuffer = null;
            String attribString = "";
            for (int i = 0; i < attributes.getLength(); i++) {
                if (attribBuffer == null) {
                    attribBuffer = new StringBuilder(" ");
                }
                attribBuffer.append(attributes.getQName(i)).append("=\"");
                attribBuffer.append(attributes.getValue(i));
                attribBuffer.append("\" ");
            }
            if (attribBuffer != null) {
                debug("Found attributes: " + attribBuffer);
                attribString = attribBuffer.toString();
            }
            // Print the element to the output stream
            pw.print("<" + qName + attribString + ">");
            debug("Wrote start element: " + contextPath);
        }
    }
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inBody) {
            String data = currentData.toString();
            if (data.length() > 0) {
                pw.print(data);
                currentData.delete(0, currentData.length());
            }
            if (bodyXpath != null && bodyXpath.equalsIgnoreCase(contextPath.toString())) {
                inBody = false;
            }
        }
        contextPath.delete(contextPath.length() - qName.length() - 1, contextPath.length());
        if (inBody) {
            String element = "</" + qName + ">";
            pw.print(element);
            debug("Wrote end element: " + contextPath);
        } else {
            debug("Skipping end element for path: " + contextPath);
        }
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (!inBody) {
            return;
        }
        String data = new String(ch, start, length);
        if (data.isEmpty() || data.trim().isEmpty()) {
            return;
        }
        // Re-escape the XML special characters before writing the text back out
        if (data.indexOf('&') > -1) {
            data = data.replace("&", "&amp;");
        }
        if (data.indexOf('<') > -1) {
            data = data.replace("<", "&lt;");
        }
        if (data.indexOf('>') > -1) {
            data = data.replace(">", "&gt;");
        }
        currentData.append(data);
    }
    // No-op in the reproducer; hook for logging if needed
    private void debug(String s) {
    }
}
Thank you for the reproducer code. I don't suppose you could attach the input XML file to the ticket? Since it is a question of encoding, it would be useful to have the input file. I agree with support that Linux and Windows may configure the default encoding differently. My main concern with this ticket is your statement that the behavior has changed between one release and another on the same platform (Linux). Can you confirm that the encoding works on Linux with 17.0.9+8-LTS, but not with 17.0.10?
Input: <?xml version="1.0" encoding="ISO-8859-1"?>
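(For illustration only: a minimal file with that header and one accented value might look like the following. The element names follow the /Envelope/Body path used in the reproducer; the content is an assumption, not the actual file.)
<?xml version="1.0" encoding="ISO-8859-1"?>
<Envelope>
    <Body>
        <Name>Renée</Name>
    </Body>
</Envelope>
With the reported behavior the é comes out as ? in the output, whereas the expected output is <Name>Renée</Name>.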
Yes, we have seen it work with 17.0.9+8-LTS on os.version=5.15.0-1051-azure. We recently updated the JDK to 17.0.10+7-LTS in this 5.15.0-1051-azure environment and it is working fine. Could there be something in 4.14.334-252.552.amzn2.x86_64 vs 5.15.0-1051-azure?
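(To compare the defaults between the two environments, a small check of the relevant system properties can help; this is a minimal sketch and the class name is illustrative.)
package test;

import java.nio.charset.Charset;

public class EnvCheck {
    public static void main(String[] args) {
        // Default charset used whenever no explicit encoding is given
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println("file.encoding  = " + System.getProperty("file.encoding"));
        System.out.println("os.version     = " + System.getProperty("os.version"));
        System.out.println("java.version   = " + System.getProperty("java.runtime.version"));
    }
}
Running this on both the amzn2 and the azure hosts would show whether the default charset differs between them.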
Hi @ggamiranda, there are a couple of things I'd like you to check:
Check the OutputStreamWriter encoding. Instead of
PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);
write
OutputStreamWriter outWriter = new OutputStreamWriter(out, encoding);
System.out.println(outWriter.getEncoding());
PrintWriter pw = new PrintWriter(outWriter, true);
In characters(), are you sure you are getting the correct data and not already-corrupted ? characters? Your input XML document goes through character decoding when it is parsed. As you don't specify its encoding, some default is used. The input document content is decoded into UTF-16 by the SAXParser, which your PrintWriter then encodes into UTF-8. You need to check that the content is correctly decoded into UTF-16.
If you know the encoding of your input xml document, you can use:
InputSource inSource = new InputSource(ins);
inSource.setEncoding(...)
reader.parse(inSource);
or
reader.parse(new InputSource(new InputStreamReader(ins, doc_encoding)));
instead of
reader.parse(new InputSource(ins));
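Putting those checks together, a minimal sketch of the reproducer's main method with the input encoding made explicit and the writer encoding printed (the ISO-8859-1 value follows the header shared above, and the RemoveEnvelopeHandler from the reproducer is reused; this is a suggestion, not the original code):
package test;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class TestRemoveEnvelopeExplicitEncoding {
    public static void main(String[] args) throws SAXException, IOException {
        FileInputStream ins = new FileInputStream("C:/test/input.xml");
        FileOutputStream out = new FileOutputStream("C:/test/output.xml");

        // Make the output encoding visible instead of relying on platform defaults
        OutputStreamWriter outWriter = new OutputStreamWriter(out, "UTF-8");
        System.out.println("writer encoding = " + outWriter.getEncoding());
        PrintWriter pw = new PrintWriter(outWriter, true);

        // Tell the parser the input encoding explicitly (matches the document's declared ISO-8859-1)
        InputSource inSource = new InputSource(ins);
        inSource.setEncoding("ISO-8859-1");

        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(new RemoveEnvelopeHandler(pw, "/Envelope/Body"));
        reader.setFeature("http://xml.org/sax/features/namespaces", true);
        reader.parse(inSource);
        pw.flush();
    }
}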
Thanks for the response. I tried to replicate the issue using the code above and the same environment, but I can't replicate it, so there might be other factors affecting it. I will add some debug logging and see what I can find.
The issue was found to be in our application. We had received some incorrect information that led us to believe this was a JDK/AWS-specific issue. Apologies and thanks!
We have observed an issue on Linux with JDK 17.0.10+7-LTS and system file.encoding=ANSI_X3.4-1968 where special characters in the output are converted to ?.
To Reproduce
Code snippet that strips the root element (envelope) and outputs the body (inner elements) of the XML:
String encoding = "UTF-8";
PrintWriter pw = new PrintWriter(new OutputStreamWriter(out, encoding), true);
DefaultHandler handler = new RemoveEnvelopeHandler(pw, bodyXpath, manifest, cat);
XMLReader reader = SAXParserPool.getParser();
reader.setContentHandler(handler);
reader.setFeature("http://xml.org/sax/features/namespaces", true);
reader.parse(new InputSource(ins));
pw.flush();
Expected behavior
The expected behavior is that the special characters are properly converted.
Additional context
Works okay on Windows with the same JDK version and encoding. It also works on Linux with 17.0.9+8-LTS and encoding=ANSI_X3.4-1968.
The workaround we applied was to set the system file.encoding in the JVM to UTF-8.
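In practice the workaround amounts to starting the JVM with an explicit default encoding, e.g. (the jar name is a placeholder):
java -Dfile.encoding=UTF-8 -jar app.jar
ANSI_X3.4-1968 is the canonical name of US-ASCII, which cannot represent accented characters, so any output path that falls back to that default replaces them with ?, consistent with the symptom reported here.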