databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

Security Risk of using xmlschema-core as a dependency #567

Closed ericsun95 closed 2 years ago

ericsun95 commented 2 years ago

After version 0.10.0 we introduce a new dependency https://mvnrepository.com/artifact/org.apache.ws.xmlschema/xmlschema-core to the project. However, that package use TransformerFactory, DocumentBuilderFactory etc in an unsafe way that is vulnerable to be attached. Those weakly configured XML Parser will allow attacker injected untrusted data into application. The safest way to prevent XXE is always to disable DTDs (External Entities) completely. Do we truly need to rely on that dependency? Do we have alternatives?

srowen commented 2 years ago

Yes, it's needed to parse XSDs, IIRC. I believe DTD parsing is disabled. Can you be more specific about the threat here, ideally proposing some change that would lock down parsing more?

ericsun95 commented 2 years ago

Take this as an example, https://github.com/apache/ws-xmlschema/blob/f08d3a339fd546878b6848c8987fb9cb05fbb1c8/xmlschema-core/src/main/java/org/apache/ws/commons/schema/XmlSchema.java#L885-L886, it's better to specify attributes like these to prevent XML External Entity (XXE).

trFac.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
trFac.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");

Similarly for the DocumentBuilderFactory etc. For the xmlInputFactory we are using, probably do something like this?

factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // disable external entities
factory.setProperty("javax.xml.stream.isSupportingExternalEntities", false); 
srowen commented 2 years ago

In this project, XSDs are only used for validation against the schema rather than transforming XML. It looks like the two properties you mention can be set on SchemaFactory. Hm, I'm wondering whether we should set those, because I could imagine legitimately wanted to let an XSD Schema refer to external resources in this use case. An XSD Schema in this case would be controlled by the 'caller', not from a bunch of XML some application is ingesting, so I think the risk of some issue is smaller - if you can corrupt an XSD Schema file somewhere internally you can probably do quite a bit more.

For parsing of XML I agree this is an issue because the XML passing through some job using this to read it might be from untrusted sources. Both of the properties you suggest are already set.

ericsun95 commented 2 years ago

Yeah, I see a similar thread for this. Would you be able to release a new version for that?

srowen commented 2 years ago

Eventually, yes, usually every 3-6 months. If there's no other action for the XSD transformer I'll close this, but open to arguments about changing that too.