databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

Data leakage in exception message #593

Closed Mike-Robert-CHC closed 2 years ago

Mike-Robert-CHC commented 2 years ago

When Spark XML detects malformed input data, sensitive information contained in the record can be leaked into logs by the exception message in StaxXmlParser.scala lines 106-107:

throw new IllegalArgumentException(
  s"Malformed line in FAILFAST mode: ${abbreviatedRecord.replaceAll("\n", "")}", cause)

May we change this to the following?

logger.debug("Malformed line:", abbreviatedRecord.replaceAll("\n", ""))
throw new IllegalArgumentException("Malformed line in FAILFAST mode", cause)
srowen commented 2 years ago

Hm, what's the difference w.r.t. security here. The caller has access to both, presumably. Putting it in the exception seems a little more helpful and immediate. (Though sticking a whole record in an exception message could be messy.)

Mike-Robert-CHC commented 2 years ago

More people have access to our logs than the raw data that is processed by the application. Saving log4j messages is configurable by a log4j.config, but exceptions are more difficult -- partly because this presents as a SparkException (with details in the inner exception) and partly because the exception is thrown lazily by the first action (and that action is disjointed from the Spark XML configuration).

I am happy for other suggestions. The smallest change I could think of was moving abbreviatedRecord to a log. For the tightest security, I considered removing abbreviatedRecord, but considered others may use it for debugging.

srowen commented 2 years ago

I get it. This is probably OK. I'll comment on the PR