databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Reading file with ContentType application/octet-stream #667

Closed: mahmoud-masmoudi-dev closed this issue 10 months ago

mahmoud-masmoudi-dev commented 10 months ago

Reading a gzipped XML file with `spark.read.format("xml").option("rowTag", "").load("s3:///").display()` returned "OK" (i.e., no data).

After downloading and re-uploading the same file to the exact same location, re-running the same command returned a table with data.

After some investigation, I saw that the metadata of the original file, from `aws s3api head-object --bucket <bucket> --key <key>`, was:

    {
        "AcceptRanges": "bytes",
        "Expiration": "expiry-date=\"Sun, 29 Oct 2023 00:00:00 GMT\", rule-id=\"delete_after_10_days\"",
        "LastModified": "2023-10-18T02:16:36+00:00",
        "ContentLength": 24663,
        "ETag": "\"9292bc4c2d7d4c9ed32389ea2de964ce\"",
        "ContentEncoding": "gzip",
        "ContentType": "application/octet-stream",
        "ServerSideEncryption": "AES256",
        "Metadata": {}
    }

and the metadata of the same object AFTER the re-upload:

    {
        "AcceptRanges": "bytes",
        "Expiration": "expiry-date=\"Tue, 31 Oct 2023 00:00:00 GMT\", rule-id=\"delete_after_10_days\"",
        "LastModified": "2023-10-20T14:36:30+00:00",
        "ContentLength": 24958,
        "ETag": "\"ca8f73c5f9dba53eda22913ecc94632a\"",
        "ContentType": "application/x-gzip",
        "ServerSideEncryption": "AES256",
        "Metadata": {}
    }

Please note the `ContentEncoding` and `ContentType` in the first case, and the `ContentType` in the second.
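The difference between the two responses is easier to see side by side. A minimal sketch using the two `head-object` responses above (the `Expiration` and `LastModified` fields are omitted since they change on every upload):

```python
# head-object metadata of the original file
before = {
    "AcceptRanges": "bytes",
    "ContentLength": 24663,
    "ETag": '"9292bc4c2d7d4c9ed32389ea2de964ce"',
    "ContentEncoding": "gzip",
    "ContentType": "application/octet-stream",
    "ServerSideEncryption": "AES256",
}
# head-object metadata after the download/re-upload
after = {
    "AcceptRanges": "bytes",
    "ContentLength": 24958,
    "ETag": '"ca8f73c5f9dba53eda22913ecc94632a"',
    "ContentType": "application/x-gzip",
    "ServerSideEncryption": "AES256",
}

def diff(a, b):
    """Return {key: (old, new)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

# Prints ContentEncoding, ContentLength, ContentType, and ETag
for key, (old, new) in diff(before, after).items():
    print(f"{key}: {old!r} -> {new!r}")
```

Besides the expected size/ETag change, the re-uploaded object has lost its `ContentEncoding` header entirely and gained a different `ContentType`.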

Somehow, Spark fails to read the data if `ContentEncoding` is `gzip` and/or `ContentType` is `application/octet-stream`.
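If the stored metadata is the culprit, one workaround is to rewrite the object's metadata in place with an S3 copy-onto-itself. A minimal sketch, assuming a boto3-style client is passed in; the function name is made up for illustration, and dropping `ContentEncoding` while setting `ContentType` relies on `MetadataDirective="REPLACE"` discarding the source object's headers:

```python
def strip_gzip_encoding(s3, bucket, key, content_type="application/x-gzip"):
    """Copy an S3 object onto itself, replacing its metadata so that
    ContentEncoding is dropped and ContentType is set explicitly.

    `s3` is expected to behave like boto3.client("s3").
    """
    return s3.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        # REPLACE discards the source object's metadata; since ContentEncoding
        # is not passed here, the copy ends up without it.
        MetadataDirective="REPLACE",
        ContentType=content_type,
    )
```

With a real client this would be called as `strip_gzip_encoding(boto3.client("s3"), "<bucket>", "<key>")`; the object's bytes are untouched, only its metadata changes.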

Any ideas?

srowen commented 10 months ago

The content type doesn't matter, but the encoding does. It should decompress `.gz` files for you. I don't know what the error is, though. Check that the files are valid.
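One way to follow that advice is to fetch the object with a tool that does not transparently decode `Content-Encoding`, then check whether the local bytes are really a valid gzip stream. A minimal local sketch, checking the two-byte gzip magic number and then attempting a full decompression:

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"

def is_valid_gzip(path):
    """Return True if `path` starts with the gzip magic bytes and
    decompresses cleanly, False otherwise."""
    with open(path, "rb") as f:
        if f.read(2) != GZIP_MAGIC:
            return False
    try:
        with gzip.open(path, "rb") as f:
            f.read()  # force full decompression
        return True
    except (OSError, EOFError):  # corrupt or truncated stream
        return False

# Demo: a real gzip file passes, a plain XML file with a .gz name does not.
with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, "good.xml.gz")
    with gzip.open(good, "wb") as f:
        f.write(b"<rows><row>1</row></rows>")
    bad = os.path.join(d, "bad.xml.gz")
    with open(bad, "wb") as f:
        f.write(b"<rows><row>1</row></rows>")  # not actually compressed
    print(is_valid_gzip(good), is_valid_gzip(bad))  # True False
```

This also distinguishes the two failure modes in this issue: a file that was transparently decompressed on download will fail the magic-byte check even though its name still ends in `.gz`.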

mahmoud-masmoudi-dev commented 10 months ago

The files are valid. The problem is that spark-xml fails to read the file on S3, but succeeds if I download it and re-upload it to the same location!

srowen commented 10 months ago

It's not related to this library, then.