Are you sure the counts really are the same? Hard to say without a reproduction. Compressed or uncompressed?
They are uncompressed. I can try processing the compressed .gz files (that did help when we had the problem before version 0.7.0). I'm fairly sure about the counts - I counted on the files like this: grep -o '\<Release>' file.xml | wc -l. The count here didn't match the result from processing in Databricks. By the way, we are running with option("mode","FAILFAST").
Yeah I'd be interested if the compressed case is different. They are different code paths and both rely a bit on assumptions about the implementation to get it right. The main fix last time was for the uncompressed path indeed. What Hadoop / Spark version?
We are running Databricks in Azure 6.3 (includes Apache Spark 2.4.4, Scala 2.11). Tried running on a compressed file - it gives the correct count on all the dataframes. When running on the same file uncompressed, one of the dataframes consistently gets the wrong count - one less than expected. I'm sorry but I can't share the files :(
Hm, OK. What kind of compression? I do have some tests that check compressed files across block boundaries, but there may well be all kinds of corner cases. Specifically: splittable or unsplittable compression?
It's gzip compression. I don't know whether it's splittable or not, but the compressed files seem to run slower and require more memory on the nodes. Before the fix in 0.7.0 we saw the same thing - the gzipped files would process correctly but not the unzipped ones.
Yeah, gzip-compressed text is not splittable. I do have a test case for that which appears to work, but who knows. The logic for handling this case is even copied from Hadoop.
To clarify, you have one big file? And how many records do you expect vs. see? That might narrow down a guess at what is going on.
If you can, a different compression like bzip2 would probably be better all around (smaller, splittable) and may happen to avoid this.
We process one big file, split by option("rowTag", ...). The counts for processing the file that fails are:
Compressed:
Count of Release: 4825182
Count of SoundRecording: 4825182
Count of ReleaseTransactions: 4825182
Uncompressed:
Count of Release: 4825181
Count of SoundRecording: 4825182
Count of ReleaseTransactions: 4825182
As you can see - one off on Release when processing the uncompressed file. We receive the files gzipped - we could try to unzip and re-compress. We really appreciate your help here!
Do you have any way of telling which Release doesn't seem to be present - is it in the middle or at the end? Maybe not. I am not sure how it happens, but I have some guesses; I'm not sure how to fix it even if those guesses are right.
Certainly you can try recompressing, as you might get better performance. Right now this probably runs as just one task because the file isn't splittable.
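(As an aside for anyone following along: a quick way to see whether Spark actually split the input is the partition count of the resulting DataFrame. A minimal sketch; the path and rowTag are placeholders:)

```scala
// Minimal sketch: check how many partitions Spark created for the input.
// A single partition means one task reads the whole file, i.e. not split.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Release")
  .load("/data/releases.xml.gz") // placeholder path
println(s"partitions: ${df.rdd.getNumPartitions}")
```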
Sorry - it's been some time since I updated this issue. Tried the bzip2 format, and it seems to behave like gzip, i.e. no errors, but we do not seem to get the benefit of splitting the files.
I looked through the errors we have seen so far; here is a list of where we are missing records:
File1 (total lines = 77580849) - xml missing around line 19433586
File2 (total lines = 277617855) - xml missing around line 69024653
File3 (total lines = 260228464) - xml missing around line 145926138
File4 (total lines = 405442857) - xml missing around line 256136432
Also, this exclusion of elements seems to happen in different components of the xml file. Hope this helps!
OK so it happens on uncompressed files and misses one record. I'm pretty sure this is the weak point: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala#L152
It's hacky, but works fine in general. I would not be surprised if there's a corner case here where somehow the inferred file position is past the end of a partition when it really isn't, and so a tag gets missed. I can't at the moment think of the case where this breaks down, though. Multi-byte characters? Is this any unusual encoding?
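(To illustrate the multi-byte concern for readers following along - this is a self-contained sketch, not the actual spark-xml code: if a reader tracks its position by counting characters while split boundaries are measured in bytes, multi-byte UTF-8 characters make the inferred position drift.)

```scala
// Self-contained illustration of how an inferred byte position can drift
// (not the actual spark-xml code).
object PositionDrift extends App {
  val text = "é<Release/>" // 'é' is 1 char but 2 bytes in UTF-8
  println(s"characters: ${text.length}")                     // 11
  println(s"UTF-8 bytes: ${text.getBytes("UTF-8").length}")  // 12
  // A reader that adds character counts to a byte offset would believe it
  // is earlier (or, with other bookkeeping bugs, later) in the file than
  // it really is - enough to misjudge where its split ends.
}
```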
Any chance you can try Spark 3? No idea whether it helps. Or the latest 0.9.0 version?
Yeah I get that you can't share the files. If you want to spend some time with this, I can advise about how to inspect what might go wrong, but I imagine it's really hard to debug if it requires executing over huge files. Maybe a matter of logging the exact state when it decides that the file position is beyond the end. I recall it was fairly tricky to narrow this down on a tiny trivial file, which prompted the original fix.
I'm sorry I don't have good ideas now, and I think there probably is a niche but real problem in this hack. If you're able, it seems like compressing the files avoids this code path and might work - is that viable?
Hi, I also hit a similar problem when reading a large xml file with different row tags. For example, for a file with row tags A, B, C: when I generate df_A, df_B and df_C, the count for each dataframe varies from run to run. Sometimes one record is missed, sometimes a small chunk of records.
Thanks again! We are running the latest 0.9.0, and as far as I remember that version fixed some of the errors we got. We are running on UTF-8 files, so we should be OK encoding-wise? It might be a niche problem, but we cannot lose transactions. If you found it tricky to narrow down errors even on small files, I think we will try to run on compressed files … though that somewhat defeats the purpose of running Spark… Apart from debugging this ourselves - is there any other way we can help get this problem solved? Contacting Databricks?
Encoding won't matter - or at least, UTF-8 should be fine. I am also the only person at Databricks who maintains this, informally, so that won't help, I'm afraid. It just comes back to me.
Of course there are workarounds - no compression, or splittable compression, it seems (right? bzip2 worked?). Those would actually be more compatible with Spark as they are splittable.
Knowing that it only affects gzip does help narrow it down, because that would mean it has to do with the non-splittable case. I have a decent theory about why it happens, though it may not be consistent with your findings.
To figure out when to stop reading a split, it looks at how much of the underlying file has been read vs where the split should stop in the file. This is tricky. In the compressed case, I think what happens is that it can only report how much of the compressed file has been read - but the decompressor buffers reads. So it may read more than it has returned. This could cause the logic to prematurely decide there is no more to read.
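(A rough sketch of that failure mode - illustrative only, with invented names; not the actual spark-xml or Hadoop code:)

```scala
// For an unsplittable codec, the reader can only observe its position in
// the *compressed* stream. The decompressor reads ahead into a buffer, so
// that position can reach the split end while decompressed records are
// still sitting in the buffer, waiting to be returned.
object EarlyStopSketch {
  def shouldStop(compressedPos: Long, splitEnd: Long): Boolean =
    compressedPos >= splitEnd // can fire early: buffered data not yet returned
}
```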
That makes good sense, except that I would then expect you to miss a record or two off the end of each file, not in the middle. Does that make sense - is that actually what you observe?
Fixing that isn't hard; it just means more hacking. I can pull together a POC if that theory sounds right and you're willing to run it.
Sorry just pinging you @PeterNmp on this issue too to see if you can test - let me know. If it works, great, I'll make a new release. If not, I'll try to think of something else!
Hi, bzip2 did work, but Spark did not seem to split the file - same as with gzip. All the compressed files work but are not split, so they take a long time to process. It's only when we try to process the uncompressed files that we get the problem - so the problem seems to be in the splitting scenario? We lose records in the middle of the file (see comment above) running on uncompressed files. We would be very happy to test possible fixes, but it seems the current fix is for compressed files and we are not seeing any problems there?
Wait, I thought the problem was with compressed files? See comments starting at https://github.com/databricks/spark-xml/issues/450#issuecomment-637589410. Just want to clarify we're even looking in the right place.
Sorry - I can see how that is very confusing! No - the problem is on the uncompressed files. All compressed files are processed fine.
Here's another theory: https://github.com/databricks/spark-xml/pull/468 I don't know why bzip2 isn't splittable. It should always be. Maybe something about how it is being encoded. Are you leaving a .bz2 suffix on the files?
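(On the suffix point: Hadoop resolves the codec from the file extension, so a missing .bz2 suffix would make the file be read as plain, unsplittable data. A hedged sketch using standard Hadoop APIs to see what codec is inferred; the path is a placeholder:)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Sketch: see which codec Hadoop infers from the file name and whether it
// is splittable. With an unrecognized suffix, getCodec returns null and
// the file is treated as uncompressed.
val codec = new CompressionCodecFactory(new Configuration())
  .getCodec(new Path("/data/releases.xml.bz2")) // placeholder path
println(Option(codec).map(_.getClass.getName).getOrElse("no codec (plain)"))
println(s"splittable: ${codec.isInstanceOf[SplittableCompressionCodec]}")
```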
Hey srowen. I also hit the same case when using databricks spark-xml 0.9.0 with Glue 1.0 (Spark 2.4.3).
If anyone can reproduce this on 0.9.0 and can run a test - build from the change in #468, or I can whip up an assembly JAR. I'm still not sure what's going on, but that is my best guess so far.
Hi,
Thanks for all the effort put into this library! We still seem to be having this issue, related to #399, with 0.9.0 :( We have large xml files - 10+ GB - with a format like this:
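(The inline sample didn't survive in this copy of the thread; the following is an illustrative reconstruction from the row tags discussed, not the reporter's actual file:)

```xml
<!-- Illustrative only: reconstructed from the row tags named in this
     thread; the real files' root element and contents are not shown -->
<root>
  <SoundRecording>...</SoundRecording>
  <Release>...</Release>
  <ReleaseTransactions>...</ReleaseTransactions>
</root>
```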
When I count the number of SoundRecording/Release/ReleaseTransactions elements in the files, the counts match (as they should), but processing the files like this: spark.read.format("com.databricks.spark.xml").....option("rowTag","SoundRecording") gives different counts of SoundRecording/Release/ReleaseTransactions for some of the files processed.
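(For concreteness, the per-tag counting looks roughly like this - a sketch; the path is a placeholder and the options elided by "....." above are assumptions:)

```scala
// Sketch of the counting workflow described above; the path is a
// placeholder and the FAILFAST option is taken from earlier in the thread.
val tags = Seq("Release", "SoundRecording", "ReleaseTransactions")
for (tag <- tags) {
  val count = spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", tag)
    .option("mode", "FAILFAST")
    .load("/data/big.xml") // placeholder path
    .count()
  println(s"Count of $tag: $count")
}
```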