Are you sure the counts really are the same? Hard to say without a reproduction. Compressed or uncompressed?
They are uncompressed. I can try processing the compressed .gz files (that did help when we had the problem before version 0.7.0). I'm fairly sure about the counts - I counted on the files like this: grep -o '\<Release>' file.xml | wc -l. The count here didn't match the result from processing in Databricks. By the way, we are running with option("mode","FAILFAST").
Yeah I'd be interested if the compressed case is different. They are different code paths and both rely a bit on assumptions about the implementation to get it right. The main fix last time was for the uncompressed path indeed. What Hadoop / Spark version?
We are running Databricks in Azure 6.3 (includes Apache Spark 2.4.4, Scala 2.11). Tried running on a compressed file - it gives the correct count on all the dataframes. When running on the same file uncompressed, one of the dataframes consistently gets the wrong count - one less than expected. I'm sorry but I can't share the files :(
Hm, OK. What kind of compression? I do have some tests that check compressed files across block boundaries, but there may well be all kinds of corner cases. Specifically: splittable or unsplittable compression?
It's gzip compression. I don't know whether it's splittable or not, but the compressed files seem to run slower and require more memory on the nodes. Before the fix in 0.7.0 we saw the same thing - the gzipped files would process correctly but not the unzipped ones.
Yeah, gzip-compressed text is not splittable. I do have a test case for that which appears to work, but who knows. The logic for handling this case is even copied from Hadoop.
To clarify, you have one big file? And how many records do you expect vs. see? That might narrow down a guess at what is going on.
If you can, a different compression like bzip2 would probably be better all around (smaller, splittable) and may happen to avoid this.
We process one big file, split by option("rowTag", ...). The counts for processing the file that fails are:
Compressed:
Count of Release: 4825182
Count of SoundRecording: 4825182
Count of ReleaseTransactions: 4825182
Uncompressed:
Count of Release: 4825181
Count of SoundRecording: 4825182
Count of ReleaseTransactions: 4825182
As you can see - one off on Release when processing the uncompressed file. We receive the files gzipped - we could try to unzip and re-compress. We really appreciate your help here!
Do you have any way of telling which Release doesn't seem to be present - is it in the middle or at the end? Maybe not. I am not sure how it happens, but I have some guesses; I'm not sure how to fix it even if those guesses are right.
Certainly you can try recompressing, as you might get better performance. Right now this probably runs as just one task because the file isn't splittable.
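(As an aside for anyone following along: a quick way to see whether Spark actually split the input is the partition count of the resulting DataFrame. A minimal sketch; the path and rowTag are placeholders:)

```scala
// Minimal sketch: check how many partitions Spark created for the input.
// A single partition means one task reads the whole file, i.e. not split.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "Release")
  .load("/data/releases.xml.gz") // placeholder path
println(s"partitions: ${df.rdd.getNumPartitions}")
```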
Sorry - it's been some time since I updated this issue. Tried the bzip2 format, and it seems to behave like gzip, i.e. no errors, but we do not seem to get the benefit of splitting the files.
I looked through the errors we have seen so far; here is a list of where we are missing records:
File1 (total lines = 77580849) - xml missing around line 19433586
File2 (total lines = 277617855) - xml missing around line 69024653
File3 (total lines = 260228464) - xml missing around line 145926138
File4 (total lines = 405442857) - xml missing around line 256136432
Also, this exclusion of elements seems to happen in different components of the xml file. Hope this helps!
OK so it happens on uncompressed files and misses one record. I'm pretty sure this is the weak point: https://github.com/databricks/spark-xml/blob/master/src/main/scala/com/databricks/spark/xml/XmlInputFormat.scala#L152
It's hacky, but works fine in general. I would not be surprised if there's a corner case here where somehow the inferred file position is past the end of a partition when it really isn't, and so a tag gets missed. I can't at the moment think of the case where this breaks down, though. Multi-byte characters? Is this any unusual encoding?
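(To illustrate the multi-byte concern for readers following along - this is a self-contained sketch, not the actual spark-xml code: if a reader tracks its position by counting characters while split boundaries are measured in bytes, multi-byte UTF-8 characters make the inferred position drift.)

```scala
// Self-contained illustration of how an inferred byte position can drift
// (not the actual spark-xml code).
object PositionDrift extends App {
  val text = "é<Release/>" // 'é' is 1 char but 2 bytes in UTF-8
  println(s"characters: ${text.length}")                     // 11
  println(s"UTF-8 bytes: ${text.getBytes("UTF-8").length}")  // 12
  // A reader that adds character counts to a byte offset would believe it
  // is earlier (or, with other bookkeeping bugs, later) in the file than
  // it really is - enough to misjudge where its split ends.
}
```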
Any chance you can try Spark 3? No idea whether it helps. Or the latest 0.9.0 version?
Yeah I get that you can't share the files. If you want to spend some time with this, I can advise about how to inspect what might go wrong, but I imagine it's really hard to debug if it requires executing over huge files. Maybe a matter of logging the exact state when it decides that the file position is beyond the end. I recall it was fairly tricky to narrow this down on a tiny trivial file, which prompted the original fix.
I'm sorry I don't have good ideas now, and I think there probably is a niche but real problem in this hack. If you're able, it seems like compressing the files avoids this code path and might work - is that viable?
Hi, I also hit a similar problem when reading a large xml file with different row tags. For example, for a file with row tags A, B, C: when I generate df_A, df_B and df_C, the count for each dataframe varies from run to run. Sometimes one record is missed, sometimes a small chunk of records.
Thanks again! We are running the latest 0.9.0, and as far as I remember that version fixed some of the errors we got. We are running on UTF-8 files, so we should be OK encoding-wise? It might be a niche problem, but we cannot lose transactions. If you found it tricky to narrow down errors even on small files, I think we will try to run on compressed files … though that somewhat defeats the purpose of running Spark… Apart from debugging this ourselves - is there any other way we can help get this problem solved? Contacting Databricks?
Encoding won't matter - or at least, UTF-8 should be fine. I am also the only person at Databricks who maintains this, informally, so that won't help, I'm afraid. It just comes back to me.
Of course there are workarounds - no compression, or splittable compression, it seems (right? bzip2 worked?). Those would actually be more compatible with Spark as they are splittable.
Knowing that it only affects gzip does help narrow it down, because that would mean it has to do with the non-splittable case. I have a decent theory about why it happens, though it may not be consistent with your findings.
To figure out when to stop reading a split, it looks at how much of the underlying file has been read vs where the split should stop in the file. This is tricky. In the compressed case, I think what happens is that it can only report how much of the compressed file has been read - but the decompressor buffers reads. So it may read more than it has returned. This could cause the logic to prematurely decide there is no more to read.
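(A rough sketch of that failure mode - illustrative only, with invented names; not the actual spark-xml or Hadoop code:)

```scala
// For an unsplittable codec, the reader can only observe its position in
// the *compressed* stream. The decompressor reads ahead into a buffer, so
// that position can reach the split end while decompressed records are
// still sitting in the buffer, waiting to be returned.
object EarlyStopSketch {
  def shouldStop(compressedPos: Long, splitEnd: Long): Boolean =
    compressedPos >= splitEnd // can fire early: buffered data not yet returned
}
```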
That makes good sense, except that I would then expect you to miss a record or two off the end of each file, not in the middle. Does that make sense - is that actually what you observe?
Fixing that isn't hard; it just means more hacking. I can pull together a POC if that theory sounds right and you're willing to run it.
Sorry just pinging you @PeterNmp on this issue too to see if you can test - let me know. If it works, great, I'll make a new release. If not, I'll try to think of something else!
Hi, bzip2 did work, but Spark did not seem to split the file - same as with gzip. All the compressed files work but are not split, so they take a long time to process. It's only when we try to process the uncompressed files that we get the problem - so the problem seems to be in the splitting scenario? We lose records in the middle of the file (see comment above) running on uncompressed files. We would be very happy to test possible fixes, but it seems the current fix is for compressed files and we are not seeing any problems there?
Wait, I thought the problem was with compressed files? See comments starting at https://github.com/databricks/spark-xml/issues/450#issuecomment-637589410. Just want to clarify we're even looking in the right place.
Sorry - I can see how that is very confusing! No - the problem is on the uncompressed files. All compressed files are processed fine.
Here's another theory: https://github.com/databricks/spark-xml/pull/468 I don't know why bzip2 isn't splittable. It should always be. Maybe something about how it is being encoded. Are you leaving a .bz2 suffix on the files?
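(On the suffix point: Hadoop resolves the codec from the file extension, so a missing .bz2 suffix would make the file be read as plain, unsplittable data. A hedged sketch using standard Hadoop APIs to see what codec is inferred; the path is a placeholder:)

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Sketch: see which codec Hadoop infers from the file name and whether it
// is splittable. With an unrecognized suffix, getCodec returns null and
// the file is treated as uncompressed.
val codec = new CompressionCodecFactory(new Configuration())
  .getCodec(new Path("/data/releases.xml.bz2")) // placeholder path
println(Option(codec).map(_.getClass.getName).getOrElse("no codec (plain)"))
println(s"splittable: ${codec.isInstanceOf[SplittableCompressionCodec]}")
```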
Hey srowen. I also hit the same case when using databricks spark-xml 0.9.0 with Glue 1.0 (Spark 2.4.3).
If anyone can reproduce this on 0.9.0 and can run a test - build from the change in #468, or I can whip up an assembly JAR. I'm still not sure what's going on, but that is my best guess so far.
Hi,
Thanks for all the effort put into this library! We still seem to be having this issue, related to #399, with 0.9.0 :( We have large xml files - 10+ GB - with a format like this:
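(The inline sample didn't survive in this copy of the thread; the following is an illustrative reconstruction from the row tags discussed, not the reporter's actual file:)

```xml
<!-- Illustrative only: reconstructed from the row tags named in this
     thread; the real files' root element and contents are not shown -->
<root>
  <SoundRecording>...</SoundRecording>
  <Release>...</Release>
  <ReleaseTransactions>...</ReleaseTransactions>
</root>
```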
When I count the number of SoundRecording/Release/ReleaseTransactions elements in the files, the counts match (as they should), but processing the files like this: spark.read.format("com.databricks.spark.xml").....option("rowTag","SoundRecording") gives different counts of SoundRecording/Release/ReleaseTransactions for some of the files processed.
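(For concreteness, the per-tag counting looks roughly like this - a sketch; the path is a placeholder and the options elided by "....." above are assumptions:)

```scala
// Sketch of the counting workflow described above; the path is a
// placeholder and the FAILFAST option is taken from earlier in the thread.
val tags = Seq("Release", "SoundRecording", "ReleaseTransactions")
for (tag <- tags) {
  val count = spark.read
    .format("com.databricks.spark.xml")
    .option("rowTag", tag)
    .option("mode", "FAILFAST")
    .load("/data/big.xml") // placeholder path
    .count()
  println(s"Count of $tag: $count")
}
```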