Closed jbeynon closed 9 years ago
@etorreborre Can you please take a look at this? It's been a major blocker in upgrading past Scoobi 0.8.3 and is necessary for anyone using Avro and compression.
@jbeynon I have re-triggered the build, to see if it goes green this time. If so, I will merge.
Thanks Mark! And if you could publish 0.9.1 it would be awesome ;-)
Released now (sorry for not having been very reactive with the merge, thanks @markhibberd!).
No worries. I know how easy it is for notifications to get lost in the noise.
Eric and Mark, thank you! About the release: I believe you forgot to publish the 2.10 version. https://oss.sonatype.org/content/repositories/releases/com/nicta/scoobi_2.11/ has 0.9.1, but it is missing from https://oss.sonatype.org/content/repositories/releases/com/nicta/scoobi_2.10/
Alex, the release is now available for 2.10.
Since 0.8.5 I noticed that compression was broken when using Avro but was never bothered enough to look into it. With the latest 0.9.0 release I finally took some time and found the issue. The problem manifests itself as either an OutOfMemoryError or YARN killing tasks for going "beyond memory limits" when using Deflate or Snappy. It isn't specific to Avro; I only noticed it there because Avro silently changes GZip to Deflate.
This stacktrace from explicitly using Snappy is what helped me. Basically the issue is that DataSink.configureCompression is being called for every emit from EnvDoFn. It creates a compressor to test that the settings are working and then promptly discards it. The problem with this is that the Deflate and Snappy compressors create off-heap buffers using ByteBuffer.allocateDirect, and this memory does not get GC'd as you'd expect. So for each emit you get a 64 KB memory leak (for Snappy; I'm not sure of the buffer size for Deflate).
Anyway, I added a simple check in DataSink so that configureCompression only actually acts once, and things seem to work now. I haven't run a full regression but this is a pretty mild change.
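For illustration only, the run-once guard can be sketched roughly as follows. This is a self-contained sketch and not the actual Scoobi change: GuardedSink, probeCompressor, probes and GuardedSinkDemo are hypothetical names, and the 64 KB direct buffer merely stands in for what a Snappy compressor is described as allocating above.

```scala
import java.nio.ByteBuffer
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for a sink; this is NOT Scoobi's DataSink.
class GuardedSink {
  // Flips to true the first time configureCompression does real work.
  private val configured = new AtomicBoolean(false)

  // Counts how often the expensive probe actually ran (for the demo only).
  @volatile var probes: Int = 0

  // Assumed stand-in for "create a compressor to test the settings":
  // allocates one 64 KB off-heap buffer and then drops the reference.
  private def probeCompressor(): Unit = {
    ByteBuffer.allocateDirect(64 * 1024)
    probes += 1
  }

  // Safe to call on every emit: only the first call runs the probe.
  def configureCompression(): Unit =
    if (configured.compareAndSet(false, true)) probeCompressor()
}

object GuardedSinkDemo {
  // Simulates `emits` emit calls and returns how many probes ran.
  def run(emits: Int): Int = {
    val sink = new GuardedSink
    (1 to emits).foreach(_ => sink.configureCompression())
    sink.probes
  }
}
```

Calling GuardedSinkDemo.run(10000) performs a single direct-buffer allocation instead of ten thousand. The AtomicBoolean is one way to keep the check cheap and thread-safe; a plain boolean flag would do if the sink is only ever touched from one thread.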
By the way, this should have been much easier to diagnose, because configureCompression has a debug log in it and you'd see thousands of those messages in the logs. But I couldn't get debug output working in the latest build: no matter what I tried, I either got default logging or no logging.