Azure / usql

U-SQL Examples and Issue Tracking
http://usql.io
MIT License

System.OutOfMemoryException in AvroExtractor #107

Open viblo opened 6 years ago

viblo commented 6 years ago

I have a 320 MB Avro file. When I use the AvroExtractor on it, I get a System.OutOfMemoryException:

   at System.IO.MemoryStream.set_Capacity(Int32 value)
   at System.IO.MemoryStream.EnsureCapacity(Int32 value)
   at System.IO.MemoryStream.Write(Byte[] buffer, Int32 offset, Int32 count)
   at System.IO.Stream.InternalCopyTo(Stream destination, Int32 bufferSize)
   at Microsoft.Analytics.Samples.Formats.ApacheAvro.AvroExtractor.<Extract>d__3.MoveNext()
   at ScopeEngine.SqlIpExtractor<ScopeEngine::CosmosInput,Extract_2_Data0>.GetNextRow(SqlIpExtractor<ScopeEngine::CosmosInput\,Extract_2_Data0>* , Extract_2_Data0* output) in d:\data\ccs\jobs\3b349459-c713-4500-b28b-3ecc540f25b5_v0\sqlmanaged.h:line 1924

The docs at https://docs.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-u-sql-programmability-guide say that the memory limit for a UDO is 0.5 GB. That is consistent with the exception: the extractor copies the whole 320 MB input stream into memory.
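For context, the failure mode is the buffer-everything-first pattern: the extractor copies the full input into a growing in-memory stream before parsing, so peak memory scales with file size (and the buffer's capacity-doubling reallocations can briefly need even more). A minimal sketch of the two patterns, in Python rather than the C# of the actual extractor (function names here are illustrative, not the sample library's API):

```python
import io

def extract_copy_all(input_stream):
    """Mimics the failing pattern: copy the entire input into memory
    before parsing. Peak memory grows with file size, which is how a
    320 MB file can blow past a ~0.5 GB per-UDO budget."""
    buffered = io.BytesIO()
    buffered.write(input_stream.read())  # whole file resident in memory
    return buffered.getvalue()

def extract_streaming(input_stream, chunk_size=64 * 1024):
    """Streaming alternative: memory use is bounded by chunk_size,
    independent of total file size."""
    total = 0
    while True:
        chunk = input_stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # a real extractor would parse the chunk here
    return total

data = b"x" * (1 << 20)  # 1 MiB stand-in for the 320 MB file
assert extract_copy_all(io.BytesIO(data)) == data
assert extract_streaming(io.BytesIO(data)) == len(data)
```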

I don't have full control over the input Avro file. It is created by a Stream Analytics job that reads from an Event Hub partitioned by partition ID and saves the result to blob storage, which the U-SQL job then reads.

How can I work around this problem?

viblo commented 6 years ago

After some coding I managed to make AvroExtractor avoid copying the whole input stream. Instead, I created a wrapper around the UnstructuredReader provided by the U-SQL SDK that keeps a 1 MB internal buffer, which lets the AvroExtractor and the Apache Avro parser seek backwards as needed. Note that if the buffer is too small it may break. Also, this wrapper implements only what the Apache Avro file parser needs, and I have tested it only on my own Avro files together with the unit tests in Microsoft.Analytics.Samples.Formats, so verify that it works for your files before trusting it with production data.
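The actual wrapper is C# against the U-SQL SDK, but the underlying idea, a fixed-size tail buffer over a forward-only stream that permits limited seek-back, can be sketched language-neutrally. The class and method names below are hypothetical, not the fork's API:

```python
import io

class BoundedSeekBackReader:
    """Wraps a forward-only stream, retaining the last `buffer_size`
    bytes read so callers may seek backwards within that window.
    Seeking back past the window raises, mirroring the caveat that a
    too-small buffer can break the parser."""

    def __init__(self, stream, buffer_size=1 << 20):  # 1 MB window
        self._stream = stream
        self._buffer_size = buffer_size
        self._buffer = bytearray()  # tail of the bytes read so far
        self._buffer_start = 0      # absolute offset of _buffer[0]
        self._pos = 0               # current absolute read position

    def read(self, n):
        end = self._buffer_start + len(self._buffer)
        out = bytearray()
        if self._pos < end:  # replay bytes already in the window
            off = self._pos - self._buffer_start
            out += self._buffer[off:off + n]
            self._pos += len(out)
            n -= len(out)
        if n > 0:  # pull fresh bytes and append them to the window
            fresh = self._stream.read(n)
            self._buffer += fresh
            excess = len(self._buffer) - self._buffer_size
            if excess > 0:  # trim the window to its fixed size
                del self._buffer[:excess]
                self._buffer_start += excess
            out += fresh
            self._pos += len(fresh)
        return bytes(out)

    def seek_back(self, n):
        target = self._pos - n
        if target < self._buffer_start:
            raise ValueError("seek-back beyond the retained window")
        self._pos = target

r = BoundedSeekBackReader(io.BytesIO(b"hello world"), buffer_size=1024)
assert r.read(5) == b"hello"
r.seek_back(3)                  # rewind within the window
assert r.read(6) == b"llo wo"   # replays 3 buffered bytes, reads 3 fresh
```

Memory use stays at roughly `buffer_size` regardless of file size, which is why this sidesteps the 0.5 GB UDO limit as long as the parser never seeks back further than the window.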

The update AvroExtractor is available in a fork here: https://github.com/nordicfactory/usql

stevenwilliamsmis commented 5 years ago

viblo, I have been able to use your UnstructuredReaderAvroWrapper class successfully in our project, except for one bug: it cannot load Avro files larger than 2 GB. I wanted to create a branch in your project but was unable to do so. For anyone interested, the fix is to change the _tmpBufferStartPosition variable from an int to a long. We can now load our 2+ GB Avro files in U-SQL.
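The 2 GB boundary is exactly where a signed 32-bit integer (C#'s `int`) runs out: it tops out at 2,147,483,647, so a byte offset past 2 GiB wraps to a negative value, while a 64-bit `long` holds it fine. A small demonstration using Python's ctypes fixed-width integers to emulate the C# types:

```python
import ctypes

TWO_GB = 2 * 1024**3        # 2 GiB = 2147483648 bytes
INT32_MAX = 2**31 - 1       # 2147483647: the largest value a signed
                            # 32-bit int (C#'s `int`) can hold

# A byte offset just past 2 GiB no longer fits in 32 bits; it wraps to
# a negative number, which is why the position-tracking variable broke
# on files larger than 2 GB.
offset = TWO_GB + 1
assert offset > INT32_MAX
wrapped = ctypes.c_int32(offset).value
print(wrapped)  # -2147483647: negative after wraparound

# Stored as a 64-bit value (C#'s `long`), the same offset survives.
assert ctypes.c_int64(offset).value == offset
```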

viblo commented 5 years ago

Thanks for the find. I have made a branch with your proposed fix. Note that I'm away from work on a long holiday, so it will be a while before I can test it properly and merge it into master (I made the commit directly from the GitHub UI, without compiling or running any tests).