I'd recommend improving the ParquetLoader to handle larger data by moving from standard arrays to our BigArray class and moving from ints to longs. @TomFinley can give a more informed opinion.
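To show the general idea being suggested (this is just a sketch of a chunked, long-indexed array; it is not ML.NET's actual BigArray API, and the names and block size are placeholders):

```csharp
using System;

// Illustrative only: a minimal chunked "big array" of longs, indexed by long.
// The point is that one huge logical array is split into fixed-size blocks,
// so no single allocation has to hit the CLR's per-array limits.
public sealed class ChunkedLongArray
{
    private const int BlockSizeBits = 20;          // 1M elements per block (arbitrary)
    private const int BlockSize = 1 << BlockSizeBits;
    private const int BlockMask = BlockSize - 1;

    private readonly long[][] _blocks;
    public long Length { get; }

    public ChunkedLongArray(long length)
    {
        Length = length;
        long blockCount = (length + BlockSize - 1) >> BlockSizeBits;
        _blocks = new long[blockCount][];
        for (long b = 0; b < blockCount; b++)
        {
            long remaining = length - (b << BlockSizeBits);
            _blocks[b] = new long[Math.Min(remaining, BlockSize)];
        }
    }

    public long this[long index]
    {
        get => _blocks[index >> BlockSizeBits][index & BlockMask];
        set => _blocks[index >> BlockSizeBits][index & BlockMask] = value;
    }
}
```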
@veikkoeeva : As I'm a bit out of my element, my normal fall-back answer is running perf tests and seeing if GC perf is being impacted.
Stepping back a bit, one larger question for the ParquetLoader is whether we can operate more in a streaming fashion and not store large blocks. The TransposeLoader is accomplishing a similar task to ParquetLoader by loading a column-oriented dataset.
@justinormont Working in chunks was actually what I was thinking too: load and process data in chunks, and release arrays that aren't needed anymore. Depending on how this particular type is used, I would imagine it to have serious fragmentation issues in "production systems", i.e. this code wouldn't be run in isolation.
I'm still learning this and try not to hijack the issues for other discussions... But sometimes I just try to be quick with it. This goes with what I was thinking at https://github.com/dotnet/machinelearning/pull/109#issuecomment-388142435 too. Have a few set types, column- and row-oriented, and operate on them, preferably making the sources and sinks such that one can pass chunks in and out, thereby relieving memory pressure (which is basically a blocker in many production loads). I think the idea is there, but I don't yet see how to pull all this together in my head, and I do wonder if there are too many "specific types" instead of more common types building on .NET idioms and then extended a bit in specific ways (I might be wrong about this). :) I'll take a look at TransposeLoader.
I think the loader isn't keeping block data around. The issue arises when coming up with a sequence of blocks to load while using shuffling to see the data in a different order (important for some stochastic learners). So the problem isn't that we're loading the file into memory and it's too large; this issue is meant to address a situation in which the total number of blocks or the number of rows per block is so large that you cannot allocate an array to describe the order in which you should be serving them.
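To make that failure mode concrete, here is a rough sketch (not the actual ParquetLoader code) of the kind of shuffle-order allocation being discussed; with a standard `int[]` permutation, the number of blocks that can be described is capped by what a single managed array can hold:

```csharp
using System;

// Sketch of the shuffle-order problem, not the real ParquetLoader implementation.
// The permutation deciding the order blocks are served in is itself an array,
// so its length is bounded by the limits of a single managed array.
public static class BlockShuffle
{
    public static int[] GetShuffledBlockOrder(long blockCount, Random rand)
    {
        if (blockCount > int.MaxValue)
            throw new InvalidOperationException(
                "Too many blocks to describe the shuffle order with a standard array.");

        var order = new int[blockCount];       // this allocation is the bottleneck
        for (int i = 0; i < order.Length; i++)
            order[i] = i;

        // Fisher-Yates shuffle over the block indices.
        for (int i = order.Length - 1; i > 0; i--)
        {
            int j = rand.Next(i + 1);
            (order[i], order[j]) = (order[j], order[i]);
        }
        return order;
    }
}
```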
My question would be: did we actually observe this happening, or is it more of a theoretical concern that it might happen, under wildly improbable conditions? I'm just trying to imagine what an actual failure would look like. In the past I've seen Parquet files with block sizes in the hundreds of MBs, with a typical dataset containing perhaps dozens to hundreds of such blocks. Even if somehow we configured the block size at 1 MB, the idea of a Parquet file with billions of blocks in it is a bit hard to swallow... I've seen petabyte-sized datasets, but I don't think I've ever seen an attempt to store such a thing in a single Parquet file. Maybe I'm behind the times and usage has changed enormously; I haven't had occasion to use Spark in about a year or so...
If it's something we think might happen, then @justinormont 's suggestion of BigArray is appropriate. If not, then just throwing an exception would be fine.
@TomFinley I wonder if you're writing to me also. :) It'd be a separate issue to go into the internals of BigArray, but I can elaborate on what I had in mind. I suppose there wouldn't be opposition to adding BigArray to a benchmarking suite and then, if someone comes up with improved functionality, pulling it in. I saw @KrzysztofCwalina is working on adding benchmarks, so maybe this could be one item on the list too.
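For what such a benchmark entry might look like, here's a rough sketch using BenchmarkDotNet (assuming that's the harness being adopted; the measured scenarios and sizes are just placeholders, not the suite's real contents):

```csharp
using System.Buffers;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Rough sketch: compare plain array allocation against renting from a pool,
// with GC/allocation stats captured by the MemoryDiagnoser.
[MemoryDiagnoser]
public class LargeBufferBenchmarks
{
    [Params(1 << 16, 1 << 22)]   // 64K and 4M elements (arbitrary sizes)
    public int Length;

    [Benchmark(Baseline = true)]
    public long[] AllocateArray() => new long[Length];

    [Benchmark]
    public void RentAndReturn()
    {
        long[] buffer = ArrayPool<long>.Shared.Rent(Length);
        ArrayPool<long>.Shared.Return(buffer);
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<LargeBufferBenchmarks>();
}
```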
The reason I wrote that the jagged array might be problematic is this: the loader doesn't need to keep block data around; it's enough that it just creates them. They're potentially large, and there might be many loaders. Say, someone writes a tool to tune parameters and handle data in https://github.com/dotnet/orleans and uses this library (potentially many objects instantiated and removed frequently).
It would be ideal to have benchmarking for potentially critical pieces, but in general terms I think the problem gets worse if one uses more loaders, or if the system otherwise uses large chunks of memory or operations that cause pinning (which production systems with sockets or other native pointers do). One effective way to combat these particular problems is allocating a large slab of memory and then giving out chunks as needed, effectively pooling. Pinning inside these large, contiguous areas of memory isn't a problem since the memory doesn't need to be moved, and similarly fragmentation doesn't become a problem (so this particular problem is also mitigated). Poolers such as ArrayPool allocate and combine chunks efficiently. I.e., it's not just this particular piece being used to load data and then do some ML analysis in isolation, but playing nicely with the overall system where these libraries might be used.
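As a concrete illustration of the pooling idea (this uses the standard System.Buffers API, not ParquetLoader code; the method shape and block sizes are hypothetical):

```csharp
using System.Buffers;

// Illustration: rent short-lived buffers from a shared pool instead of
// allocating a fresh array per block, so buffers get reused across blocks.
public static class BlockProcessor
{
    public static void ProcessBlocks(int blockCount, int rowsPerBlock)
    {
        ArrayPool<long> pool = ArrayPool<long>.Shared;

        for (int b = 0; b < blockCount; b++)
        {
            long[] buffer = pool.Rent(rowsPerBlock);   // may return a larger array
            try
            {
                // ... fill buffer[0..rowsPerBlock) from the block and process it ...
            }
            finally
            {
                pool.Return(buffer);                   // hand the array back for reuse
            }
        }
    }
}
```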
My experience has been -- this is anecdotal and subjective as written, but searching the Internet reveals more accurate accounts -- that code allocating large arrays (or jagged arrays) that are even a bit longer-lived results in heap fragmentation, which has then led to OOMs since there isn't a sufficiently large contiguous block to allocate -- or just performance degradation.
What might that look like, then? Some cases for perusal:
A nice case: https://www.infoq.com/presentations/bing-net-performance, some future discussion: https://github.com/dotnet/coreclr/issues/12426.
<edit: For those interested in the new memory handling classes: ... And ArrayPool has all the tracing infrastructure built in. :)
These exceptions should be caught and surfaced with a suggestion to alter the block size as a fix.
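A minimal sketch of what that could look like (the exception types, method name, and message are illustrative, not the loader's actual behavior):

```csharp
using System;

// Illustrative sketch: wrap the allocation of the block-order array and turn a
// low-level failure into an actionable message; names here are hypothetical.
public static class BlockOrderAllocator
{
    public static int[] AllocateBlockOrder(long blockCount)
    {
        try
        {
            checked
            {
                return new int[(int)blockCount];
            }
        }
        catch (Exception ex) when (ex is OutOfMemoryException || ex is OverflowException)
        {
            throw new InvalidOperationException(
                "The Parquet file has too many blocks to shuffle. " +
                "Consider rewriting the file with a larger block (row group) size.", ex);
        }
    }
}
```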