SciProgCentre / kmath

Kotlin mathematics extensions library

release() for DirectByteBuffer #78

Closed jnorthrup closed 4 years ago

jnorthrup commented 4 years ago

As you come upon DirectByteBuffers, see

https://stackoverflow.com/a/8462690

re: https://github.com/mipt-npm/kmath/blob/e52cfcaafe7eb1149170838d9910e54851ba1a92/kmath-memory/src/jvmMain/kotlin/scientifik/memory/ByteBufferMemory.kt#L88
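
For context, a minimal Kotlin sketch of the technique in that answer; it reflects into internal JDK API, so it is JDK-version dependent (Java 9+ exposes the same effect via sun.misc.Unsafe.invokeCleaner):

```kotlin
import java.nio.ByteBuffer

// Sketch of the forced-release hack from the linked answer. It reflects into
// internal JDK API (the buffer's Cleaner), so behavior varies across JDK
// versions; on Java 9+ the route is sun.misc.Unsafe.invokeCleaner instead.
fun release(buffer: ByteBuffer) {
    if (!buffer.isDirect) return
    val cleanerMethod = buffer.javaClass.getMethod("cleaner")
    cleanerMethod.isAccessible = true
    // cleaner() returns null for slices/duplicates; only the owning buffer has one.
    val cleaner = cleanerMethod.invoke(buffer) ?: return
    cleaner.javaClass.getMethod("clean").invoke(cleaner)
}
```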

altavir commented 4 years ago

Currently we do not use direct buffer allocation. Is there any motivation to use allocateDirect?

jnorthrup commented 4 years ago

More often than not, I examine a solution from the JVM direct ByteBuffer perspective. NIO RandomAccessFile("foo", "r").channel.map(...) will give you a direct ByteBuffer.

Having a release available to close stray slices is a good feature to have in place ahead of time.
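
For example, in Kotlin (the path is invented):

```kotlin
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

// Mapping a file yields a MappedByteBuffer, which reports itself as direct;
// there is no public API to unmap it before GC gets around to it.
fun main() {
    RandomAccessFile("foo", "r").use { file ->
        val buf = file.channel.map(FileChannel.MapMode.READ_ONLY, 0L, file.length())
        println("direct=${buf.isDirect} capacity=${buf.capacity()}")
    }
}
```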

altavir commented 4 years ago

Currently Memory is a closed world; it does not consume externally produced buffers. In the future I would like to move to an Apache Arrow implementation, but it will have its own API. Still, if there are advantages to direct buffers over heap ones, I would like to consider them.

jnorthrup commented 4 years ago

I'm operating with temp tables that are just outside the bounds of composable in-memory operations. I need to run transformations in different passes from one large mmap file input layout to the next, use process-level cleanup to keep things simple, then repeat.

Please do keep the door open.

altavir commented 4 years ago

I can add some extensions to the memory module to allow those cases. If you want this feature, please describe your use case in a little more detail so I can understand how to write an API for it. But perhaps you should look into streaming support instead. I think it is better to load data in small chunks and dispose of them as soon as you do not need them. We could add some streaming processor extensions on top of the new kotlinx-io API to make it convenient to work with files.

altavir commented 4 years ago

I've added some functionality to read a block from a file in 74653e74c6a6974db3cf51b658e289173a98fbb4; it automatically closes the buffer after the read. I am not sure I see how it could be used right now. Memory is used for fast in-memory access to aligned structures, and I am not sure it is a good idea to use it for file access. It is better to use well-developed techniques like Apache Arrow or Parquet.

jnorthrup commented 4 years ago

mmap is demand-paged memory backed by a file. Up to Java's MAXINT bytes may be addressed per mapping, with several mappings against an unbounded extent of storage, accessed as a ByteBuffer in NIO with thread-safe concurrency. This has been standard since Java 1.4. One does not "read" mmap; you just map a file and access memory.
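
For instance, a minimal sketch of covering a file larger than MAXINT with several read-only windows (the 1 GiB window size is arbitrary):

```kotlin
import java.io.RandomAccessFile
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel

// One ByteBuffer addresses at most Int.MAX_VALUE bytes, so a large file is
// covered by several mappings; each window stays valid after the channel
// closes and is paged in by the kernel on demand.
const val WINDOW = 1L shl 30 // 1 GiB per mapping, chosen arbitrarily

fun mapWindows(path: String): List<MappedByteBuffer> =
    RandomAccessFile(path, "r").use { file ->
        val size = file.length()
        (0 until size step WINDOW).map { offset ->
            file.channel.map(FileChannel.MapMode.READ_ONLY, offset, minOf(WINDOW, size - offset))
        }
    }
```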

jnorthrup commented 4 years ago

http://lkml.iu.edu/hypermail/linux/kernel/0004.0/0728.html I hope this gives you some insight.

If your data fits in the Java heap, you don't benefit from file access. Mine exceeds the feasibility of Arrow and other formats, and the intermediaries exceed a 128 GB RAM heap; it's also likely that kmath will be run against intermediary transforms adding heap pressure beyond what I'm already doing, so writing these volumes and mmap'ing them is efficient.

altavir commented 4 years ago

I understand how memory maps work. What I am trying to say is that I can't see a situation where you need the whole data set in memory simultaneously for mathematical calculations. It is not possible to operate on all of it at once anyway. Do you actually need random access to the whole file, or do you still use it block by block? If the latter, it is much better to invest in streaming operations with asynchronous IO (I want to add this functionality anyway). You get not only much smaller memory consumption, but also the ability to run parallel computations on independent blocks. People coming from the Python ecosystem usually miss this opportunity since it is not easy to do in Python, but Kotlin has many more tools for that.
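
For instance, a minimal sketch of that last point (the block source and the summing reducer are placeholders):

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// Independent blocks are reduced in parallel and the partial results combined;
// nothing larger than one block needs to be resident per worker.
suspend fun sumBlocks(blocks: List<DoubleArray>): Double = coroutineScope {
    blocks.map { block ->
        async(Dispatchers.Default) { block.sum() } // one coroutine per block
    }.awaitAll().sum()
}
```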

jnorthrup commented 4 years ago

If it makes you happy to write scatter-gather IO or async, I have no issue with either of those. I think you lack context and understanding of memory-mapped file access, evident in wanting to close the file after you strobe it once; that's not efficient. Linus Torvalds explicitly used the simplest possible language to say "the thing you put in the PR gives you negative benefit" in the link above.

Anyway, here's what you need to know: direct byte buffers can back a memory map, and ByteBuffer slices may hold reference-counted access members in the allocator after your primary hard ref is collectable. That SO article forces a free on the direct heap allocation. We don't actually know the degree to which this affects mmap demand-paged allocations, but for the JVM that's all we get.

altavir commented 4 years ago

I do understand memory-mapped files. The problem is that the JDK does not seem to have a public API to manually release direct byte buffer references. There are some hacks using internal API, but they are not guaranteed to work on all JDK versions and could cause runtime errors on some of them. The basic behavior is to rely on GC for that.

Also, the Memory class is not intended to be used for IO. It is made to optimize away the indirection of mathematical operations on structured values like Complex. An IO module is planned, and I think what you ask for belongs there. It is possible to read a large matrix directly from a file, but it is not possible to do any reasonable operations on it, because that would require a huge amount of memory. What you probably want is a pandas-like table API. It won't be done inside kmath, but will instead be implemented in another project: https://github.com/mipt-npm/dataforge-core/issues/19. I see that you've marked the issue already, so we can raise the priority.

jnorthrup commented 4 years ago

When you write to direct buffers instead of Java objects, you eliminate GC and allocator slow paths.
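
For illustration, a minimal sketch of the difference (the 16-byte complex layout is invented here, not kmath's Memory API):

```kotlin
import java.nio.ByteBuffer

// One off-heap allocation holds a million complex values at fixed 16-byte
// strides: no per-value object headers and nothing for the GC to trace.
fun main() {
    val n = 1_000_000
    val buf = ByteBuffer.allocateDirect(n * 16)
    for (i in 0 until n) {
        buf.putDouble(i * 16, i.toDouble())      // re
        buf.putDouble(i * 16 + 8, -i.toDouble()) // im
    }
    var sumRe = 0.0
    for (i in 0 until n) sumRe += buf.getDouble(i * 16)
    println(sumRe)
}
```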

Over the years, many projects housed in https://github.com/0xcopy have been open-source extractions of commercial engagements, among them a Modelica visualization, a compiler, and a work-queuing platform.

In the longer term I am doing a datacube implementation, but I will need a few things on the fast path like min-max and SIMD summarization, and the scales are well into five-figure-USD RAM hardware.

altavir commented 4 years ago

It seems like this one solves all the problems mentioned above: https://openjdk.java.net/jeps/370. We should probably wait a bit.

jnorthrup commented 4 years ago

I have a graph of a 2.1-million-row dataset for which the required operations are resample, pivot, group-by, sum, add scalar categories, convert categories to one-hot, then min-max. The number of columns produced from those 2.1 million rows will be no fewer than 20k, 30k, or 40k.

The pivot operation goes from 150 megabytes of fwf to 55 GB; the group-by then has access patterns which might be O(n^2). Doing this in FP style means the pivot heap must coexist with the group-by heap in the virtual expanse, so even 128 GB of RAM (a different workstation) hits the severe knee bend of swapping to disk, going from minutes to days. https://github.com/jnorthrup/columnar/commit/d8a9c7e422460cf47cf1028251abace1e0124f71#commitcomment-36458931 (graph)

I'm not in a hurry, and these transforms are not critically sourced from math libs, but this is not going away for me, and my imperative is always to externalize features I don't wish to maintain. In this case, the only comparable library to the one I'm refining is STXXL (stxxl.org) in C++. STXXL is better, no doubt, but I think Kotlin has a performant future with IR.

The graph is the perfect use case for mmap, but I've also tweaked Ubuntu to have "effectively striped" SSD hosting the swap tmpfs (two swap partitions with the same priority) and running zswap with LZO, which can be seen creating a perfectly uniform sine wave of system interrupts up until the point the system has no more headroom to compress the swap, and ONLY THEN starts to unload at about (70-180)*2 megabytes per second compressed. The Kotlin is still single-threaded; I am still waiting for the verdict on a parallel-flows architecture.

Depending on the access pattern, the 55 GB of LZO swap occupies between 32 and 47 gigabytes of reported swap. I know you were curious about that. When you call fallocate, you have a 55 GB tmpfs file occupying 0 bytes, until you start tainting pages.

altavir commented 4 years ago

Let's see: most of the operations you need are in fact not terminal and could be done via regular map-reduce. If terminal operations do not occur early in the analysis chain, it is much simpler to implement normal stream processing than to play around with multi-GB heaps. I did not try to do it with Spark; they say it is not quite easy to launch it on local data, but you can give it a try. I did much more complicated streaming analysis with DataForge and it works exceptionally well. We already have an API for basic streaming processing in kmath, but we do not have (and probably won't have) an API for tables. Still, if you have a streaming reader for your format, it is quite easy to implement basic map-reduce in pure Kotlin using Flow. I can help you create a simple example if you define your data format and the operations you need to perform.
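
For instance, a minimal sketch of that kind of pure-Kotlin Flow map-reduce (the 20-character key and 12-character value columns are invented):

```kotlin
import java.io.File
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Streams fixed-width rows through a Flow and folds them, so only one line
// is resident on the heap at a time; group-by + sum becomes a streaming fold.
fun fwfRows(file: File): Flow<Pair<String, Double>> = flow {
    file.useLines { lines ->
        for (line in lines) {
            val key = line.substring(0, 20).trim()
            val value = line.substring(20, 32).trim().toDouble()
            emit(key to value)
        }
    }
}

fun main() = runBlocking {
    val sums = fwfRows(File("table.fwf"))
        .fold(HashMap<String, Double>()) { acc, (key, value) ->
            acc.merge(key, value, Double::plus)
            acc
        }
    println(sums.entries.take(10))
}
```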

jnorthrup commented 4 years ago

The internal format I am initially targeting is a composable projection of lambdas over an input iterator, first implemented as fwf (fixed-width format, as produced by pandas and by my compatible database streaming adapters upstream).
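
A minimal sketch of such a projection (the column offsets are invented):

```kotlin
// Each column is a lambda from the row text to a typed value; projections
// compose by ordinary function composition.
typealias Column<T> = (CharSequence) -> T

fun text(start: Int, end: Int): Column<String> =
    { row -> row.subSequence(start, end).trim().toString() }

fun <T, R> Column<T>.then(f: (T) -> R): Column<R> = { row -> f(this(row)) }

fun main() {
    val price: Column<Double> = text(8, 20).then(String::toDouble)
    println(price("ACME    00042.50    ")) // 42.5
}
```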

fwf is the first choice due to its random-access guarantees, for which there are Flow collectors at each step. Arrow is less mature than it could be as a column-based iterator, so you really shouldn't hold your breath that a canned library will save the day, other than perhaps CSV iterators via SQLite. Any JDBC source, even SQLite, or Arrow, or a bounded serializer, is going to be a higher iteration expense and likely to break the heap on input. This is too big for msgpack or Parquet or Avro. The JVM options for protobuf are all explicitly heap-based when direct buffers are adequate and cheaper; there are no direct-memory alternatives except the immature Arrow implementation to date, and Arrow access patterns require an Arrow serializer first, at huge cost, to line up each column instead of storage by row.

It appears Flow is pulling in outer scope inefficiently for each composable iterator (FlowCollector instance) in the series, causing things that would seem idempotent in a pure system, or in purer languages, to defeat escape analysis in the Kotlin JVM runtime. The composable operators do in fact appear to execute asynchronously in a diagonal progression of depth in coroutine ordering, however. It is possible that Flow is a bad design choice here and that pure hot lambdas without suspend will be tighter.

Writing pivot data to mmap virtual memory is not one of the composable algorithm operations, but it is a terminal-operation setup for what you describe, creating a new binary fwf intermediary for the composable-iterator stage 2 without the existing 19 gigabytes of flows/lambdas present on the Java heap. Without doing much deeper analysis, what I see in the debugger is that inlining has created a mess of temporaries that don't exist in the ideal composable operator.

In my Java experience, the gap from mmap position to binary boxing to the Java heap and escape analysis only causes heap-pressure overheads and is not inherently a source of leaks. Lambda overhead storing offsets, indexes, and any convenient varargs must be scrutinized more carefully to fit the composable flows in the heap as they play out. In other words, this strategy works, but could be trimmed more tightly to scale faster.

Breaking composability to structure asymptotic map-reduce interface steps is a goal I'm trying to reiterate back into... composable operations.

The pivot example graphed is proof that composable flows and lambdas iterating a map-reduce source back to a symmetrically mmap'd destination is a sufficient strategy, with a modest JVM heap, to terminate in mmap writer steps with a constant heap overhead, the same as what you propose with Spark.

mmap -> iterator to flow decoder -> composable lambdas -> writer-iterator -> mmap ->
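
A compact sketch of that loop for 8-byte double records (file names and the doubling transform are placeholders; a real run past 2 GiB would need the windowed mappings sketched earlier):

```kotlin
import java.io.RandomAccessFile
import java.nio.channels.FileChannel.MapMode
import kotlinx.coroutines.flow.*
import kotlinx.coroutines.runBlocking

// Both ends are mmap'd, so the heap holds only the Flow machinery, never the
// data set itself; the kernel pages the bytes in and out on demand.
fun main() = runBlocking {
    RandomAccessFile("in.bin", "r").use { input ->
        RandomAccessFile("out.bin", "rw").use { output ->
            val n = input.length() / 8
            val src = input.channel.map(MapMode.READ_ONLY, 0L, n * 8)
            val dst = output.channel.map(MapMode.READ_WRITE, 0L, n * 8)

            (0 until n).asFlow()                                   // iterator -> flow decoder
                .map { i -> src.getDouble((i * 8).toInt()) * 2.0 } // composable lambda
                .collect { dst.putDouble(it) }                     // writer-iterator -> mmap
        }
    }
}
```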

For group-by, this plays out with another estimated 2_200_000 × 9_800 input conversion of sparse data to 1200 × 9_800 densely summarized rows (smaller on disk), following a potential 2_200_000 page faults of a ByteBuffer iterating the mmap expanse.

The net outcome I am targeting, while orthogonal to kmath, is as follows:

Iterators of mmap'd (or unmapped) bytes exist as a new design falling out of this decomposition, where the iterator is random-access (for fwf), pre-indexed with an initial EOL scanner, or streaming:

- addressable / forward-only
- input / output
- fixed-width / line-parsed
- binary / parsed
- row / column sequential orientation (e.g. not/is == Arrow)

So far, using Kotlin has made the expressiveness of ordering and remapping columns and assigning lambdas on sets of columns trivial code to be mated to these iterators, as composable flows. This is a source of lambda overhead that appears to be magnified by inlining per row instead of being scoped more efficiently, as with Java 8 lambdas.

Approaching this iteratively means I have had the freedom to examine the moving parts as they emerge. The thing I don't have is the luxury of assuming the datasets fit in heap, and given the availability of mmap, there is no reason to do more than structure ByteBuffer IO operations as I have been doing for a very long time.

jnorthrup commented 4 years ago

You may benefit from this information, albeit I would run it at 2x speed.

https://youtu.be/UswxcAOJKBE

Ignorance of efficient memory tiers is likely a contributing factor in what has bankrupted Hadoop.

It works when a giant like Google hand-tunes a search engine with 40,000 engineers, but making flexible and generic solutions in this space requires forethought, and it makes the parameters required for such things as "running Spark on local data" an uninteresting savings of work.

jnorthrup commented 4 years ago

I hit the reset button on the dataframes as a cursor library for FWF files. The intermediary heap costs went from 55 GB to negligible without suspend, flows, or serializable stdlib objects in play.

I've arrived at the point where I could use some guidance integrating with "maths".

This vector conversion operator https://github.com/jnorthrup/columnar/blob/55f46249ff50c77fc6d97489f91ee244c348684f/src/main/kotlin/columnar/VectorLike.kt#L36

and a pragmatic tuples analog https://github.com/jnorthrup/columnar/blob/de1478bb83ca4e080caa0df09f00252bca90ca44/src/main/kotlin/columnar/Twop13.kt#L31

provide me with a much denser functional composition than what's in the stdlib.

I'm pretty much working with nothing but function references n deep until I wrap the hierarchy with a stdlib Serializable inheritor; the println proves a pure function boundary before and after the reification by Serializable takes place. This feature fell out of accidentally modelling too many FP resources.

That said, I am not handing out direct-heap refs, though that could conceivably be an even lazier cursor option than I have written to date, with almost no additional code for such a driver. The cursors reify primitive objects from string conversion along the (textual) FWF decoding path.

I'm going to assume that whatever mmap design decisions you make will have few implications for my intended short-lived process tasks, and close this issue.