agentm / project-m36

Project: M36 Relational Algebra Engine
The Unlicense
876 stars 47 forks

Use streamly to reduce memory usage #7

Open agentm opened 8 years ago

agentm commented 8 years ago

This would allow relations to be processed in constant space when reading/writing from relation files.

Using pipes would also open the door to pipes-concurrency workers.

Consider how this could impact mmap'ing files.
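For illustration only (none of this is Project:M36 code): the difference between materializing a relation and streaming it can be sketched with a strict left fold, which consumes tuples one at a time while retaining only its accumulator:

```haskell
import Data.List (foldl')

-- Illustrative only: a relation modelled as a lazy stream of tuples.
type Tuple = [Int]

-- A strict left fold consumes tuples one at a time, so counting a
-- million-tuple "relation" runs in constant space; forcing the whole
-- list into memory first would not.
countTuples :: [Tuple] -> Int
countTuples = foldl' (\n _ -> n + 1) 0

main :: IO ()
main = print (countTuples (replicate 1000000 [1, 2, 3]))  -- prints 1000000
```

The same shape carries over to pipes or streamly: the consumer drives the producer, so tuples read from a relation file never need to be resident all at once.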

agentm commented 6 years ago

The new streamly package looks interesting because it offers native parallelism, spawning threads as necessary when the current threads are busy.

I jousted with parallelism using Control.Concurrent but put it on the back-burner when I discovered that it is difficult to statically guess in advance how "big" sparks should be and how many sparks it makes sense to generate with parListChunk.

Streamly is new and offers basic heuristics, but it would be great if project-m36 didn't have to guess up-front on the count and size of sparks.
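The up-front guessing problem with sparks can be seen in miniature with `parListChunk` from `Control.Parallel.Strategies` (assuming the `parallel` package), which requires committing to a chunk size statically:

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- The chunk size (1000 here) must be chosen up front: too small and
-- spark-creation overhead dominates; too large and cores sit idle.
-- Streamly's scheduling aims to make this choice dynamically instead.
squaresPar :: [Int] -> [Int]
squaresPar xs = map (\x -> x * x) xs `using` parListChunk 1000 rdeepseq

main :: IO ()
main = print (sum (squaresPar [1 .. 10000]))
```

There is no chunk size that is right for every relation size and machine, which is why pushing that decision into the runtime is attractive.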

YuMingLiao commented 1 year ago

After doing a bit of research on this topic, I would like to leave a heads-up here. Although streamly is faster than lists for most operations, it loses on appendR / foldr (e.g. in :showexpr).

agentm commented 1 year ago

Consider that it's not just about memory usage in the stream-draining step. For a case such as x{a,b}, we would like to stream tuples from disk (a tuple cache) and execute the projection, potentially in constant memory. Currently, Project:M36 naively loads all the tuples from disk into RAM, then executes the projection, which effectively prevents relations from growing beyond RAM size.
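As a hedged sketch (plain lists standing in for a tuple stream; not actual Project:M36 code): a streaming evaluation of the projection x{a,b} only needs memory proportional to the number of *distinct* projected tuples, not the size of the input relation:

```haskell
import Data.List (foldl')
import qualified Data.Set as Set

-- Hypothetical three-attribute tuple (a, b, c).
type Tuple = (Int, Int, Int)

-- Streaming projection x{a,b}: tuples are consumed one at a time, and
-- only the set of distinct (a, b) pairs is retained in memory.
projectAB :: [Tuple] -> Set.Set (Int, Int)
projectAB = foldl' (\acc (a, b, _) -> Set.insert (a, b) acc) Set.empty

main :: IO ()
main = print (Set.size (projectAB [(x `mod` 10, x `mod` 3, x) | x <- [1 .. 100000]]))
  -- prints 30: only 30 distinct (a, b) pairs survive from 100000 tuples
```

Replacing the list with a streamly (or pipes) stream keeps the same fold shape while allowing the tuples to come from disk in batches.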

Certainly, there will be cases where streamly uses more RAM, for example because it can do more work asynchronously and in parallel, but I believe that's a price worth paying.

In short, I think we need some sort of streaming solution here to operate on tuples in batches and in parallel and it appears that streamly is best-of-breed.

Is there an alternative?

YuMingLiao commented 1 year ago

Sure, I think streamly is promising, too. If I load a big relvar, typing and opening the database become slow. I believe streamly could help with that.

I was interested in this because my restrict / filter queries against a big relvar are slow. So far, I suspect that B-tree indexing would speed up the filter operation the most.
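A hedged sketch of the indexing idea, using `Data.Map` (a balanced binary tree rather than a true B-tree, but the same ordered-index principle) to turn a linear restrict into a point lookup; all names are illustrative, not Project:M36 API:

```haskell
import qualified Data.Map.Strict as Map

-- Illustrative tuple: (foodName, per100g).
type Tuple = (String, Double)

-- An ordered index from foodName to its tuples.  Data.Map is a balanced
-- tree, so a point lookup is O(log n) instead of scanning every tuple.
buildIndex :: [Tuple] -> Map.Map String [Tuple]
buildIndex = foldr (\t@(name, _) -> Map.insertWith (++) name [t]) Map.empty

lookupFood :: String -> Map.Map String [Tuple] -> [Tuple]
lookupFood = Map.findWithDefault []

main :: IO ()
main = do
  let idx = buildIndex [("apple", 52), ("banana", 89), ("apple", 48)]
  print (lookupFood "apple" idx)  -- both apple tuples, no full scan
```

The trade-off is the usual one: the index must be built (and, for non-static data, maintained) ahead of time, which is why it pairs well with rarely-updated data.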

I am aware of data independence. Refactoring seems like a big job ...

To try another, simpler way to address my problem, I think caching RelationalExpr queries to Relation values might be a good idea. (I don't need to update the cache, because in my case I don't need to update the food-nutrition data from time to time.)

For example, for a relation { foodName Text, foodClass Text, nutrient Text, per100g Scientific, unit Text } (5×100000) named nutrition, I can cache ...
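The caching idea could be sketched like this (every name here is hypothetical; a real cache would key on the actual RelationalExpr value rather than a rendered string):

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

type Query    = String      -- stand-in for a rendered RelationalExpr
type Relation = [[String]]  -- stand-in for an evaluated Relation

-- Evaluate a query through a cache: a hit returns the stored Relation,
-- a miss runs the real evaluator and stores its result.  Fine for
-- static data; updates would require cache invalidation.
cachedEval :: IORef (Map.Map Query Relation)
           -> (Query -> IO Relation)   -- the underlying evaluator
           -> Query -> IO Relation
cachedEval ref eval q = do
  cache <- readIORef ref
  case Map.lookup q cache of
    Just rel -> pure rel
    Nothing  -> do
      rel <- eval q
      modifyIORef' ref (Map.insert q rel)
      pure rel

main :: IO ()
main = do
  ref   <- newIORef Map.empty
  calls <- newIORef (0 :: Int)
  let eval q = modifyIORef' calls (+ 1) >> pure [[q, "result"]]
  _ <- cachedEval ref eval "nutrition where foodClass = \"fruit\""
  _ <- cachedEval ref eval "nutrition where foodClass = \"fruit\""
  readIORef calls >>= print  -- underlying evaluator ran only once
```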

agentm commented 1 year ago

Indeed, I think we are in complete agreement. Haskell makes refactoring relatively safe and I don't expect the user-exposed APIs to change whatsoever until we are ready to introduce a streaming API over the network.

We do intend to add a relexpr cache, as you surmised, which would cache high-cost expression evaluations.

As the front page states: "Software can almost always be made faster, it can rarely be made more correct." I think we finally nailed down the mathematically correct APIs, so now we just need to go faster.

YuMingLiao commented 1 year ago

Yep, I really appreciate this project! Thanks for the hard work. Relational algebra is much easier to understand than SQL.

Just saying: I realized that assigning a new relvar from a RelationalExpr with an extended field is already a pretty good, customizable cache for static data.