actorDB and sqlite FTS features

jeromew commented 8 years ago

Hello,

Your project is a very clever piece of engineering ! I am trying to take a grasp of the pros & cons and the type of projects that fit well within actordb's replication model.

If I understand well, every actor instance is a kind of "monolith" in the sense that this instance should fully fit inside one node (except maybe for KV). So I guess that an actor instance should not grow too big otherwise it will create an unsolvable problem (?)

Can you tell me if FTS features (https://www.sqlite.org/fts3.html) or other sqlite plugins can be expected to work inside actordb ?

SergejJurecko commented 8 years ago

The types of projects we consider a perfect fit: microservices, saas solutions, (mobile) app backends.

Not having too large actors is usually best. Though having 100 gigs inside one should work fine. But your ops/sec is limited to that instance.

If an actor grows large you can create a new copy, then split it in half. Not ideal but not unsolvable either.

We havent enabled fts in the builds. We will enable rtree next version. Fts I'm not sure about yet. Full text search is something usually better solved by elasticsearch. We are of course open to suggestions.

jeromew commented 8 years ago

Thanks for your answers. I can see how actordb fits with the use cases you mention.

by splitting in half an actor when it becomes too big, do you mean "change your app" so that it now knows how to handle 2 actors ?

Regarding FTS, is it something that I could do myself easily simply bu activating fts in the build or do you think that it would involve modifying the code that intercepts the wall ?

I am trying to build a cluster on low end vps : low memory, nodes are not all hosted with the same hosting provider. A lot more reads than writes.

This comes at odds with 2 things you say in the documentation :

the more memory, the better
it is better when nodes are linked with 1Gbps links

Do you have an idea of how a cluster would behave in such harsh conditions ? would it impact only write query, or would it impact read queries ?

Regarding the memory footprint, do you have any numbers on the minimum requirement ? does actordb degrade cleanly (by slowing down?) when memory becomes scarce or should I fear the OOM like I do when I use java solutions (like elastic search)

Sorry for all these questions. ActorDB's choice of constraints is really intriguing as a distributed system (and different in a good way to what I have seen so far)

SergejJurecko commented 8 years ago

Low memory/cpu impacts everything (by slowing it down). We do run a few installations on low memory/cpu conditions and it works fine. ActorDB does check how much memory it has to work with and imposes some soft limits when that is the case. It also uses backpressure if requests start piling up.

We haven't encountered OOM killer yet. I've enabled FTS5 extension in this build. https://s3-eu-west-1.amazonaws.com/biokoda/actordb-0.10.16-preview-linux.tar.gz

You can also build it yourself. You need linux with at least gcc 4.9, latest erlang.

bra-fsn commented 8 years ago

BTW, how many actors can we use, what are the limitations here?

SergejJurecko commented 8 years ago

Every actor requires some memory to run. What we recommend is an actor-per-user type of granularity, unless you have very small amount of data you are saving per user. It depends on the queries or size of data that you have. An actor for up to 10GB (should not be an issue for more), or an actor for up to 1000 queries/s.

For tiny granularity you should use the KV mode. For medium granularity actors, for large granularity postgresql.

bra-fsn commented 8 years ago

I guess the limitations come from:

the actors have to keep the sqlite db open and in sync (file handles, memory, some cpu time)
there is some bookkeeping associated with that (which nodes/clusters have which actors, if I'm using the terminology right). AFAIK this is mostly limited to one actor, so this is where more machines don't help.

But what are the realistic limits here? Can I have 1, 5, 10, 100, 500 millions of actors? I mostly try to find the "this doesn't scale" point. Keeping the filehandles open on a machine for an actor can scale with the number of machines. Keeping which node has which actors in a single sqlite may not.

BTW, in KV mode I see it's possible to query all shards. How well does work with billions of rows? I mostly mean here I have billions of rows in different shards and I want to query a list of ids which have a specific value and this result set can be some 100 millions of rows. Can this be paged (with a cursor for example) efficiently?

SergejJurecko commented 8 years ago

There are no file handles, because every individual sqilte is inside a lmdb file. There is an sqlite connection and per connection cache which requires some memory (can be a few MB for larger actors). The cache is an area we will tackle at some point in the near future. We can get it way down.

Bookkeeping is a KV store that basically holds the name of actors that belong to it.

Scalability of a single machine is more dependent on memory. How many actors can it have open at the same time. If you have an actor-per-user and that user has not done anything in more than a minute, it only takes up storage space.

Cursor paging for huge KV has not been implemented yet.

bra-fsn commented 8 years ago

Thanks. Very interesting product, I pretty much hope you can keep it evolving.

SergejJurecko commented 8 years ago

Well if you have a use case for which we don't have some feature you can tell us. If it makes sense we can prioritize differently.

jeromew commented 8 years ago

I started thinking about how I could use actordb. One of the things for which I don't understand the performance impact :

lookup searches across actors : how to find all users which have a specific attribute (it could be slow but I don't want to freeze the whole system)
statistics across actors (a sort of map/reduce I suppose). Again, I don't want this to have a drastic impact on performance
joins across different actor types (for example: users and products, with many users and many products).

cursor paging for select on ACTOR type1(*) is an interesting scenario. As I understand I can

(option1) launch the select on ACTOR type1(*)
(option2) select all the actors of type1 and then do the select for each actor (a sort of app-level cursor)

If there are many actors, It seems to me that in option1, the actordb is going to slow down ; and in option2 I will need to capture all the actor names which can grow memory-wise. Maybe (but I am not sure) there is another option with paging.

kulshekhar commented 8 years ago

I may be wrong but from what I gather, ActorDB is ideally suited for use cases where there is minimal interaction between different actors of the same type. I'm not sure if ActorDB would be a good fit in this use case, @jeromew

Though I too would be interested in knowing how ActorDB handles this.

SergejJurecko commented 8 years ago

The way to go about doing it is to extract those attributes into a new actor type or maybe into a KV type. It can be only one actor. There is nothing wrong with having an actor type and only one actor for that type.
Same with statistics. Whenever you find yourself needing a actor type(*) query, think about splitting something.
A map/reduce type framework is in the works.

jeromew commented 8 years ago

OK I understand the idea of creating a "materialized view" actor, which is an aggregate view of other actors. Once it is materialized, it is fast to query it ;

But is there an easy way to "extract those attributes into a new actor" once actors are already created ? I maybe have missed something in the actorDB DSL that does that.

SergejJurecko commented 8 years ago

Creating a new actor with the query results of an old actor is something that will be possible in the map/reduce framework. Right now it's something you must do from the client side.

isoos commented 8 years ago

Is the FTS module enabled in recent builds? Any experience with its performance and scalability?

SergejJurecko commented 8 years ago

It's enabled. But that means FTS is per-actor. So you can have search within an actor. We have not done any performance measurements.

isoos commented 8 years ago

Per-actor looks fine - assuming that we can select across all actors if needed (?).

SergejJurecko commented 8 years ago

Yes but querying across all actors is not a good idea, unless the numbers are relatively small.

Generally if you find yourself creating queries across all your actors that means you should extract some piece of data into a new actor type.

On Aug 23, 2016 2:46 PM, "István Soós" notifications@github.com wrote:

Per-actor looks fine - assuming that we can select across all actors if needed (?).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/biokoda/actordb/issues/24#issuecomment-241718787, or mute the thread https://github.com/notifications/unsubscribe-auth/AABGHjMWeu7BSZqdhqr4agigzLHWKZF6ks5qiuu0gaJpZM4Hs8cF .

danielcrenna commented 7 years ago

@SergejJurecko I just wanted to confirm the comment regarding FTS, is it enabled in the sense that it's built-in, or do we need to enable extensions in etc/app.config and put our FTS library in the appropriate folder?

SergejJurecko commented 7 years ago

SQlite is compiled with DSQLITE_ENABLE_FTS5 so it must have it.

biokoda / actordb

actorDB and sqlite FTS features #24