biokoda / actordb

ActorDB distributed SQL database
Mozilla Public License 2.0

Cannot initialize on OSX #10

Open slanska opened 8 years ago

slanska commented 8 years ago

I am trying to follow the setup steps (http://www.actordb.com/docs-getstarted.html#download_run), and when I attempt to execute ./bin/actordb_console -f etc/init.example.sql, I get: Connect/login error: {error, {login_failed, <<"Username and/or password incorrect.">>}}

This project looks very interesting, indeed. What is its current status? I see that the GitHub sources have been updated, but the documentation seems stale and has not been updated in a long time. Is ActorDB used anywhere in production?

SergejJurecko commented 8 years ago

This error means ActorDB has already been initialized. You can run ./bin/actordb stop and delete everything in the data/ directory. Then start it up again with ./bin/actordb start and initialize again using actordb_console.

What's stale about the documentation? Yes, it is used in production. The project is active (the last release was 9 days ago).

We haven't made any real changes to the documentation for a while, because nothing written there has really changed in a while. Right now we are in a bug-fixing phase, with some feature updates.

slanska commented 8 years ago

Thanks a lot for the prompt response! I followed your instructions and was able to run and play a bit with the actordb console. Then I decided to connect to the node via a MySQL client; I am using JetBrains' 0xDBE tool with the following settings: server=127.0.0.1, port=33307, user=myuser, password=mypass, db=music. When I try to connect or run a query I get: "The last packet successfully received from the server was 1 milliseconds ago. The last packet sent successfully to the server was 1 milliseconds ago. java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost." Am I missing something here? Thanks in advance.

SergejJurecko commented 8 years ago

I don't know this tool, and I doubt it would work. ActorDB requires actor ... statements before running SQL.

slanska commented 8 years ago

'Stale documentation' was a bit of an inaccurate term. What I meant is that the current docs do not give a comprehensive and clear understanding of how the system functions. More examples are needed, and more technical information: how data is stored, how map/reduce works and how it differs from known alternatives, how the system works in a multi-cluster environment and how it distributes databases across many clusters, and how replication is configured. I personally would be interested to see an example of accessing ActorDB via a MySQL client.

I think the idea is brilliant, but it probably needs to be expressed more explicitly. I see that the main target for this system is multi-tenant setups, with (semi-)isolated, relatively small databases, and the ability to run map/reduce queries/updates (though, of course, that is not the main strength of the system). I need to learn more and play a bit to see how it goes in practice. Can you share info on what sort of production environment you currently deploy ActorDB in? Number of nodes, database size, write/read load?

SergejJurecko commented 8 years ago

How data is stored in the storage engine is described here: http://blog.biokoda.com/post/133121776825/actordb-how-and-why-we-run-sqlite-on-top-of-lmdb

We do not have map/reduce at the moment.

> how system works in multi-cluster environment

There is no real difference from a user's point of view. A multi-cluster environment just means some actors are in one cluster and some are in another.

> how it distributes databases across many clusters

An actor lives inside a cluster. A hashing function determines which actor lives in which cluster. The hashing namespace is divided across all nodes/clusters.
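
As a rough illustration of this scheme (the names and hash function below are invented for the sketch, not ActorDB's actual implementation), the hash namespace can be split into equal contiguous ranges, one per cluster:

```python
import hashlib

def actor_cluster(actor_name: str, num_clusters: int) -> int:
    """Map an actor name to a cluster via a fixed hash namespace."""
    # Hash the actor name to a point in a 32-bit hash space.
    h = int.from_bytes(hashlib.sha1(actor_name.encode()).digest()[:4], "big")
    # The hash space is divided into equal contiguous ranges, one per cluster.
    return h * num_clusters // 2**32

# The same actor always maps to the same cluster.
assert actor_cluster("user42", 3) == actor_cluster("user42", 3)
```

Because the mapping depends only on the name and the number of clusters, any node can compute where an actor lives without a central lookup.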

> how replication is configured

There is only one replication configuration option: how much history to remember. The default configuration files are full of comments. When nodes are added, ActorDB will automatically start rebalancing.

> Can you share info on what sort of production environment you currently deploy actor db? Number of nodes, database size, write/read load?

I will admit our current production setups are not under heavy load; they are 1/3/5-node single-cluster setups. All tests are done using a multi-cluster configuration, however. This is also an area which will see more extensive testing in the near future.

slanska commented 8 years ago

Thank you for the explanation. A question about connectivity: how can I connect to ActorDB via a MySQL client? I tried another GUI utility (namely, Valentina Studio) and I am getting an "{error, no_actor_defined}" response.

slanska commented 8 years ago

Regarding the overall system architecture, let me summarize what I learned from the docs and your comments. A node is a physical or virtual machine running the ActorDB server, with many individual actor databases (in one huge LMDB file, I guess?). A cluster is a set of nodes (1/3/5...) used for replication (via Raft consensus), so all databases on one node get replicated to all other nodes in the cluster. A multi-cluster system combines many clusters and is controlled by a global actor (which is responsible for allocating nodes to new actors, and for re-allocating and transferring actors when a new node/cluster is added). Correct?

djustinek commented 8 years ago

@slanska: Regarding GUI utilities, there can be a problem when using the MySQL connector, since most of them perform automated queries to resolve the schema in order to display it. These queries are in most cases incompatible with ActorDB.

The reason you are getting that error in Valentina Studio is that it queries ActorDB with the following query to resolve the MySQL schema before it allows writing SQL statements:

SELECT SCHEMA_NAME AS `Database`, DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME <> 'mysql'
  AND SCHEMA_NAME <> 'information_schema'
  AND SCHEMA_NAME <> 'performance_schema'

The same goes for the JetBrains 0xDBE tool; it tries to run this statement:

SELECT 1

Both statements are invalid from the ActorDB side, since they don't contain the ACTOR definition within the statement.

We will take a look if we can support these utilities.

The MySQL client is supported as a connector tool; querying the database through the MySQL command-line tool and existing MySQL drivers should work.

For playing with schema and querying the database you can also use actordb_console tool which is shipped with every ActorDB build and provides all the basic functionality to query the database.

SergejJurecko commented 8 years ago

mysql --host 127.0.0.1 --port 33307 -u myuser -p

actor type1(u) create; insert into tab values (1,'aa',2);
Query OK, 1 row affected (0.01 sec)

actor type1(u) create; select * from tab;
+------+------+------+
| id   | txt  | i    |
+------+------+------+
|    1 | aa   |    2 |
+------+------+------+
1 row in set (0.00 sec)
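
The same actor-statement prefix is required when connecting programmatically through any MySQL driver. A minimal, untested Python sketch (PyMySQL is just one possible driver choice, and the helper name is invented):

```python
def with_actor(actor_type: str, actor_name: str, sql: str, create: bool = True) -> str:
    """Prepend the mandatory 'actor ...' statement to a plain SQL statement."""
    flags = " create" if create else ""
    return f"actor {actor_type}({actor_name}){flags}; {sql}"

# The actual connection might look like this (requires a running ActorDB node):
# import pymysql
# conn = pymysql.connect(host="127.0.0.1", port=33307, user="myuser", password="mypass")
# with conn.cursor() as cur:
#     cur.execute(with_actor("type1", "u", "insert into tab values (1,'aa',2);"))

assert with_actor("type1", "u", "select * from tab;") == "actor type1(u) create; select * from tab;"
```

This is why plain GUI tools fail: they send bare SQL without the actor prefix.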

SergejJurecko commented 8 years ago

Regarding the architecture, you are mostly correct. The global actor just holds state and makes sure that state is spread throughout all nodes; it does not make any decisions. The sequence of events when adding a node:

  1. Store to global state with config
  2. Start up node
  3. New node receives global state
  4. It initializes itself with state
  5. It notices it does not have any shards (or has less shards than other nodes)
  6. It picks a node (the one with the largest chunk of the hash space) and asks it to hand over the upper half of one of its shards.
  7. Copy process starts
  8. Once the copy is complete, it checks again how much shard space it has; if it finds it has less than the other nodes, it goes back to step 6.

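
The rebalancing loop in steps 5-8 can be simulated with a small, self-contained sketch (illustrative only: shards are modeled as abstract hash-space ranges, and the names are made up, not ActorDB internals):

```python
def rebalance(nodes: dict) -> dict:
    """Simulate steps 5-8: a joining node repeatedly takes the upper half
    of the largest shard on the best-stocked node until it owns no less
    hash space than any existing node."""
    nodes = {n: list(shards) for n, shards in nodes.items()}
    new = "new_node"
    nodes[new] = []

    def space(n):
        # Total hash space owned by node n (shards are (lo, hi) ranges).
        return sum(hi - lo for lo, hi in nodes[n])

    while space(new) < min(space(n) for n in nodes if n != new):
        # Step 6: pick the node with the largest chunk of hash space.
        donor = max((n for n in nodes if n != new), key=space)
        lo, hi = max(nodes[donor], key=lambda s: s[1] - s[0])
        # Step 7: split that shard; the donor keeps the lower half,
        # the new node copies the upper half.
        nodes[donor].remove((lo, hi))
        mid = (lo + hi) // 2
        nodes[donor].append((lo, mid))
        nodes[new].append((mid, hi))
        # Step 8: loop back and re-check how much shard space it has.
    return nodes

# Two existing single-shard nodes; the new node ends up with half of one shard.
result = rebalance({"a": [(0, 100)], "b": [(100, 200)]})
```

No hash space is created or lost, only moved, which matches the description: the new node grows purely by splitting existing shards.
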
slanska commented 8 years ago

Sergej, djustinek, thanks for the explanation! I will try to access the ActorDB node via the MySQL driver for node.js (which is what I am using for development at the moment). I have read all the available documentation and blog posts about ActorDB that I could find. A few things still remain unclear to me; I would appreciate your help with the following:

  1. How is the replication factor (cluster size) determined? If, for example, a system has 32 nodes, how can I configure a replication factor of 3? I do not need specific steps for now, just conceptual info. I have some experience with Riak, which has a special setting in the configuration file that determines the replication factor. Also, is it possible to have a different cluster size/replication factor for different databases in the same system?
  2. For a multi-node, multi-cluster system, is there just a single connection address? Will the system redirect requests to the right node, i.e. will it perform load balancing automatically? In that case, is there a chance that this node will become a bottleneck due to a high volume of incoming requests?
  3. I need examples of accessing ActorDB via Thrift. I do not have any experience with Thrift; maybe it is obvious and identical to how Thrift is used with, e.g., Cassandra, but exact sample code would be ideal, and an example for a node.js Thrift client would be even better.
  4. I am not sure how the key-value storage works. Is it just a special kind of ACTOR type, which is spread across ALL nodes? How is it different from a custom type?

slanska commented 8 years ago

Continuing...

  5. I see that you do not have any API/UI for monitoring, health checking or configuration of the cluster? Any plans for that?
  6. Do you perform any stress tests? What would be the considerations for a multi-tenant system with 10K-100+K users (i.e. separate actors), varying in size from a few megabytes to a few gigabytes, installed on a cluster with dozens of nodes, with a replication factor of 2 or 3?

Overall, I like the idea of ActorDB a lot and would like to play with it a bit. I think documentation, examples and tests are areas for improvement.

SergejJurecko commented 8 years ago
  1. Every actor lives within a cluster, and the replication factor is the size of that cluster. If you have 32 nodes, you should split them into clusters of 3 nodes; in that case 32 is not a good number, either 30 or 33 would be. You cannot configure which actors live in which cluster, so you can't configure the cluster size per actor type.
  2. You can connect to any node. If the actor you are querying is not on that node, the request will be proxied. It is recommended to have a pool of connections to multiple nodes.
  3. Install Thrift and run thrift -help to see the list of supported languages, then run thrift --gen mylang adbt.thrift. Get the adbt.thrift file here: https://github.com/biokoda/actordb/blob/master/adbt.thrift. It will generate code that you can use to connect to ActorDB.
  4. The KV data model is an actor per shard. We have ~4 shards per node, which means every node will have 4 actors with the database schema that you set.

If mytype is KV, then the actor statement actor mytype(mykey); means mykey is not an actor, it is just a key. The key is hashed to determine which shard it belongs to, and the SQL statements are then sent to that shard (actor). You can use multiple tables, but they must all have a foreign key to the actors table. Compared to most other KV stores this is a very powerful model, because your value is not just a blob; it can be multiple columns and multiple tables.
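
The key-to-shard routing described above can be sketched roughly as follows (the hash function and constant are illustrative; the "~4 shards per node" figure comes from the comment above):

```python
import hashlib

SHARDS_PER_NODE = 4  # approximate figure mentioned above; illustrative

def shard_for_key(key: str, num_nodes: int) -> int:
    """Hash a KV key to one of the shard actors holding the KV schema."""
    total_shards = num_nodes * SHARDS_PER_NODE
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")
    return h % total_shards

# 'actor mytype(mykey)' routes the SQL to one deterministic shard actor:
assert shard_for_key("mykey", 3) == shard_for_key("mykey", 3)
```

So a KV type behaves like one logical table spread over a fixed set of shard actors, rather than one actor per key.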

  5. This is at the top of our list.
  6. Just like any other DB: SSDs are highly recommended, lots of RAM is important, nothing special otherwise. We do perform stress tests; large-scale testing, like 5, is at the top of our todo list.

slanska commented 8 years ago

Since all the individual DB files are stored in a single LMDB file, how efficient is it under highly concurrent access? Correct me if I am wrong, but I was under the impression that LMDB locks the entire file for writing?

SergejJurecko commented 8 years ago

The dirty secret of most storage engines is that they are single-writer. Reads are generally concurrent, and LMDB has the best read concurrency of any engine, but you can't write with multiple threads on actual hard drives/filesystems and get good performance. Unless you are only appending to a file and nothing else, then you can do it with multiple threads. This holds true for LMDB and LevelDB, and I'm pretty sure for PostgreSQL as well.

It's pointless to worry about it really. Everything has the same problem.

slanska commented 8 years ago

Nice explanation, thanks! Though I am still not sure whether it applies to SSD drives. For HDDs it is clear: access is sequential, as there is only one head, even with the I/O optimizations SCSI drives do. But SSDs are different: they allow random access with zero overhead for "repositioning".

SergejJurecko commented 8 years ago

Maybe. There is still the matter of the file system. You can run multiple storage folders on the same drive and see if it truly doubles performance. ActorDB will spread actors across additional folders.

slanska commented 8 years ago

Do you have any benchmark numbers for the scenario with multiple-database access? I guess that placing all actors into a single LMDB file might introduce unnecessary performance restrictions compared to a scenario where every database lives in its own file. Again, it is just my guess; it needs to be confirmed empirically.

SergejJurecko commented 8 years ago

> Do you have any benchmark numbers for the scenario with multiple database access?

Using multiple LMDBs on a single SSD? No, I do not.

> I guess that placing all actors into a single LMDB file might introduce unnecessary performance restrictions compared to a scenario where every database lives in its own file. Again, it is just my guess; it needs to be confirmed empirically.

Every actor in its own file? So using sqlite3 files directly on disk? That would perform awfully; hard drives are not made for that kind of concurrency. A big advantage of LMDB is also that it does not copy data around, and we can compress pages before writing. Once written, data stays where it is (SQLite checkpoints the WAL file, so everything gets written twice).

SSDs are much better at random access; I'm not so sure they are that much better at concurrency.