SiriDB / siridb-admin

SiriDB - Tool for managing SiriDB databases
MIT License

better doc request: pool, replica, pool-id #8

Open alekibango opened 6 years ago

alekibango commented 6 years ago

I am trying to understand how SiriDB works. What exactly are a replica and a pool, and how do they work? What exactly is the relation between server, pool, replica, database, time series, and shards? How do I check (list) how I have configured those relations?

PS: I just found the blog post http://siridb.net/blog/how-we-store-time-series-in-siridb/ which is interesting, and I really like the explanation of files and indexes, but the blog post does not go far enough to resolve this issue.

Idea: a simple (UML) diagram explaining the relations between server, pool, replica, database, time series, and shards might be very helpful.

Also, the help text is wrong here (the description of new-replica says "new pool" instead of "new replica"):

$ siridb-admin_1.1.3_linux_amd64.bin  --help  2>&1 |tail -n 7
  new-pool --db-name=DB-NAME --db-user=DB-USER --db-password=DB-PASSWORD --db-server=DB-SERVER [<flags>]
    Expand a SiriDB database with a new pool.

  new-replica --db-name=DB-NAME --db-user=DB-USER --db-password=DB-PASSWORD --db-server=DB-SERVER --pool=POOL [<flags>]
    Expand a SiriDB database with a new pool.

alekibango commented 6 years ago

Answering my own question. This should be improved and placed somewhere in the docs:

Time Series (TS) = named, ordered set of [time, value] points.
Server = resource we can connect to, for example to insert or query data.
Pool = group of servers. Data written to one server will be replicated to all servers in the pool.

Time series are always (automatically) assigned to a pool. They rarely move to another pool (only when a new pool is added).

Shard = continuous, time-limited part of a TS (from - to). All shards of one TS are stored in the same pool (which pool is decided by the time series name).
Replica = the other server(s?) in a pool.
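The relations listed above can be sketched as a small data model. This is only an illustration of the description in this comment, not SiriDB's implementation: the class names, the CRC32-modulo pool selection, and the fixed shard duration are all assumptions made for the example. The only points taken from the thread are that a pool is a group of servers, that the pool is decided by the time series name, and that a shard covers a limited time range.

```python
import zlib
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Pool:
    pool_id: int
    servers: List[str] = field(default_factory=list)  # servers holding the same data


@dataclass
class Database:
    name: str
    pools: List[Pool] = field(default_factory=list)

    def pool_for(self, series_name: str) -> Pool:
        # ASSUMPTION: CRC32 modulo pool count is a stand-in for whatever
        # name-based lookup SiriDB really uses.
        index = zlib.crc32(series_name.encode("utf-8")) % len(self.pools)
        return self.pools[index]


def shard_window(timestamp: int, duration: int) -> Tuple[int, int]:
    # ASSUMPTION: shards have a fixed duration; returns the [from, to) range
    # of the shard that would contain this timestamp.
    start = timestamp - (timestamp % duration)
    return start, start + duration


# Hypothetical database with two pools of two servers each.
db = Database("dbtest", [
    Pool(0, ["server0", "server1"]),
    Pool(1, ["server2", "server3"]),
])
pool = db.pool_for("cpu.load")
```

Because the pool is derived from the name alone, every point (and therefore every shard) of the same series lands in the same pool, whichever server receives the insert.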

More questions: Will writing data to any server return success even if it is not yet replicated anywhere? Is it possible to configure the number of successful replications required before returning success? (1, 2, ..., N, most, 40%, 67%, ...)

joente commented 6 years ago

Thank you for the feedback. We will soon update the documentation and try to clarify some concepts.

Your last questions:

Will writing data to any server return success even if it is not yet replicated anywhere?

The server responds with success once the data is at least saved in a replication (queue) file, so yes, it can return success before the data is actually replicated.
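A minimal sketch of this acknowledge-then-replicate behaviour, assuming an in-memory queue as a stand-in for the replication (queue) file. The Server class and its methods are hypothetical names for the example and do not correspond to SiriDB internals.

```python
from collections import deque


class Server:
    def __init__(self, name):
        self.name = name
        self.points = []                   # points stored on this server
        self.replication_queue = deque()   # stand-in for the replication (queue) file
        self.replica = None                # the other server in the pool, if any

    def insert(self, point):
        self.points.append(point)
        if self.replica is not None:
            # The point is queued for the replica, not yet delivered.
            self.replication_queue.append(point)
        # Success is returned here, before the replica has the data.
        return "success"

    def flush_replication(self):
        # Runs later (in the background in a real system) and drains the queue.
        while self.replication_queue:
            self.replica.points.append(self.replication_queue.popleft())


server0 = Server("server0")
server1 = Server("server1")
server0.replica = server1

status = server0.insert((1500000000, 42.0))
pending_before_flush = len(server0.replication_queue)
replica_before_flush = list(server1.points)
server0.flush_replication()
```

Between insert and flush_replication the client already holds a success response while the replica is still empty, which is the window the question asks about.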

Is it possible to configure the number of successful replications before returning success? (1, 2, ..., N, most, 40%, 67%, ...)

No, this is not possible. The current version of SiriDB supports only two servers in each pool.

seybi87 commented 4 years ago

Hi guys,

I am currently looking into horizontally scalable time-series DBMS and so I came across the very interesting SiriDB.

Yet I have similar problems understanding the distribution concepts of SiriDB based on the existing documentation.

In particular, I have the following questions:

Thanks a lot in advance for your help!

joente commented 4 years ago

Hi @seybi87,

Here are a few benchmarks we did to compare some cloud solutions. The results show that insert performance scales almost linearly with the number of pools.

Azure (NetApp Files, SiriDB 1 Pool): loaded 218,160,000 metrics in 757.500 sec with 2 workers (mean rate 288,000 metrics/sec)

Azure + ONTAP Cloud (SiriDB 1 Pool): loaded 218,160,000 metrics in 877.671 sec with 2 workers (mean rate 248,566 metrics/sec)

Google Cloud (SSD, SiriDB 1 Pool): loaded 218,160,000 metrics in 659.427 sec with 2 workers (mean rate 330,832 metrics/sec)

Google Cloud (HDD, SiriDB 1 Pool): loaded 218,160,000 metrics in 663.925 sec with 2 workers (mean rate 328,591 metrics/sec)

Google Cloud (HDD, SiriDB 2 Pools): loaded 218,160,000 metrics in 332.087 sec with 2 workers (mean rate 656,936 metrics/sec)

Google Cloud (HDD, SiriDB 5 Pools (bottleneck, one insert host)): loaded 218,160,000 metrics in 264.688 sec with 5 workers (mean rate 824,214 metrics/sec)

Google Cloud (HDD, SiriDB 5 Pools (split over 2 insert hosts)): loaded 436,320,000 metrics in 281.921 sec with 10 workers (mean rate 1,547,667 metrics/sec)
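As a quick check on the scaling claim, the mean rates can be recomputed from the metric counts and durations listed above. The sketch below uses only the three Google Cloud HDD runs and takes the 1-pool run as the baseline; the variable and function names are made up for the example.

```python
# (label, metrics loaded, seconds) taken from the Google Cloud HDD runs above.
runs = [
    ("1 pool", 218_160_000, 663.925),
    ("2 pools", 218_160_000, 332.087),
    ("5 pools, 2 insert hosts", 436_320_000, 281.921),
]


def mean_rate(metrics, seconds):
    # Mean insert rate in metrics per second.
    return metrics / seconds


baseline = mean_rate(runs[0][1], runs[0][2])
speedups = {}
for label, metrics, seconds in runs:
    rate = mean_rate(metrics, seconds)
    speedups[label] = rate / baseline
    print(f"{label}: {rate:,.0f} metrics/sec ({speedups[label]:.2f}x of 1 pool)")
```

Two pools come out at about 2.0 times the single-pool rate and five pools at about 4.7 times, which matches the "almost linear" description.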

seybi87 commented 4 years ago

Hi @joente, thanks a lot for the quick response and the detailed information.

I have just two follow-up questions based on the information you provided:

joente commented 4 years ago

@seybi87,

No, that's not exactly right. Once you add a replica, both servers in the pool receive the replica role and data will be synchronized across both servers. Both servers are active, and one of them is chosen at random to handle each query request.

Maybe an example is easier to understand. Suppose you have four SiriDB servers, then both of these configurations are possible:

Four servers, four pools (no redundancy and no replica roles)

SERVER0  -  POOL0  (database is created on this server)
SERVER1  -  POOL1  (added as a new pool)
SERVER2  -  POOL2  (added as a new pool)
SERVER3  -  POOL3  (added as a new pool)

Or, four servers, two pools (redundancy in each pool)

SERVER0  -  POOL0  (database is created on this server)
SERVER1  -  POOL0  (added as a replica for pool 0)
SERVER2  -  POOL1  (added as a new pool)
SERVER3  -  POOL1  (added as a replica for pool 1)
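The two layouts above can be written down as plain server-to-pool mappings and checked for redundancy. This is only a sketch: the dictionary names and the summarize helper are hypothetical, and the limit of two servers per pool is taken from the earlier answer in this thread.

```python
from collections import defaultdict

# Server name -> pool id, copied from the two example layouts above.
layout_four_pools = {"SERVER0": 0, "SERVER1": 1, "SERVER2": 2, "SERVER3": 3}
layout_two_pools = {"SERVER0": 0, "SERVER1": 0, "SERVER2": 1, "SERVER3": 1}


def summarize(layout):
    # Group servers by pool and flag pools that have a replica.
    pools = defaultdict(list)
    for server, pool_id in layout.items():
        pools[pool_id].append(server)
    summary = {}
    for pool_id, servers in pools.items():
        if len(servers) > 2:
            # Per the thread, a pool currently holds at most two servers.
            raise ValueError(f"pool {pool_id} has more than two servers")
        summary[pool_id] = {"servers": servers, "redundant": len(servers) == 2}
    return summary


for name, layout in (("four pools", layout_four_pools), ("two pools", layout_two_pools)):
    print(name, summarize(layout))
```

The first layout spreads the series over four pools with no redundancy; the second halves the number of pools but every pool can lose one server without losing data.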

Note that adding a server (either as a replica or as a new pool) has no impact on the running database. SiriDB extends in the background, and the progress can be viewed with show reindex_progress or show sync_progress; see https://docs.siridb.net/database/status_information/.

For the benchmarks we used TSBS.

seybi87 commented 4 years ago

@joente thanks a lot, that clarified all my questions!