fabric8-analytics / gremlin-docker


Split Titan and Gremlin - impossible to scale #10

Closed fridex closed 7 years ago

fridex commented 7 years ago

The current stack looks like the following:

DynamoDB - Titan - Gremlin

We would like to split Titan and Gremlin into two standalone containers, which would allow us to scale Gremlin independently. Note that creating multiple Titan instances talking to the same DynamoDB tables results in data inconsistencies and data corruption, as stated in "Titan Limitations":

Running multiple Titan instances on one machine backed by the same storage backend (distributed or local) requires that each of these instances has a unique configuration for storage.machine-id-appendix. Otherwise, these instances might overwrite each other leading to data corruption. See Graph Configuration for more information.

Source: http://titan.thinkaurelius.com/wikidoc/0.3.1/Titan-Limitations.html

tuxdna commented 7 years ago

Totally agree with having only one TitanGraph instance.

msrb commented 7 years ago

My understanding (which may be completely wrong) is that Gremlin is just a thin wrapper sitting in front of Titan. So, ideally, we would also like to be able to scale Titan, as Titan is the component that does all the heavy lifting.

Can we scale Titan as well?

fridex commented 7 years ago

My understanding (which may be completely wrong) is that Gremlin is just a thin wrapper sitting in front of Titan. So, ideally, we would also like to be able to scale Titan, as Titan is the component that does all the heavy lifting.

If I understand the overall architecture correctly, Gremlin also performs query strategy evaluation and path traversal computations to optimize queries. So we would probably save something by scaling Gremlin (besides the possible connection reuse?).

Can we scale Titan as well?

I don't think this will be doable - based on the configuration section, it is not possible to let two or more Titan instances talk to the same graph; see storage.machine-id-appendix [1], as quoted above.

[1] http://titan.thinkaurelius.com/wikidoc/0.3.1/Graph-Configuration.html
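
For illustration, this is what the cited requirement implies in practice; a hypothetical pair of configuration excerpts (the property name comes from the quoted docs, the file names and values are made up):

```properties
# titan-a.properties (hypothetical) -- first co-located instance
storage.machine-id-appendix=1

# titan-b.properties (hypothetical) -- second instance on the same backend
storage.machine-id-appendix=2
```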

krishnapaparaju commented 7 years ago

Do we have the current issues documented (for ingestion into the Gremlin layer)?

krishnapaparaju commented 7 years ago

What kind of issues are we running into when multiple workers / threads / containers run in parallel and ingest into Gremlin? Any plan of action can be based on this.

fridex commented 7 years ago

What kind of issues are we running into when multiple workers / threads / containers run in parallel and ingest into Gremlin?

We wanted to scale Gremlin (right now we can only scale Gremlin+Titan together, as they are one container) when we were under heavy load. It took quite a while to store/retrieve data from the graph database. I will need to check the communication parameters or find the bottleneck there. Also, retries are done at the UI level because it takes time to get the data.

Currently the main issue is the data_model importer, which is an API service that simply pushes data into the graph. There was a plan to remove this API service (saving one container in the deployment) and let workers communicate with Gremlin directly.

From my perspective, there is a nice opportunity to write a really small library that would help us with Gremlin communication, serializing query results, and constructing queries (it could directly use the schemas we maintain for task results). It could be used both on the worker side and on the API side to unify work with the graph; a rough sketch is below. This would also address other concerns we currently have.
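
A minimal sketch of what such a library could look like (all names are hypothetical; it assumes the Gremlin Server HTTP endpoint, which accepts a JSON payload with a gremlin string plus optional bindings and returns results under result.data):

```python
import requests


class GraphClient:
    """Tiny client shared by workers and the API service (hypothetical name)."""

    def __init__(self, url="http://gremlin-server:8182"):
        # Gremlin Server HTTP endpoint; host/port are deployment-specific.
        self.url = url

    def execute(self, query, bindings=None):
        """POST a Gremlin query with bindings and return the result data."""
        payload = {"gremlin": query, "bindings": bindings or {}}
        response = requests.post(self.url, json=payload)
        response.raise_for_status()
        return response.json()["result"]["data"]


# Example usage with a parameterized query:
# client = GraphClient()
# client.execute("g.V().has('ecosystem', eco).valueMap()", {"eco": "maven"})
```

Parameterized bindings instead of string interpolation would also give us query-injection safety for free.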

miteshvp commented 7 years ago

Running multiple Titan instances on one machine backed by the same storage backend (distributed or local) requires that each of these instances has a unique configuration for storage.machine-id-appendix. Otherwise, these instances might overwrite each other leading to data corruption. See Graph Configuration for more information.

Source: http://titan.thinkaurelius.com/wikidoc/0.3.1/Titan-Limitations.html

We are using Titan 1.0.0; this document relates to a pretty old version of Titan.

krishnapaparaju commented 7 years ago

Which mode are we using in production for Titan? The multiple or the single item model?

tuxdna commented 7 years ago

@krishnapaparaju Do you mean SINGLE vs. MULTI as documented here:

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Tools.TitanDB.BestPractices.html

As per my understanding of the DynamoDB Titan plugin, Gremlin is using MULTI by default.
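
For reference, a hypothetical excerpt of how the data model is selected per store in the DynamoDB storage backend properties (store and property names follow the awslabs plugin's documented examples):

```properties
# dynamodb.properties (hypothetical excerpt)
storage.dynamodb.stores.edgestore.data-model=MULTI
storage.dynamodb.stores.graphindex.data-model=MULTI
# a single-item-model experiment would instead set:
# storage.dynamodb.stores.edgestore.data-model=SINGLE
```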

miteshvp commented 7 years ago

Which mode are we using in production for Titan? The multiple or the single item model?

@krishnapaparaju - yes, we are using the multi item model.

krishnapaparaju commented 7 years ago

Thanks @miteshvp @tuxdna. I am sure MULTI would have issues on the write side. Any side effects we would get into if we switch to SINGLE? If not, I would start with (a) changing to the SINGLE model and (b) observing the WRITE-CAPACITY-UNITS-USED details at AWS.
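
A minimal sketch of how (b) could be measured, assuming CloudWatch's ConsumedWriteCapacityUnits metric; the table name is a placeholder, since the actual name depends on the prefix configured for the Titan backend:

```python
import datetime

import boto3

# CloudWatch client; credentials and region come from the usual AWS config.
cloudwatch = boto3.client("cloudwatch")

# Sum the write capacity units consumed by one backing table over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "v100_edgestore"}],  # placeholder
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```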

krishnapaparaju commented 7 years ago

Once these issues are resolved, we might want to move to JanusGraph (again with the AWS plugin); this would protect our prior investments both in Gremlin (code) and DynamoDB (deployment). I don't think we need to do anything related to JanusGraph anytime soon; this is more of an FYI about moving to a DB which is actively maintained, with very minimal rework.

https://github.com/awslabs/dynamodb-janusgraph-storage-backend

miteshvp commented 7 years ago

Any side effects we would get into if we switch to SINGLE?

  1. This means converting the entire graph to write in SINGLE mode from scratch.
  2. A limit of 400 KB per node, which is still OK for our nodes.
  3. Graphs with a low vertex degree and a low number of items per index can take advantage of this implementation. We have rather a big skew towards fewer indices but more items.
  4. We will definitely save time on writes (initial graph loads), but in previous experimentation with reads there was hardly any difference in fetch times. That was one of the reasons for not going ahead and re-writing the entire graph in SINGLE mode.

cc @krishnapaparaju

krishnapaparaju commented 7 years ago

@miteshvp thanks. We can start by experimenting with the SINGLE model (on a different instance) to check whether we do better on the WRITE side. Good to know that changing to SINGLE would not lead to any READ-side challenges. Based on the results of that experiment, we can plan a course of action.

miteshvp commented 7 years ago

Closing this issue w.r.t. #11.