MartinHaeusler / chronos

The Chronos versioning project aims to provide easy-to-use and reliable versioned data storage.

A couple of questions about Chronograph #2

Closed Kentoseth closed 7 years ago

Kentoseth commented 7 years ago

Hello,

This isn't an issue, but more of an FAQ concerning Chronograph. See below for my questions:

1) Do I have to use Java or can I use the Gremlin Query Language instead?

2) Is this the only known solution for time-based versioned graphs? (If you know of others, could you please share them?)

3) (This is a bit of a longer question.) Does this software address the "friends of friends" / "people you may know" concept that you find on social media (via graphs)?

e.g. A (friends) B - January: March
     B (friends) C - January: June
     C (friends) D - February: July
     C (friends) F - January: September
     E (friends) F - March: August

  -> In February: C can be suggested as someone A can befriend (1 position away, via B)
  -> In April: C cannot be suggested as a friend of A (1 position away)
  -> In February: D can be suggested as a friend of A (2 positions away)
  -> In April: D cannot be suggested as a friend of A (2 positions away)
  -> In February: A befriends C:
    --> In February: F (1 position away) can be suggested as a friend of A
    --> In February: E (2 positions away) cannot be suggested as a friend of A
    --> In April: E (2 positions away) can be suggested as a friend of A

Using ChronoGraph, will I be able to query when a person could be suggested as a friend of A, based on the time of the query (e.g. check who can be a friend of A in April)?

MartinHaeusler commented 7 years ago

Hello,

first of all - thank you for your interest in this project. About your questions...

  1. Currently, ChronoGraph works exclusively as a process-embedded database. So, yes, at the moment you have to use Java. Apache TinkerPop offers Gremlin Server, which lets you host TinkerPop-compliant databases as servers and communicate with them via REST calls (which can be sent from an arbitrary programming language). I have never really tried it, but it might work with ChronoGraph as well. However, you would not have access to the versioning-related features of ChronoGraph. In fact, I am currently developing a server for ChronoGraph that works similarly to Gremlin Server, except that it also allows access to the additional features that ChronoGraph provides. I can't offer any estimate of when this will be released though, as it also brings in new issues, such as security and authentication, that were not present before.

  2. At least for TinkerPop-compliant graph databases, I can state with confidence that ChronoGraph is currently the only solution that offers built-in versioning and branching capabilities. I have read some academic papers on other solutions, but they are a) prototypes (i.e. not production-ready), b) not TinkerPop-compliant and c) often focused on analysis only (i.e. no or very limited online transaction processing capabilities). If you are still interested, here is an example of such a solution (the similarity in name is entirely coincidental; we are not affiliated with these authors in any way). However, there are also other ways. In theory, you could use a standard graph database (like Neo4J, Titan...) and lay out your graph such that it supports multiple versions of itself. Examples of such approaches can be found here and here. However, such approaches will drastically increase the complexity of all your queries: basically, you will have to deal with the versioning process yourself, all the time. This was one of the reasons why ChronoGraph was originally created. It handles the versioning process internally and transparently for you.

  3. I know about the "friend-of-a-friend" concept, but have never implemented it myself. What I can say for sure is: you can definitely do this with TinkerPop and Gremlin (even though this topic is more popular in the RDF community). And since ChronoGraph is a TinkerPop-compliant database, you can also do it in ChronoGraph. We offer full support for attributes on vertices and edges, and you can query them using regular Gremlin. If you can observe the changes in the friendship graph "live", then you might even consider relying on ChronoGraph's versioning engine to keep track of the friendship time ranges for you (you can ask the graph when a given edge was created or changed; check out g.getEdgeHistory(...)). If you have a static snapshot, then you would have to write the time ranges into edge properties and query them using Gremlin. In case you are starting out from a FOAF dataset in RDF format: you will first need to convert it into a TinkerPop graph. This can be done e.g. by converting the RDF file into GraphSON and importing that into ChronoGraph, or by parsing the RDF data in Java and then creating the necessary vertices and edges on the fly using the regular TinkerPop API.
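To make the "time ranges in edge properties" option from point 3 more concrete, here is a minimal sketch in plain Java. It does not use the TinkerPop or ChronoGraph API; the class `TimedFriendGraph`, its methods, and the encoding of months as integers (1 = January) are assumptions made purely for illustration. It only covers suggestions that are one position away (friends of current friends), using the friendships from the question above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: friendships stored as edges with an explicit month range. */
class TimedFriendGraph {

    /** Undirected friendship, valid from month `from` to month `to` (1 = January), inclusive. */
    static final class Friendship {
        final String a, b;
        final int from, to;

        Friendship(String a, String b, int from, int to) {
            this.a = a; this.b = b; this.from = from; this.to = to;
        }

        boolean activeIn(int month) {
            return from <= month && month <= to;
        }
    }

    private final List<Friendship> edges = new ArrayList<>();

    void addFriendship(String a, String b, int from, int to) {
        edges.add(new Friendship(a, b, from, to));
    }

    /** Direct friends of `person`, considering only edges that are active in `month`. */
    Set<String> friendsOf(String person, int month) {
        Set<String> result = new TreeSet<>();
        for (Friendship f : edges) {
            if (!f.activeIn(month)) continue;
            if (f.a.equals(person)) result.add(f.b);
            if (f.b.equals(person)) result.add(f.a);
        }
        return result;
    }

    /** Friends of current friends in `month`, minus the person and their existing friends. */
    Set<String> suggestionsFor(String person, int month) {
        Set<String> direct = friendsOf(person, month);
        Set<String> result = new TreeSet<>();
        for (String friend : direct) {
            result.addAll(friendsOf(friend, month));
        }
        result.remove(person);
        result.removeAll(direct);
        return result;
    }

    /** The five friendships listed in the question. */
    static TimedFriendGraph exampleFromQuestion() {
        TimedFriendGraph g = new TimedFriendGraph();
        g.addFriendship("A", "B", 1, 3); // January to March
        g.addFriendship("B", "C", 1, 6); // January to June
        g.addFriendship("C", "D", 2, 7); // February to July
        g.addFriendship("C", "F", 1, 9); // January to September
        g.addFriendship("E", "F", 3, 8); // March to August
        return g;
    }

    public static void main(String[] args) {
        TimedFriendGraph g = exampleFromQuestion();
        System.out.println("February: " + g.suggestionsFor("A", 2)); // [C]
        System.out.println("April:    " + g.suggestionsFor("A", 4)); // []
    }
}
```

In a real TinkerPop graph the same idea would be expressed as properties on the friendship edges plus a filter on those properties in each traversal step, which is exactly the extra query complexity mentioned in point 2.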

I hope this answers your questions. Feel free to respond to this issue if you have further questions.

Kentoseth commented 7 years ago

Thank you for answering my questions in such detail, it is much appreciated.

  1. I was hoping it might have been Gremlin-language compatible, as this language is a bit easier to use/query than Java itself. I will still try it out (and maybe learn some Java too).

  2. I did find this link previously: link, which is perhaps how I found your project (trawling through search results and the Google Group results). I will read up on the other links too (thanks for sharing).

  3. Just to clarify the following: if I implement my friends-of-friends example, will ChronoGraph time-version the graph itself when I apply an edit like this?

April: A and B are no longer friends, -> C cannot be suggested as friend of A (1-position away)

Also, if you would like to discuss further (what my goals are), you can reach me on my GH-profile email.

MartinHaeusler commented 7 years ago

Hi,

sorry for the late response, my calendar was overflowing with events during the past couple of days... About your questions.

  1. I am not entirely sure that I understand you correctly. Gremlin always was (and still is) a domain-specific language that is embedded in Java. Whenever you write a Gremlin query, you are effectively writing Java code. Sure, there are ways to "hide" it, for example by using Groovy and providing some syntactic sugar (like the Gremlin REPL shell does), but at the end of the day, a Gremlin query is always expressed as Java source code. Any significant modern graph database that I am aware of is coded in Java or has at least a Java API (Neo4J, Titan, Orient, ...). What I am getting at is: given the current state of the technology, if you want to work with graph databases, you will have to work with Java eventually, one way or the other. Studying a bit of Java is therefore certainly a good idea. If you don't like Java's syntax, you can also use TinkerPop databases with JVM-compatible languages (e.g. Groovy, Xtend, JRuby, Kotlin...). I know from personal experience that TinkerPop and ChronoGraph play fairly nicely with Groovy; no guarantees regarding the other languages though.

  2. No problem. If you are interested in academic papers on these aspects, I will present a paper about ChronoGraph at DATA 2017. I will provide a link to it in the GitHub readme files as soon as the proceedings are published.

  3. To fully answer this question (and I do believe that this detail is very important), I need to explain some background. As you know, ChronoGraph (and the underlying store, ChronoDB) use system-time versioning (a.k.a. transaction-time versioning). This means that the version history of every entry is managed for you in a transparent way, i.e. you don't have any obligations on your part to make it work. One of the primary constraints is that you can only append new versions at the end of the history; you cannot change the past. Furthermore, the timestamp at which the insertion of a new entry occurs is not controlled by the programmer, but by the database itself. You cannot choose to insert a new entry at a given date and time; instead, ChronoGraph will append that change to the end of the history automatically, using the current system time (thus the name "system-time versioning") as the timestamp. You can then query this managed history using a mixture of Gremlin (for queries that work on a given timestamp, but do not need to jump between timestamps) and custom operations only available on ChronoGraph (e.g. g.getVertexHistory(...) and many others). The consequence for your friend-of-a-friend graph is: if your graph is running on a server backend, and you can witness all changes "live" as they come in (e.g. the addition of a new friendship edge, adding a new person...), then you can rely on ChronoGraph's versioning to do exactly what you said in your question. Just set your timestamp to a date and time in April, and query this single version of the graph via Gremlin. However, if you are working e.g. under laboratory conditions and need to perform simulations that run within minutes or hours, but simulate years of events, then ChronoGraph's versioning capabilities will not help you for this particular case. You cannot "stretch" or "staunch" the flow of time in ChronoGraph. The system observes changes and versions them according to the real system time. In such a case, you would annotate your edges with properties that indicate from when to when a friendship has existed.
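The constraints described in point 3 (append-only history, timestamps assigned by the store rather than the caller, reads against a chosen point in time) can be sketched with a tiny in-memory store. This is not ChronoGraph or ChronoDB code; `SystemTimeStore`, `put`, and `get` are hypothetical names, and the injectable clock exists only so that the behaviour can be demonstrated deterministically.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch of system-time (transaction-time) versioning for string values. */
class SystemTimeStore {

    // Per key: all versions ever written, ordered by the timestamp the store assigned.
    private final Map<String, TreeMap<Long, String>> histories = new HashMap<>();
    private final LongSupplier clock;
    private long lastTimestamp = Long.MIN_VALUE;

    SystemTimeStore(LongSupplier clock) {
        this.clock = clock;
    }

    /**
     * Appends a new version of `key`. There is deliberately no timestamp parameter:
     * the store picks the time itself and only ever appends to the end of the history.
     * Returns the timestamp that was assigned.
     */
    long put(String key, String value) {
        long now = clock.getAsLong();
        if (now <= lastTimestamp) {
            now = lastTimestamp + 1; // keep timestamps strictly increasing
        }
        lastTimestamp = now;
        histories.computeIfAbsent(key, k -> new TreeMap<>()).put(now, value);
        return now;
    }

    /** Returns the value of `key` as it was at `timestamp`, or null if it did not exist yet. */
    String get(String key, long timestamp) {
        TreeMap<Long, String> history = histories.get(key);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, String> entry = history.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        SystemTimeStore store = new SystemTimeStore(System::currentTimeMillis);
        long t1 = store.put("A-B", "friends");
        long t2 = store.put("A-B", "not friends");
        System.out.println(store.get("A-B", t1)); // friends
        System.out.println(store.get("A-B", t2)); // not friends
    }
}
```

Because `put` takes no timestamp, a simulation that replays years of events in a few minutes would see all of its versions stamped within those few minutes, which is why explicit time-range properties on edges are the suggested route for that case.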

TL;DR: The main point is that ChronoGraph always uses the system time for versioning. Time cannot be simulated to run faster or slower in ChronoGraph. If your use case needs to simulate large time ranges in short periods, then ChronoGraph's versioning capabilities will not help you directly to solve the problem. You would have to use properties on edges instead to express time ranges explicitly.

Kentoseth commented 7 years ago

1) Thank you for clarifying that. You are right that the majority of graph-database technology out there is written in Java (or uses the JVM somehow).

3) Thank you for clarifying this as well. I was seeking a way to insert historical data, but based on what I am trying to do, finding and inserting that historical data will be unlikely or difficult. Using the system time for inserts as the data is "pulled" from the relevant APIs will therefore work too.

Thank you very much for your patience in answering me.

I look forward to using your software.