anilpacaci / graph-benchmarking


Version of LDBC data generator used + some more documentation please. #18

Open pawanrawal opened 6 years ago

pawanrawal commented 6 years ago

We at Dgraph are trying to reproduce the benchmarks mentioned at https://event.cwi.nl/grades/2017/12-Apaci.pdf and write a blog post comparing Dgraph against the mentioned options. I have some questions; I am specifically interested in the comparison against PostgreSQL, Titan, and Neo4j.

  1. The schema of the data generated by the LDBC data generator (v0.2.6) seems to have changed, and I get an error while importing data into Postgres. Is there something I am doing wrong here?
    psql:load_csv.sql:178: ERROR:  missing data for column "c_creator"
    CONTEXT:  COPY comment_f, line 2: "1236950581249|2011-09-17T06:26:59.961+0000|77.240.75.197|Chrome|yes|3"
    COPY 2719160
    psql:load_csv.sql:180: ERROR:  missing data for column "f_moderator"
    CONTEXT:  COPY forum_f, line 2: "0|Wall of Mahinda Perera|2010-03-17T07:32:20.447+0000"
    COPY 1629206
    COPY 309775
    psql:load_csv.sql:183: ERROR:  missing data for column "o_placeid"
    CONTEXT:  COPY organisation_f, line 2: "0|company|Kam_Air|http://dbpedia.org/resource/Kam_Air"
    psql:load_csv.sql:184: ERROR:  missing data for column "p_placeid"
    CONTEXT:  COPY person_f, line 2: "933|Mahinda|Perera|male|1989-12-03|2010-03-17T07:32:10.447+0000|192.248.2.123|Firefox"
    COPY 16836
    COPY 229166
    COPY 180670
    COPY 746332
    COPY 1470583
    COPY 20540
    COPY 7949
    COPY 21764
    psql:load_csv.sql:193: ERROR:  missing data for column "p_ispartof"
    CONTEXT:  COPY place_f, line 2: "0|India|http://dbpedia.org/resource/India|country"
    psql:load_csv.sql:194: ERROR:  missing data for column "p_creator"
    CONTEXT:  COPY post_f, line 2: "1236950581248||2011-09-16T22:05:40.595+0000|192.248.2.123|Firefox|uz|About Augustine of Hippo, ustin..."
    COPY 721295
    COPY 71
    COPY 70
    COPY 16080
    COPY 16080
    INSERT 0 746332
    INSERT 0 1470583
    INSERT 0 2719160
    INSERT 0 721295
    INSERT 0 0
    INSERT 0 0
    INSERT 0 1629206
    INSERT 0 309775
    INSERT 0 0
    INSERT 0 21764
    INSERT 0 7949
    INSERT 0 0
    INSERT 0 16836
    INSERT 0 229166
    INSERT 0 20540
    INSERT 0 180670
    INSERT 0 180670
    INSERT 0 0
    INSERT 0 0
    INSERT 0 16080
    INSERT 0 70
    INSERT 0 71
    INSERT 0 16080

It would be great if the load_csv.sql script could be updated, or if you could specify the version of the data generator that was used to generate this data.
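
In case it is useful, the direction I am currently experimenting with is to restrict each failing COPY to the columns that actually appear in the v0.2.6 CSVs and load the foreign keys separately afterwards. The snippet below is only a sketch of that idea; the column names are guesses inferred from the error messages above, not the actual definitions in db_schema.sql or load_csv.sql.

    -- Hedged sketch, not the actual load_csv.sql: restrict the COPY to the six
    -- columns the v0.2.6 comment CSV appears to contain (per the error line above),
    -- leaving c_creator and the other foreign-key columns NULL for now.
    -- All table and column names here are assumptions inferred from the error messages.
    \copy comment_f (c_commentid, c_creationdate, c_locationip, c_browserused, c_content, c_length) from 'comment_0_0.csv' with (format csv, delimiter '|', header true)

The missing foreign keys would then presumably have to be filled in from the separate relation files that the newer generator emits, which is why knowing the exact datagen version used for the original load matters.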

  2. The paper mentions 4 types of query latencies, but there are 13 LDBC queries in the benchmark. How are the queries grouped? Is there a framework for evaluating the read-only query performance for Postgres?

  3. Was the data ingestion done after adding the indexes or without them?

In general, some more documentation and steps to reproduce the benchmarks would be very useful.

anilpacaci commented 6 years ago

Hi,

Thanks for your interest. I hope it will be a useful tool, and I would like to hear more about your work.

  1. The version of the datagen we used for this study is v0.2.5; I believe working with that version would solve your problem. Otherwise, I can update load_csv.sql based on the version you use. I would appreciate it if you could share the schema with me so that I can compare the original datagen schema with the schema of the tables created by the script.
  2. For the interactive workload we used the short query mix from the LDBC SNB Interactive Workload, as indicated in the paper. Read-only query latencies are measured by manually executing only a specific type of query (person lookup for point queries, retrieving immediate friends for the 1-hop neighbourhood, finding the shortest path between two random persons for SSSP, etc.); a rough SQL sketch follows this list.
  3. Data ingestion was performed after the indexes were built.
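
To make the grouping in item 2 concrete, here is a rough SQL sketch of the three read-only micro-query types on the relational schema. The table and column names (person, knows, p_personid, k_person1id, k_person2id) and the ids are illustrative assumptions rather than the exact definitions in db_schema.sql, and the shortest-path query is only a depth-bounded approximation of the actual measurement.

    -- Table/column names and ids below are illustrative assumptions,
    -- not the exact db_schema.sql definitions.

    -- Point query: look up a single person by id (id 933 taken from the sample data above).
    SELECT *
    FROM person
    WHERE p_personid = 933;

    -- 1-hop neighbourhood: immediate friends of a person.
    SELECT k_person2id
    FROM knows
    WHERE k_person1id = 933;

    -- SSSP-style query: shortest path between two persons, approximated here
    -- with a depth-bounded recursive CTE (the target id 4398 is arbitrary).
    WITH RECURSIVE frontier(person_id, depth) AS (
        SELECT k_person2id, 1 FROM knows WHERE k_person1id = 933
        UNION
        SELECT k.k_person2id, f.depth + 1
        FROM knows k
        JOIN frontier f ON k.k_person1id = f.person_id
        WHERE f.depth < 4
    )
    SELECT MIN(depth) AS path_length
    FROM frontier
    WHERE person_id = 4398;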
pawanrawal commented 6 years ago

I would appreciate it if you could share the schema with me so that I can compare the original datagen schema with the schema of the tables created by the script.

I am using the schema mentioned at https://github.com/anilpacaci/graph-benchmarking/blob/master/snb-interactive-sql/scripts/db_schema.sql

Read-only query latencies are measured by manually executing only a specific type of query (person lookup for point queries, retrieving immediate friends for the 1-hop neighbourhood, finding the shortest path between two random persons for SSSP, etc.)

So was only a single query with a specific id run per query type to obtain the latencies?