anilpacaci / graph-benchmarking


Version of LDBC data generator used + some more documentation please. #18

Open pawanrawal opened 6 years ago

pawanrawal commented 6 years ago

We at Dgraph are trying to reproduce the benchmarks mentioned at https://event.cwi.nl/grades/2017/12-Apaci.pdf and write a blog post comparing Dgraph against the mentioned options. I have some questions; I am specifically interested in the comparison against PostgreSQL, Titan, and Neo4j.

  1. The schema of the data generated by the LDBC data generator (v0.2.6) seems to have changed, and I get an error while importing data into Postgres. Is there something I am doing wrong here?
    psql:load_csv.sql:178: ERROR:  missing data for column "c_creator"
    CONTEXT:  COPY comment_f, line 2: "1236950581249|2011-09-17T06:26:59.961+0000|77.240.75.197|Chrome|yes|3"
    COPY 2719160
    psql:load_csv.sql:180: ERROR:  missing data for column "f_moderator"
    CONTEXT:  COPY forum_f, line 2: "0|Wall of Mahinda Perera|2010-03-17T07:32:20.447+0000"
    COPY 1629206
    COPY 309775
    psql:load_csv.sql:183: ERROR:  missing data for column "o_placeid"
    CONTEXT:  COPY organisation_f, line 2: "0|company|Kam_Air|http://dbpedia.org/resource/Kam_Air"
    psql:load_csv.sql:184: ERROR:  missing data for column "p_placeid"
    CONTEXT:  COPY person_f, line 2: "933|Mahinda|Perera|male|1989-12-03|2010-03-17T07:32:10.447+0000|192.248.2.123|Firefox"
    COPY 16836
    COPY 229166
    COPY 180670
    COPY 746332
    COPY 1470583
    COPY 20540
    COPY 7949
    COPY 21764
    psql:load_csv.sql:193: ERROR:  missing data for column "p_ispartof"
    CONTEXT:  COPY place_f, line 2: "0|India|http://dbpedia.org/resource/India|country"
    psql:load_csv.sql:194: ERROR:  missing data for column "p_creator"
    CONTEXT:  COPY post_f, line 2: "1236950581248||2011-09-16T22:05:40.595+0000|192.248.2.123|Firefox|uz|About Augustine of Hippo, ustin..."
    COPY 721295
    COPY 71
    COPY 70
    COPY 16080
    COPY 16080
    INSERT 0 746332
    INSERT 0 1470583
    INSERT 0 2719160
    INSERT 0 721295
    INSERT 0 0
    INSERT 0 0
    INSERT 0 1629206
    INSERT 0 309775
    INSERT 0 0
    INSERT 0 21764
    INSERT 0 7949
    INSERT 0 0
    INSERT 0 16836
    INSERT 0 229166
    INSERT 0 20540
    INSERT 0 180670
    INSERT 0 180670
    INSERT 0 0
    INSERT 0 0
    INSERT 0 16080
    INSERT 0 70
    INSERT 0 71
    INSERT 0 16080

It would be great if the load_csv.sql script could be updated, or if you could specify the version of the data generator that was used to generate this data.
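
In case it is useful, the direction I am currently experimenting with is to restrict each failing COPY to the columns that actually appear in the v0.2.6 CSVs and load the foreign keys separately afterwards. The snippet below is only a sketch of that idea; the column names are guesses inferred from the error messages above, not the actual definitions in db_schema.sql or load_csv.sql.

    -- Hedged sketch, not the actual load_csv.sql: restrict the COPY to the six
    -- columns the v0.2.6 comment CSV appears to contain (per the error line above),
    -- leaving c_creator and the other foreign-key columns NULL for now.
    -- All table and column names here are assumptions inferred from the error messages.
    \copy comment_f (c_commentid, c_creationdate, c_locationip, c_browserused, c_content, c_length) from 'comment_0_0.csv' with (format csv, delimiter '|', header true)

The missing foreign keys would then presumably have to be filled in from the separate relation files that the newer generator emits, which is why knowing the exact datagen version used for the original load matters.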

  2. The paper mentions 4 types of query latencies, but there are 13 LDBC queries in the benchmark. How are the queries grouped? Is there a framework for evaluating the read-only query performance for Postgres?

  3. Was the data ingestion done after adding the indexes or without them?

In general, some more documentation and steps to reproduce the benchmarks would be very useful.

anilpacaci commented 6 years ago

Hi,

Thanks for your interest. I hope it will be a useful tool, and I would like to hear more about your work.

  1. The version of the datagen we used for this study is v0.2.5; I believe working with that version would solve your problem. Otherwise, I can update load_csv.sql based on the version you use. I would appreciate it if you could share the schema with me so that I can compare the original datagen schema with the schema of the tables created by the script.
  2. For the interactive workload we used the short query mix from the LDBC SNB Interactive Workload, as indicated in the paper. Read-only query latencies are measured by manually executing only a specific type of query (person lookup for point queries, retrieving immediate friends for the 1-hop neighbourhood, finding the shortest path between two random persons for SSSP, etc.); a rough SQL sketch follows this list.
  3. Data ingestion was performed after the indexes were built.
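
To make the grouping in item 2 concrete, here is a rough SQL sketch of the three read-only micro-query types on the relational schema. The table and column names (person, knows, p_personid, k_person1id, k_person2id) and the ids are illustrative assumptions rather than the exact definitions in db_schema.sql, and the shortest-path query is only a depth-bounded approximation of the actual measurement.

    -- Table/column names and ids below are illustrative assumptions,
    -- not the exact db_schema.sql definitions.

    -- Point query: look up a single person by id (id 933 taken from the sample data above).
    SELECT *
    FROM person
    WHERE p_personid = 933;

    -- 1-hop neighbourhood: immediate friends of a person.
    SELECT k_person2id
    FROM knows
    WHERE k_person1id = 933;

    -- SSSP-style query: shortest path between two persons, approximated here
    -- with a depth-bounded recursive CTE (the target id 4398 is arbitrary).
    WITH RECURSIVE frontier(person_id, depth) AS (
        SELECT k_person2id, 1 FROM knows WHERE k_person1id = 933
        UNION
        SELECT k.k_person2id, f.depth + 1
        FROM knows k
        JOIN frontier f ON k.k_person1id = f.person_id
        WHERE f.depth < 4
    )
    SELECT MIN(depth) AS path_length
    FROM frontier
    WHERE person_id = 4398;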
pawanrawal commented 6 years ago

I would appreciate it if you could share the schema with me so that I can compare the original datagen schema with the schema of the tables created by the script.

I am using the schema mentioned at https://github.com/anilpacaci/graph-benchmarking/blob/master/snb-interactive-sql/scripts/db_schema.sql

Read-only query latencies are measured by manually executing only a specific type of query (person lookup for point queries, retrieving immediate friends for the 1-hop neighbourhood, finding the shortest path between two random persons for SSSP, etc.)

So was only a single query with a specific id run per query type to obtain the latencies?