ArangoDB-Community / arangodb-tinkerpop-provider

An implementation of the Tinkerpop OLTP Provider for ArangoDB
Apache License 2.0

Provide schema to use with arangoimp #62

Closed MartinBrugnara closed 3 years ago

MartinBrugnara commented 3 years ago

Issue: the native TinkerPop loader does not scale (neither in memory nor in time).

A sensible solution would be to use arangoimp. Please publish a schema for JSON/JSONL to use with that tool.

Cheers, M

arcanefoam commented 3 years ago

Can you be a bit more specific about the scaling issue, and about how arangoimp would be used?

MartinBrugnara commented 3 years ago

The native TinkerPop loader does not scale because it keeps the entire parsed graph in memory.

A simple solution to avoid that is to parse the dataset twice: on the first pass insert the nodes, on the second the edges.

But even this is not enough: it stalls when importing datasets on the order of 10 GB, as the TinkerPop API seems to be too slow for bulk importing. The next step, then, is to replace it with the bulk importer arangoimp. But to do that, one needs to know how the TinkerPop adapter stores the data inside the actual database ... hence this issue.

Anyway, I was able to infer it from a smaller instance, I am thus closing.

MartinBrugnara commented 3 years ago

If you like, the following gist is an implementation of what I described above. The code is not clean, but it scaled to 41 GB without any problems.

https://gist.github.com/MartinBrugnara/ff68b1a0d7572a7f7361bd4d6b120f4d

arcanefoam commented 3 years ago

Thanks for the code! One of the enhancement ideas on my todo list is efficient memory use, so that the whole graph is not kept in memory. Hopefully I will have some time in this life to work on the next version of this provider :).

MartinBrugnara commented 3 years ago

We will release all our code (hopefully soon ;)), maybe you will be able to borrow some parts. Cheers, M

dothebart commented 3 years ago

Hi, let me add a few words to this. In the end you're using arangoimp, which has an auto-pacing algorithm implemented to keep the import process from overrunning the I/O capabilities of your installation:

https://www.arangodb.com/docs/stable/programs-arangoimport-details.html#automatic-pacing-with-busy-or-low-throughput-disk-subsystems

I/O limitations are demonstrated in this article: there are buffers that fill up, and once they're full you are down to the raw I/O throughput your system can sustain:

https://www.arangodb.com/docs/stable/tutorials-reduce-memory-footprint.html#testing-the-effects-of-reduced-io-buffers

If you overrun it further than that, things go bad and everything comes to a grinding halt. If you want to implement something similar for writing bulk import requests, I would try to double the effective time of each bulk request by sleeping afterwards.
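The sleep-based pacing suggested above can be sketched in a few lines. This is a minimal illustration, not the provider's actual code: `send_bulk` is a hypothetical callable that posts one bulk request to the server.

```python
import time

def paced_import(batches, send_bulk):
    """Send bulk requests, sleeping after each one for as long as it took.

    Each batch then occupies roughly twice its own duration, which gives
    the disk subsystem time to drain its write buffers between requests.
    """
    for batch in batches:
        start = time.monotonic()
        send_bulk(batch)  # hypothetical: POSTs one bulk import request
        elapsed = time.monotonic() - start
        time.sleep(elapsed)
```

The pause scales automatically: slow requests (a sign of back-pressure) earn longer pauses, while fast requests barely slow the import down.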