Hi, I'm new to the Janus/Gremlin ecosystem. I started off with the CloudFormation example and got it working. Then I modified the GraphOfTheGods Java code to load into my remote EC2 instance over SSH with port 8182 forwarded. I'm using all the default settings; it's obviously slow, but I can see the number of (to me unintelligible) records increasing in my DynamoDB tables. 👍
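For context, my current loader is basically this sketch (simplified, not my exact code; localhost:8182 is just my SSH tunnel into the EC2 box):

```java
import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;

public class RemoteGodsLoader {
    public static void main(String[] args) {
        // localhost:8182 is forwarded over SSH to the Gremlin Server on the EC2 instance
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();

        // One element at a time, GraphOfTheGods-style; fine for the demo, obviously slow at scale
        client.submit("g.addV('titan').property('name', 'saturn').property('age', 10000)").all().join();
        client.submit("g.addV('location').property('name', 'sky')").all().join();

        client.close();
        cluster.close();
    }
}
```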
My main real-world data actually lives in tons of ORC files on S3, which I can read with Spark. The data is analogous to tweets, with lots of supernodes. I assume I'll need Spark to pull out all the vertices/edges/attributes if I need to load them separately for bulk loading (i.e., to get unique vertices and edges), roughly like the sketch below. Is this what other people are doing in non-toy examples, or am I barking up the wrong tree?
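What I'm imagining on the Spark side is roughly the following (the S3 paths and column names are made up; my real schema is tweet-like):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExtractVerticesAndEdges {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("extract-vertices-and-edges")
                .getOrCreate();

        // Read the raw tweet-like records from ORC on S3 (hypothetical path/columns)
        Dataset<Row> tweets = spark.read().orc("s3://my-bucket/tweets/");

        // Unique vertices: every id that appears as an author or as a mention
        Dataset<Row> userVertices = tweets.select("author_id")
                .union(tweets.select("mentioned_id"))
                .distinct();

        // Unique edges: deduplicated (author, mentioned) pairs
        Dataset<Row> mentionEdges = tweets.select("author_id", "mentioned_id")
                .distinct();

        // Stage the lists somewhere a bulk loader could pick them up
        userVertices.write().csv("s3://my-bucket/staging/vertices/");
        mentionEdges.write().csv("s3://my-bucket/staging/edges/");

        spark.stop();
    }
}
```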
Am I going to need a separate EMR cluster to do the bulk load, with the code running on that cluster connecting to my EC2 instance (the one running dynamodb-janusgraph-storage-backend and a Gremlin shell)?
-Or-
Should I use Spark to create CSVs or JSON files of nodes/edges and then run something like the bulk importer from JanusGraph Utils to load those lists?
-Or-
Am I supposed to run dynamodb-janusgraph-storage-backend on a really beefy EC2 instance and use the SparkGraphComputer in Gremlin there somehow, to reach out and pull the data like the Grateful Dead demo?
What's the currently accepted way to do bulk inserts with dynamodb-janusgraph-storage-backend?
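For what it's worth, my naive mental model of a "bulk insert" is each worker opening the graph directly with batch loading turned on and committing in big chunks, something like the sketch below (the storage.* keys and the endpoint are my guesses from the backend README, so please correct me if that's the wrong approach):

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.JanusGraphTransaction;

public class NaiveBulkWriter {
    public static void main(String[] args) {
        // Config keys below are my best guess from the dynamodb-janusgraph-storage-backend README
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "com.amazon.janusgraph.diskstorage.dynamodb.DynamoDBStoreManager")
                .set("storage.dynamodb.client.endpoint", "https://dynamodb.us-east-1.amazonaws.com")
                .set("storage.batch-loading", true) // relax consistency checks for faster loading
                .open();

        JanusGraphTransaction tx = graph.newTransaction();
        // ... add one partition's worth of vertices/edges here, then commit ...
        tx.commit();

        graph.close();
    }
}
```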
If I'm using the AWS-hosted DynamoDB as the storage mechanism vs. something like Cassandra or HBase, am I losing out on any other types of bulk loading or OLAP/OLTP techniques? (Can I expect to reuse bulk loading examples that target local datastores/Cassandra?)
The last question about bulk loading support for this flavor of datastore was from 2015. Is it still relevant?
https://github.com/awslabs/dynamodb-janusgraph-storage-backend/issues/9