We'd definitely love feedback on this proposal! In terms of timelines, I'm out next week but can answer questions when I get back. If the community thinks this makes sense, we will be able to dedicate time to upstreaming this work in Q3 (starting soon!)
Thanks Alyssa, will take a look as well to learn more about the Neptune proxy design.
This looks great, Alyssa! I was wondering: during the data loading process, when does old data get removed?
During the data loading jobs, we update the last_updated_at timestamp. It shouldn't be too hard at that point to remove data that doesn't have the most recent date set (we don't currently do this, but it's something we need to do).
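Roughly, it could look like the sketch below (illustrative only, not code we run today; it assumes gremlin-python and that each load stamps a numeric `last_updated_at` property on every vertex):

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import P


def remove_stale_vertices(neptune_ws_url: str, load_started_at: int) -> None:
    """Drop vertices that were not touched by the current load (illustrative sketch)."""
    conn = DriverRemoteConnection(neptune_ws_url, 'g')
    try:
        g = traversal().withRemote(conn)
        # Anything whose last_updated_at predates the current load is considered stale.
        g.V().has('last_updated_at', P.lt(load_started_at)).drop().iterate()
    finally:
        conn.close()
```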
GremlinNeptune Square's Approach.pdf
Let me also upload @alran 's doc here which includes the Arch diagram.
hey @alran ,
Sorry for the delay in review.
I think overall Neptune lets AWS manage the graph instead of self-hosting neo4j. One main concern will be the migration. Here are also a few questions I have:
Thanks a lot and sorry for the delay in reviewing.
cc @jinhyukchang as an expert on the neo4j loader.
Also, do you use Apache TinkerPop (http://kelvinlawrence.net/book/PracticalGremlin.html)?
1. Is Neptune HA? Yes! More info about this is in the Neptune docs.
2. What is the read / write throughput for Neptune? Reads and writes are different, and writes differ between bulk loading and transactional writes. The bulk interface does something like 10K vertices a second (pretty fast). For transactional writes, even in a best case with kinesis/lambda we never got up to 1000s a second. We got to high hundreds a second with lots of errors. We've never really tested the reads in this setup, but it must be in the 1000s per second.
3. Does Square's data store team provide Neptune as a service? AWS provides it as a service, so our data store team doesn’t do anything
4. Does Neptune provide a UI similar to Neo4j to explore graph relationships? Neptune has partners that do graph visualizations. We wired up Graphexp (on github)
5. Do you know if it is possible to use cypher on Neptune which could help to minimize the migration of metadata? And how do you migrate your existing neo4j metadata to Neptune? It's not possible to use Cypher with Neptune (but it IS possible to use Gremlin with Neo4j!). We were able to get a GraphML dump from Neo4j and load it directly into Neptune using the bulk loader.
6. How big (# nodes and # relationships) is the graph at Square which causes the transactional issues? And what is the production graph at Square as of today? Our graph is currently ~25M vertices and roughly the same number of edges. This fits in a 4XL or 2XL Neptune instance. A few weeks ago we turned it up to 4XL, but watching it over the last week it's been fairly idle, so we will probably turn it down again. I don't think the transactional issues were related to the size of the data so much as the write throughput. When we were doing the kinesis + lambda experiment, we exported data (generated CSV files that go into the bulk loader) and passed the resulting JSON objects into kinesis. Even though we were writing every node and edge exactly once, almost immediately we had concurrency exceptions. There would have been fewer than 100K nodes in the db at that point and we already had those exceptions.
7. Let us know if you plan to change shared interfaces, which we could discuss if so. It seems like the main shared interface we'd be interested in changing is the way testing is set up in the metadata library. We have a concept of an "abstract_test_proxy" with tests that can run against any proxy (neo4j, gremlin, etc); a rough sketch follows this list. This has helped us make sure that our new proxy was returning / doing exactly the same thing as the neo4j proxy. You'd still be able to override a specific proxy test if necessary. Was there anything else you were thinking about for this question? Am I misreading?
8. Do you implement the ingestion loader through Airflow DAG? No. But we could!
9. Looking at the diagram, your approach seems to match what we have with databuilder, except you load directly into neptune instead of through metadata? Any reason for not leveraging the databuilder framework (let us know if that's your plan)? We have set this up inside our version of databuilder, so I think it should still work well with what exists in Amundsen! Jobs in our databuilder load the data directly into Neptune (which I think is the main difference).
10. Do you use Apache Tinkerpop? Yes! We really highly recommend the Practical Gremlin book you linked in your comment!
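To make the abstract_test_proxy idea from question 7 a bit more concrete, here's a rough sketch (class names, module paths, and the round-trip method are illustrative, not our exact code) of how one suite of behavioural tests could be inherited by every proxy implementation:

```python
import abc
import unittest


class AbstractProxyTest(abc.ABC):
    """Shared behavioural tests; each proxy implementation only supplies get_proxy()."""

    @abc.abstractmethod
    def get_proxy(self):
        ...

    def test_table_description_round_trip(self) -> None:
        # Assumes the table node already exists in the test database; the
        # PUT/POST endpoints discussed in this thread are what would create it.
        proxy = self.get_proxy()
        uri = 'hive://gold.core/orders'
        proxy.put_table_description(table_uri=uri, description='test table')
        self.assertEqual(proxy.get_table_description(table_uri=uri), 'test table')


class Neo4jProxyTest(AbstractProxyTest, unittest.TestCase):
    def get_proxy(self):
        # Requires a locally running neo4j test instance.
        from metadata_service.proxy.neo4j_proxy import Neo4jProxy
        return Neo4jProxy(host='bolt://localhost', port=7687)
```

A gremlin proxy test class would inherit the same `AbstractProxyTest` and only swap out `get_proxy()`, which is how the two proxies get verified against identical expectations.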
Regarding:
5. Do you know if it is possible to use cypher on Neptune which could help to minimize the migration of metadata?
There is https://github.com/opencypher/cypher-for-gremlin, which also mentions Neptune with some test coverage. I haven't tried it, though, and I'm unsure if it's a good idea to wrap too many layers of different protocols "just because you can".
It seems the graph query language war isn't over yet. I noticed https://www.gqlstandards.org/ as a result of https://gql.today/. I assume it could be years before a real standard emerges, let alone wide support.
Anyways, huge thanks for putting ☝️ info up here. This issue https://github.com/lyft/amundsen/issues/526 is becoming a good knowledge resource in and of itself.
Lastly, @alran, with Path Forward in https://github.com/lyft/amundsen/issues/526#issue-645758230, is there any place you see others can contribute now/soonish, or do we just have to wait until it's Q3 for y'all at Square? (We're potentially interested due to support in Azure via https://docs.microsoft.com/en-us/azure/cosmos-db/gremlin-support)
We did actually try out cypher-for-gremlin. We found that (1) it didn't generate idiomatic gremlin and (2) even the idiomatic gremlin upserts don't work great with Neptune.
I'm happy to report that we are currently in Q3! It started at the beginning of July. My main goal here is to make sure that the Amundsen community is happy with our approach so that we can relatively quickly upstream our work, once we begin. We are finalizing timelines for planned work for this quarter within our team on Monday.
FYI @friendtocephalopods is going to follow up here in the near future :)
Hi @feng-tao and other gremlin-interested friends,
Here's a slightly more detailed plan I've been thinking about:
1. Add abstract proxy test cases in metadataservice with a neo4j proxy test using new PUT/POST endpoints in the neo4j proxy. This will allow folks to roundtrip against a test database locally. It's not perfect -- y'all use databuilder code to write neo4j, but maybe in the mid term neo4j folks could refactor into something like the `amundsen-neo4j` approach I'm planning for the gremlin proxy below, or even call these metadataservice endpoints from the databuilder. I think it's still more comprehensive than the mock test approach, but if you'd rather not have the PUT/POST code in the proxy I can skip adding the neo4j proxy tests. One caveat is that this test wouldn't run in CI without someone doing additional work to spin up a neo4j instance in the amundsen CI environment, but it should still be useful locally. I can give suggestions if someone is trying to set it up but is having trouble; we did this internally when we still used neo4j in our CI.
2. Create a new `amundsen-gremlin` repository, maintained by Square. I think this could also be submoduled into the root amundsen meta-repo? This repository would contain all the code to interact with Neptune and would be shared between metadataservice and databuilder.
3. Add a neptune proxy test in metadataservice utilizing the `amundsen-gremlin` functionality. This would only be useful for teams running their own neptune config; there would be no shared instance for amundsen CI at launch (we could look into spinning up a cluster in amundsen CI with AWS open source credits for some shared neptune instance in CI if many folks end up contributing to `amundsen-gremlin`).
For the moment I'm mostly wondering if you're interested in the abstract proxy tests for neo4j using PUT/POST endpoints in the neo4j proxy. It's some additional work, but I'd be happy to do it as an example, useful to non-neptune folks, of what an implementation of the abstract proxy would look like. What do you think?
I need to think a bit more about whether we need that for neo4j. I think one of the pain points is doing an integration test between all three services (FE, metadata, search). Testing neo4j directly is fine if we support multiple neo4j versions, but currently we haven't had any plan or bandwidth for that. Or could you share your PR (pseudo-code) of exactly what PUT/POST endpoints you are referring to? In terms of the other point, I don't think we have a plan to refactor the neo4j proxy into its own repo; the proxy itself is a thin layer connecting to neo4j, and we have the atlas proxy as well. At Lyft, we run neo4j as a separate service which just bootstraps neo4j itself with Lyft deployment stuff. Both ETL (Airflow / databuilder) and metadata can connect to the service.
For the gremlin proxy, are you planning to add it to `amundsen-gremlin` instead of metadata? If so, any reason? If not, I am curious what you plan to put into the repo. For the library or common shareable code, I think we could leverage the amundsencommon repo?
Hi Tao, thanks for your quick response!
Sample excerpt of one put function in our neo4j proxy:
```python
def put_column(self, *, table_uri: str, column: Column) -> None:
    """
    Update column with user-supplied data
    """
    column_uri: str = f'{table_uri}/{column.name}'
    with LocalTransaction(driver=self._driver, commit_every_run=False) as ltx:
        ltx.upsert(node={'key': column_uri,
                         'name': column.name,
                         'sort_order': column.sort_order,
                         'type': column.col_type},
                   node_type='Column')
        ltx.link(node1={'key': column_uri}, node_type1='Column',
                 node2={'key': table_uri}, node_type2='Table', relation_name='COLUMN',
                 direction=Direction.TWO_TO_ONE)
    # Add the description if present
    if column.description is not None:
        self.put_column_description(table_uri=table_uri, column_name=column.name, description=column.description)
```
I think if you pulled out the neo4j write code from databuilder into a separate package, it would probably look more like:
```python
def put_column(self, *, table_uri: str, column: Column) -> None:
    """
    for testing only, put a column
    """
    amundsen_neo4j.put_column(table_uri, column)
```

It looks like y'all use the csv loader? So maybe `put_column` creates a csv and calls the neo4j load function? Or maybe it would look more like `csv = amundsen_neo4j.transform_column(column)`, `amundsen_neo4j.load(csv)`.
Certainly the easiest thing for me is not to upload the PUT/POST endpoints we used and to leave it as an exercise for others. It does have the unfortunate effect that there's only one reference implementation for the abstract proxy tests and it's still only useful to us for a while, but maybe our PUT/POST endpoints would be more trouble than they're worth for others using databuilder.
`amundsen-gremlin` or `amundsen-common`, because then it can be used in databuilder for writes and in metadataservice for roundtrip testing. I think either would be reasonable, but it might be surprising if `amundsen-common` ended up being mostly gremlin-specific code. Our neptune bulk loader is about 2k lines of code, excluding some bits shared between databuilder and metadataservice that we've already pulled into our fork of `amundsen-common`.
Sorry for the late reply, I am OOO most of this week except for a few meetings. Will get back to the comment next week.
I like a Gremlin proxy, certainly for its flexibility. Just a couple of caveats. In our use of Apache Atlas, which relies on JanusGraph, we found that Gremlin interfacing tends to be slow, especially when searching the graph, if that is required. Secondly, a Gremlin endpoint used to be available in Atlas but was removed for security reasons: basically, that endpoint allowed you to circumvent any authorization requirements on the data. I suggest never exposing the Gremlin proxy to the end user; it should only be used via specific API calls. I also suggest verifying the performance of those endpoints thoroughly and at scale. Finally, just a note to ourselves ;-), this probably warrants a permission model in Amundsen (hasPermission).
Interesting context re: Atlas, thanks for sharing!!
So far we haven't seen any significant performance issues on the read side. We've basically switched to using vertex ids everywhere, which likely helps a lot.
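As a rough illustration (not our proxy code; the endpoint and keys below are placeholders), the difference between a property lookup and an id lookup in gremlin-python looks like this:

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

conn = DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

table_key = 'hive://gold.core/orders'

# Property-based lookup: has() filters vertices on a property value.
table = g.V().has('Table', 'key', table_key).valueMap(True).next()

# Id-based lookup: Neptune lets you choose vertex ids at write time, so if the
# id is set to the Amundsen key, the same read becomes a direct fetch by id.
table = g.V(table_key).valueMap(True).next()

conn.close()
```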
Definitely we are not planning to expose any user-facing gremlin query endpoints (the very idea gives me indigestion).
@bolkedebruin when you say your Atlas-with-JanusGraph search is slow, were you searching an external Elasticsearch index? Based on our experience developing something similar on DataStax Enterprise Graph (DSE), you may need to introduce a detailed partitioning mechanism for your graph. The same applies to Azure CosmosDB, or the cost will be high. Also, CosmosDB integration will be a little tricky, as their GraphSON v3 format is slightly different from others.
Definitely don't give users direct access to a gremlin query endpoint. Most graph DBs do not have a detailed permission scheme (DSE has vertex-label-level permissions because each vertex label is a Cassandra table).
From my side, I am happy to help port some gremlin integration once the approach is clear.
With the merging of the neptune proxy, I think we can close this ticket! WDYT @feng-tao ?
Totally! Thanks @friendtocephalopods @alran and everyone involved for the gremlin feature support!
Related Issue: #160 Pluggable Graph Database to Support Janus
Expected Behavior or Use Case
We have been working on integrating Neptune with our version of Amundsen. To do this, we built a gremlin proxy that can work across other gremlin-supported databases (e.g. JanusGraph, BlazeGraph, etc.). Below I give an outline of the approach taken by Square to integrate our version of Amundsen with Neptune, provide context, and clarify design decisions.
Background
Deciding on Neptune: Square originally implemented Amundsen using Neo4j as our graph database. For a variety of reasons (mostly related to high fixed licensing costs of neo4j relative to the pay-per-use model of cloud databases), we decided to switch. We eventually settled on Neptune, which was not supported at the time by existing open-source Amundsen code.
Deciding on Gremlin: We chose to communicate with Neptune using the Gremlin query language because it was the most flexible across databases. With one gremlin query, we could use Neptune, JanusGraph, BlazeGraph... even Neo4j has a gremlin engine! (for learning, we recommend Practical Gremlin, by Kelvin Lawrence)
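To make the portability point concrete, here's a minimal gremlin-python traversal; only the websocket URL changes between backends (the endpoint below is a placeholder and the query is illustrative, not taken from our proxy):

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import T

# Any TinkerPop-enabled endpoint works here (Neptune, JanusGraph, Gremlin Server, ...).
conn = DriverRemoteConnection('wss://your-graph-endpoint:8182/gremlin', 'g')
g = traversal().withRemote(conn)

# Count vertices by label, e.g. how many Table / Column / User vertices exist.
print(g.V().groupCount().by(T.label).next())

conn.close()
```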
Things That Didn’t Work
Square spent a good amount of time experimenting with approaches to loading data in Neptune that did not ultimately work for our use case. We include them here as a warning for future contributors.
Our Architecture: Where We Landed
Eventually, we switched to using the Neptune Bulk Loader to write most data. At a high level, the flow looks like this:
<Check out the architecture image in the PDF linked here>
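For reference, a bulk load is kicked off through the Neptune loader HTTP API. The sketch below is illustrative (endpoint, bucket, and IAM role are placeholders) rather than our exact job code:

```python
import requests

# Start a bulk load of Gremlin-format CSVs sitting in S3.
resp = requests.post(
    'https://your-neptune-endpoint:8182/loader',
    json={
        'source': 's3://your-bucket/amundsen/nodes_and_edges/',
        'format': 'csv',                              # Gremlin CSV format
        'iamRoleArn': 'arn:aws:iam::123456789012:role/NeptuneLoadFromS3',
        'region': 'us-west-2',
        'failOnError': 'FALSE',
        'updateSingleCardinalityProperties': 'TRUE',
    },
)
load_id = resp.json()['payload']['loadId']

# The load runs asynchronously; poll the loader endpoint with the load id.
status = requests.get(f'https://your-neptune-endpoint:8182/loader/{load_id}').json()
print(status['payload']['overallStatus']['status'])
```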
Our Gremlin Proxy in the Metadata Service handles reads and some low-rate writes from humans (tags, user descriptions, bookmarks, etc), similar to how the Atlas and Neo4j proxies work today. We made a few key changes:
- Updated test framework
- Shared code in Amundsen Common
- An abstract gremlin proxy (sketched below)
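The abstract gremlin proxy is, roughly, a base class holding the backend-agnostic traversals, with thin subclasses per database. The sketch below is illustrative; class and method names are placeholders, not our exact code:

```python
import abc
from typing import Any, Mapping

from gremlin_python.process.graph_traversal import GraphTraversalSource


class AbstractGremlinProxy(abc.ABC):
    """Backend-agnostic reads/writes expressed as plain Gremlin traversals."""

    @abc.abstractmethod
    def get_graph(self) -> GraphTraversalSource:
        """Return a traversal source for the concrete backend."""

    def get_table(self, *, table_uri: str) -> Mapping[str, Any]:
        # Shared traversal that works on any TinkerPop-enabled database.
        g = self.get_graph()
        return g.V().has('Table', 'key', table_uri).valueMap(True).next()


class NeptuneGremlinProxy(AbstractGremlinProxy):
    def __init__(self, *, endpoint: str) -> None:
        from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
        from gremlin_python.process.anonymous_traversal import traversal
        self._g = traversal().withRemote(DriverRemoteConnection(endpoint, 'g'))

    def get_graph(self) -> GraphTraversalSource:
        return self._g
```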
Path Forward
With a focus on making the Gremlin Proxy usable as quickly as possible, I propose the following upstreaming plan, with each point below being an independent set of work:
CC Lyft @jinhyukchang @feng-tao CC Square @friendtocephalopods @worldwise001 @kathawthorne