amundsen-io / amundsen

Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
https://www.amundsen.io/amundsen/

feat: Support Neptune #526

Closed · alran closed 4 years ago

alran commented 4 years ago

Related Issue: #160 Pluggable Graph Database to Support Janus

Expected Behavior or Use Case

We have been working on integrating Neptune with our version of Amundsen. To do this, we built a gremlin proxy that can work across other gremlin-supported databases (e.g. JanusGraph, BlazeGraph, etc.). Below I give an outline of the approach taken by Square to integrate our version of Amundsen with Neptune, and provide context and clarify design decisions.
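
A minimal sketch of that idea, assuming the TinkerPop Python driver (gremlin_python) and a placeholder endpoint: the same traversal code talks to Neptune, JanusGraph, or any other Gremlin Server, so in principle only the connection URL changes between backends.

    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    # Placeholder endpoint; swap the URL for JanusGraph, BlazeGraph, etc.
    conn = DriverRemoteConnection('wss://my-neptune-cluster:8182/gremlin', 'g')
    g = traversal().withRemote(conn)

    # Example read: count Table vertices
    table_count = g.V().hasLabel('Table').count().next()
    conn.close()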

Background

Things That Didn’t Work

Square spent a good amount of time experimenting with approaches to loading data into Neptune that ultimately did not work for our use case. We include them here as a warning for future contributors.

Our Architecture: Where We Landed

Eventually, we switched to using the Neptune Bulk Loader to write most data. At a high level, the flow looks like this:

  1. Data Loader gets data from our data sources
  2. Data Loader queries Neptune for existing edge and vertex information (e.g. if a vertex already exists for some data, we fetch what we have, using deterministic vertex and edge IDs)
  3. Data Loader decides which data is correct (e.g. if we have a table vertex with some properties in Neptune that aren’t in the data we have from our data sources, we can write logic to decide which data is correct)
  4. Data Loader writes vertex and edge data to an S3 CSV file, in a specific format expected by the Neptune Bulk Loader.
  5. Data Loader sends a POST request directly to Neptune with information about the CSV file it should import (see the sketch after this list).
  6. Neptune Bulk Loader imports the data, using the given deterministic ids to prevent duplicating existing data.
  7. Data Loader waits for results to come back from Neptune. Neptune will report any errors incurred during the import.
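
A minimal sketch of steps 5-7 (the loader endpoint, S3 prefix, and IAM role are placeholders, and error handling is simplified; this is not our actual loader code):

    import time

    import requests

    NEPTUNE_LOADER = 'https://my-neptune-cluster:8182/loader'  # placeholder endpoint

    # Node CSVs use headers like ~id,~label,name:String and edge CSVs use
    # ~id,~from,~to,~label -- the format step 4 refers to.
    def bulk_load(s3_prefix: str, iam_role_arn: str, region: str) -> str:
        resp = requests.post(NEPTUNE_LOADER, json={
            'source': s3_prefix,          # e.g. 's3://my-bucket/amundsen/'
            'format': 'csv',
            'iamRoleArn': iam_role_arn,
            'region': region,
            'failOnError': 'FALSE',
            'updateSingleCardinalityProperties': 'TRUE',
        })
        resp.raise_for_status()
        return resp.json()['payload']['loadId']

    def wait_for_load(load_id: str, poll_seconds: int = 30) -> dict:
        while True:
            status = requests.get(f'{NEPTUNE_LOADER}/{load_id}').json()
            if status['payload']['overallStatus']['status'] not in (
                    'LOAD_NOT_STARTED', 'LOAD_IN_QUEUE', 'LOAD_IN_PROGRESS'):
                return status  # includes any error messages from the import
            time.sleep(poll_seconds)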

<Check out the architecture image in the PDF linked here>

Our Gremlin Proxy in the Metadata Service handles reads and some low-rate writes from humans (tags, user descriptions, bookmarks, etc.), similar to how the Atlas and Neo4j proxies work today. We made a few key changes:

Path Forward

With a focus on making the Gremlin Proxy usable as quickly as possible, I propose the following upstreaming plan, with each point below being an independent set of work:

CC Lyft @jinhyukchang @feng-tao CC Square @friendtocephalopods @worldwise001 @kathawthorne

alran commented 4 years ago

We'd definitely love feedback on this proposal! In terms of timelines, I'm out next week but can answer questions when I get back. If the community thinks this makes sense, we will be able to dedicate time to upstreaming this work in Q3 (starting soon!)

feng-tao commented 4 years ago

thanks Alyssa, will take a look as well to learn more about the Neptune proxy design.

AndrewCiambrone commented 4 years ago

This looks great, Alyssa! I was wondering, during the data loading process, when does old data get removed?

alran commented 4 years ago

During the data loading jobs, we update the last_updated_at timestamp. It shouldn't be too hard at that point to remove data that doesn't have the most recent date set (we don't currently do this, but it's something we need to do).
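
A rough sketch of what that cleanup could look like with gremlin_python (assuming an already-connected traversal source g and a numeric last_updated_at property; illustrative only, not code we run today):

    from gremlin_python.process.traversal import P

    def remove_stale_vertices(g, latest_load_ts: int) -> None:
        # Drop anything whose last_updated_at was not touched by the latest load
        g.V().has('last_updated_at', P.lt(latest_load_ts)).drop().iterate()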

feng-tao commented 4 years ago

GremlinNeptune Square's Approach.pdf

Let me also upload @alran's doc here, which includes the architecture diagram.

feng-tao commented 4 years ago

hey @alran ,

Sorry for the delay in review.

I think overall Neptune gives you an AWS-managed graph instead of self-hosted neo4j. One main concern will be the migration. Here are also a few questions I have:

  1. Is Neptune HA?
  2. What is the read / write throughput for Neptune? Does Square's data store team provide Neptune as a service?
  3. Does Neptune provide a UI similar to Neo4j's for exploring graph relationships?
  4. Do you know if it is possible to use Cypher on Neptune, which could help minimize the migration of metadata? And how do you migrate your existing neo4j metadata to Neptune?
  5. How big (# nodes and # relationships) is the graph at Square that caused the transactional issues? And what is the production graph size at Square as of today?
  6. Let us know if you plan to change any shared interfaces, which we could discuss if so.
  7. Do you implement the ingestion loader through an Airflow DAG?
  8. Looking at the diagram, your approach seems to match what we have with databuilder, except you load directly into Neptune instead of going through the metadata service? Any reason for not leveraging the databuilder framework (let us know if that's your plan)?

Thanks a lot and sorry for the delay in reviewing.

feng-tao commented 4 years ago

cc @jinhyukchang as an expert on the neo4j loader.

feng-tao commented 4 years ago

also, do you use Apache TinkerPop (http://kelvinlawrence.net/book/PracticalGremlin.html)?

alran commented 4 years ago

1. Is Neptune HA? Yes! More info about this in the Neptune docs.

2. What is the read / write throughput for Neptune? Reads and writes are different, and writes are different in bulk loading vs transactional. The bulk interface does something like 10K vertices a second (pretty fast). For transactional, even in a best case with kinesis/lambda we never got up to 1000s a second. We got to high hundreds a second with lots of errors. We've never really tested the reads in this setup, but it must be in the 1000s per second.

3. Does Square's data store team provide Neptune as a service? AWS provides it as a service, so our data store team doesn’t do anything

4. Does Neptune provide a UI similar to Neo4j to explore graph relationships? Neptune has partners that do graph visualizations. We wired up Graphexp (on github)

5. Do you know if it is possible to use Cypher on Neptune, which could help minimize the migration of metadata? And how do you migrate your existing neo4j metadata to Neptune? It’s not possible to use Cypher with Neptune (but it IS possible to use Gremlin with Neo4j!). We were able to get a graphML dump from Neo4j and load it directly into Neptune using the bulk loader.

6. How big (# nodes and # relationships) is the graph at Square that caused the transactional issues? And what is the production graph size at Square as of today? Our graph is currently ~25M vertices and roughly the same number of edges. This fits in a 4XL or 2XL Neptune instance. A few weeks ago we turned it up to 4XL, but watching it over the last week it’s been fairly idle, so we will probably turn it down again. I don’t think the transactional issues were related to the size of the data so much as the write throughput. When we were doing the kinesis + lambda experiment, we exported data (generated the CSV files that go into the bulk loader) and passed the resulting JSON objects into kinesis. Even though we were writing every node and edge exactly once, almost immediately we had concurrency exceptions. There would have been fewer than 100K nodes in the db at that point and we already had those exceptions.

7. Let us know if you plan to change any shared interfaces, which we could discuss if so. It seems like the main shared interface we’d be interested in changing is the way testing is set up in the metadata library. We have a concept of an “abstract_test_proxy” with tests that can run against any proxy (neo4j, gremlin, etc.). This has helped us make sure that our new proxy was returning / doing exactly the same thing as the neo4j proxy. You’d still be able to override a specific proxy test if necessary. Was there anything else you were thinking about for this question? Am I misreading?

8. Do you implement the ingestion loader through an Airflow DAG? No. But we could!

9. Looking at the diagram, your approach seems to match what we have with databuilder, except you load directly into Neptune instead of going through the metadata service? Any reason for not leveraging the databuilder framework (let us know if that's your plan)? We have set this up inside our version of databuilder, so I think it should still work well with what exists in Amundsen! Jobs in our databuilder load the data directly into Neptune (which I think is the main difference).

10. Do you use Apache TinkerPop? Yes! We highly recommend the Practical Gremlin book you linked in your comment!

jornh commented 4 years ago

Regarding:

5. Do you know if it is possible to use cypher on Neptune which could help to minimize the migration of metadata?

There is https://github.com/opencypher/cypher-for-gremlin though, which also mentions Neptune with some test coverage. I haven't tried it, though, and I'm unsure if it's a good idea to wrap too many layers of different protocols "just because you can".

It seems the graph query language war isn't over yet. I noticed https://www.gqlstandards.org/ as a result of https://gql.today/. I assume it could be years before a real standard emerges, let alone wide support.

Anyways, huge thanks for putting ☝️ info up here. This issue https://github.com/lyft/amundsen/issues/526 is becoming a good knowledge resource in and of itself.


Lastly, @alran, with Path Forward in https://github.com/lyft/amundsen/issues/526#issue-645758230, is there any place you see others can contribute now/soonish, or do we just have to wait until it's Q3 for y'all at Square? (We're potentially interested due to support in Azure via https://docs.microsoft.com/en-us/azure/cosmos-db/gremlin-support)

alran commented 4 years ago

We did actually try out cypher-for-gremlin. We found that (1) it didn't generate idiomatic gremlin and (2) even the idiomatic gremlin upserts don't work great with Neptune.
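
(For context, the idiomatic gremlin upsert typically follows the fold()/coalesce()/unfold() pattern; a rough gremlin_python sketch with an example label and properties, assuming an already-connected traversal source g:)

    from gremlin_python.process.graph_traversal import __

    def upsert_table(g, table_key: str, name: str) -> None:
        # Reuse the vertex if it already exists, otherwise create it, then set properties
        (g.V().has('Table', 'key', table_key)
          .fold()
          .coalesce(__.unfold(),
                    __.addV('Table').property('key', table_key))
          .property('name', name)
          .next())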

I'm happy to report that we are currently in Q3! It started at the beginning of July. My main goal here is to make sure that the Amundsen community is happy with our approach so that we can relatively quickly upstream our work, once we begin. We are finalizing timelines for planned work for this quarter within our team on Monday.

alran commented 4 years ago

FYI @friendtocephalopods is going to follow up here in the near future :)

friendtocephalopods commented 4 years ago

Hi @feng-tao and other gremlin-interested friends,

Here's a slightly more detailed plan I've been thinking about:

For the moment I'm mostly wondering if you're interested in the abstract proxy tests for neo4j, using PUT/POST endpoints in the neo4j proxy. It's some additional work, but I'd be happy to do it as an example, useful to non-neptune folks, of what an implementation of the abstract proxy would look like. What do you think?
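
To make the idea concrete, here's a rough sketch of the abstract proxy test pattern (the Column stand-in, proxy methods, and URIs are illustrative, not the actual amundsen interfaces):

    import abc
    import unittest
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Column:  # stand-in for the real amundsen Column model
        name: str
        col_type: str
        sort_order: int
        description: Optional[str] = None

    class AbstractProxyTest(abc.ABC):
        """Roundtrip tests shared by every proxy implementation."""

        @abc.abstractmethod
        def get_proxy(self):
            """Return a proxy wired to a disposable test backend."""

        def test_put_then_get_column(self) -> None:
            proxy = self.get_proxy()
            proxy.put_column(table_uri='hive://gold.test/orders',
                             column=Column(name='id', col_type='bigint', sort_order=0))
            table = proxy.get_table(table_uri='hive://gold.test/orders')
            assert 'id' in [c.name for c in table.columns]

    class Neo4jProxyTest(AbstractProxyTest, unittest.TestCase):
        def get_proxy(self):
            ...  # e.g. a Neo4jProxy pointed at a local test neo4j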

feng-tao commented 4 years ago

  1. I need to think a bit more about whether we need that for neo4j. I think one of the pain points is doing an integration test between all three services (FE, metadata, search). Testing neo4j directly is fine if we support multiple neo4j versions, but currently we haven't had any plan or bandwidth for that. Or could you share your PR (or pseudo-code) of exactly what PUT/POST endpoints you are referring to? On the other point, I don't think we have a plan to refactor the neo4j proxy into its own repo; the proxy itself is a thin layer connecting to neo4j, and we have the atlas proxy as well. At Lyft, we run neo4j as a separate service which just bootstraps neo4j itself with Lyft deployment stuff. Both ETL (Airflow/databuilder) and metadata could connect to the service.

  2. For the gremlin proxy, are you planning to add it to amundsen-gremlin instead of metadata? If so, any reason? If not, I am curious what you plan to put into the repo. For the library or common shareable code, I think we could leverage the amundsencommon repo?

friendtocephalopods commented 4 years ago

Hi Tao, thanks for your quick response!

  1. Agreed, we haven't quite cracked a great integration test between all 3 services internally yet either, but we found adding roundtrip tests in the proxy to be a major help at catching bugs when doing heavy development on our proxy. It helped a little bit that in our case all of our neo4j write code was also in the proxy via PUT/POST endpoints. It's a bit more work when databuilder is doing all the writes; y'all would probably want to do what I plan for the gremlin proxy and pull the write code into a separate, shared package in order to do a roundtrip test in the metadataservice.

Sample excerpt of one put function in our neo4j proxy:

    def put_column(self, *, table_uri: str, column: Column) -> None:
        """
        Update column with user-supplied data
        """
        column_uri: str = f'{table_uri}/{column.name}'

        with LocalTransaction(driver=self._driver, commit_every_run=False) as ltx:
            ltx.upsert(node={'key': column_uri,
                             'name': column.name,
                             'sort_order': column.sort_order,
                             'type': column.col_type},
                       node_type='Column')
            ltx.link(node1={'key': column_uri}, node_type1='Column',
                     node2={'key': table_uri}, node_type2='Table', relation_name='COLUMN',
                     direction=Direction.TWO_TO_ONE)

        # Add the description if present
        if column.description is not None:
            self.put_column_description(table_uri=table_uri, column_name=column.name, description=column.description)

I think if you pulled out the neo4j write code from databuilder into a separate package, it would probably look more like:

    def put_column(self, *, table_uri: str, column: Column) -> None:
        """
        for testing only, put a column
        """
        amundsen_neo4j.put_column(table_uri, column)

It looks like y'all use the csv loader? So maybe put_column creates a csv and calls the neo4j load function? Or maybe it would look more like csv = amundsen_neo4j.transform_column(column); amundsen_neo4j.load(csv)

Certainly the easiest thing for me is to not upload the PUT/POST endpoints we used and leave it as an exercise to others. It does have the unfortunate effect that there's only one reference implementation for the abstract proxy tests and it's still only useful to us for a while, but maybe our PUT/POST endpoints would be more trouble than they're worth for others using databuilder.

  2. I think we'd definitely need to pull it into some shared repo like amundsen-gremlin or amundsen-common because then it can be used in databuilder for writes and metadataservice for roundtrip testing. I think either would be reasonable, but it might be surprising if amundsen-common ended up being mostly gremlin-specific code. Our neptune bulk loader is about 2k lines of code, excluding some bits shared between databuilder and metadataservice we've already pulled into our fork of amundsen-common.

feng-tao commented 4 years ago

sorry for the late reply, I am OOO most of the time this week except a few meetings. Will get back to the comment next week.

bolkedebruin commented 4 years ago

I like a Gremlin proxy, certainly for its flexibility. Just a couple of caveats. In our use of Apache Atlas, which relies on JanusGraph, we found that Gremlin interfacing tends to be slow, especially when searching the graph if that is required. Secondly, a Gremlin endpoint used to be available in Atlas, but was kicked out for security reasons: basically, that endpoint allowed you to circumvent any authorization requirements on the data. I suggest never exposing the Gremlin proxy to the end user; it should only be used behind particular API calls. I also suggest verifying the performance of those endpoints thoroughly and at scale. Finally, just a note to ourselves ;-), this probably warrants a permission model in Amundsen (hasPermission).

friendtocephalopods commented 4 years ago

Interesting context re: Atlas, thanks for sharing!!

So far we haven't seen any significant performance issues on the read side. We've basically switched to using vertex ids everywhere, which likely helps a lot.
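
For illustration, the difference is roughly the following (g is a connected gremlin_python traversal source; using the table key as the vertex id reflects the deterministic-id scheme described earlier in this issue):

    def find_table_by_property(g, table_key: str):
        return g.V().has('Table', 'key', table_key).next()  # property/index lookup

    def find_table_by_id(g, table_key: str):
        return g.V(table_key).next()  # direct lookup by deterministic vertex id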

Definitely we are not planning to expose any user-facing gremlin query endpoints (the very idea gives me indigestion).

ZhijieWang commented 4 years ago

@bolkedebruin when you say Atlas with JanusGraph search is slow, were you searching an external Elasticsearch index? Based on our experience developing something similar on DataStax Enterprise Graph (DSE), you may need to introduce a detailed partitioning mechanism for your graph. The same applies to Azure CosmosDB, or the cost will be high. Also, CosmosDB integration will be a little tricky, as its GraphSON v3 format is slightly different from others.

Definitely don't give users direct access to a gremlin query endpoint. Most graph DBs do not have a detailed permission scheme (DSE has vertex-label-level permissions because each vertex label is a Cassandra table).

From my side, I am happy to help port some gremlin integration once the approach is clear.

friendtocephalopods commented 4 years ago

With the merging of the neptune proxy, I think we can close this ticket! WDYT @feng-tao ?

feng-tao commented 4 years ago

totally! thanks @friendtocephalopods @alran and everyone behind the gremlin feature support!