amundsen-io / amundsen

Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.
https://www.amundsen.io/amundsen/

feat: Support Neptune #526

Closed · alran closed 4 years ago

alran commented 4 years ago

Related Issue: #160 Pluggable Graph Database to Support Janus

Expected Behavior or Use Case

We have been working on integrating Neptune with our version of Amundsen. To do this, we built a gremlin proxy that can work across other gremlin-supported databases (e.g. JanusGraph, BlazeGraph, etc.). Below I give an outline of the approach taken by Square to integrate our version of Amundsen with Neptune, and provide context and clarify design decisions.
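
A minimal sketch of that idea, assuming the TinkerPop Python driver (gremlin_python) and a placeholder endpoint: the same traversal code talks to Neptune, JanusGraph, or any other Gremlin Server, so in principle only the connection URL changes between backends.

    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    # Placeholder endpoint; swap the URL for JanusGraph, BlazeGraph, etc.
    conn = DriverRemoteConnection('wss://my-neptune-cluster:8182/gremlin', 'g')
    g = traversal().withRemote(conn)

    # Example read: count Table vertices
    table_count = g.V().hasLabel('Table').count().next()
    conn.close()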

Background

Things That Didn’t Work

Square spent a good amount of time experimenting with approaches to loading data into Neptune that ultimately did not work for our use case. We include them here as a warning for future contributors.

Our Architecture: Where We Landed

Eventually, we switched to using the Neptune Bulk Loader to write most data. At a high level, the flow looks like this:

  1. Data Loader gets data from our data sources
  2. Data Loader queries Neptune for existing edge and vertex information (e.g. if a vertex already exists for some data, we fetch what we have, using deterministic vertex and edge IDs)
  3. Data Loader decides which data is correct (e.g. if we have a table vertex with some properties in Neptune that aren’t in the data we have from our data sources, we can write logic to decide which data is correct)
  4. Data Loader writes vertex and edge data to an S3 CSV file, in a specific format expected by the Neptune Bulk Loader.
  5. Data Loader sends a POST request directly to Neptune with information about the CSV file it should import (see the sketch after this list).
  6. Neptune Bulk Loader imports the data, using the given deterministic ids to prevent duplicating existing data.
  7. Data Loader waits for results to come back from Neptune. Neptune will report any errors incurred during the import.
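
A minimal sketch of steps 5-7 (the loader endpoint, S3 prefix, and IAM role are placeholders, and error handling is simplified; this is not our actual loader code):

    import time

    import requests

    NEPTUNE_LOADER = 'https://my-neptune-cluster:8182/loader'  # placeholder endpoint

    # Node CSVs use headers like ~id,~label,name:String and edge CSVs use
    # ~id,~from,~to,~label -- the format step 4 refers to.
    def bulk_load(s3_prefix: str, iam_role_arn: str, region: str) -> str:
        resp = requests.post(NEPTUNE_LOADER, json={
            'source': s3_prefix,          # e.g. 's3://my-bucket/amundsen/'
            'format': 'csv',
            'iamRoleArn': iam_role_arn,
            'region': region,
            'failOnError': 'FALSE',
            'updateSingleCardinalityProperties': 'TRUE',
        })
        resp.raise_for_status()
        return resp.json()['payload']['loadId']

    def wait_for_load(load_id: str, poll_seconds: int = 30) -> dict:
        while True:
            status = requests.get(f'{NEPTUNE_LOADER}/{load_id}').json()
            if status['payload']['overallStatus']['status'] not in (
                    'LOAD_NOT_STARTED', 'LOAD_IN_QUEUE', 'LOAD_IN_PROGRESS'):
                return status  # includes any error messages from the import
            time.sleep(poll_seconds)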

<Check out the architecture image in the PDF linked here>

Our Gremlin Proxy in the Metadata Service handles reads and some low-rate writes from humans (tags, user descriptions, bookmarks, etc.), similar to how the Atlas and Neo4j proxies work today. We made a few key changes:

Path Forward

With a focus on making the Gremlin Proxy usable as quickly as possible, I propose the following upstreaming plan, with each point below being an independent set of work:

CC Lyft @jinhyukchang @feng-tao CC Square @friendtocephalopods @worldwise001 @kathawthorne

alran commented 4 years ago

We'd definitely love feedback on this proposal! In terms of timelines, I'm out next week but can answer questions when I get back. If the community thinks this makes sense, we will be able to dedicate time to upstreaming this work in Q3 (starting soon!)

feng-tao commented 4 years ago

thanks Alyssa, will take a look as well to learn more about the Neptune proxy design.

AndrewCiambrone commented 4 years ago

This looks great, Alyssa! I was wondering, during the data loading process, when does old data get removed?

alran commented 4 years ago

During the data loading jobs, we update the last_updated_at timestamp. It shouldn't be too hard at that point to remove data that doesn't have the most recent date set (we don't currently do this, but it's something we need to do).
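
A rough sketch of what that cleanup could look like with gremlin_python (assuming an already-connected traversal source g and a numeric last_updated_at property; illustrative only, not code we run today):

    from gremlin_python.process.traversal import P

    def remove_stale_vertices(g, latest_load_ts: int) -> None:
        # Drop anything whose last_updated_at was not touched by the latest load
        g.V().has('last_updated_at', P.lt(latest_load_ts)).drop().iterate()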

feng-tao commented 4 years ago

GremlinNeptune Square's Approach.pdf

Let me also upload @alran's doc here, which includes the architecture diagram.

feng-tao commented 4 years ago

hey @alran ,

Sorry for the delay in review.

I think overall Neptune gives you an AWS-managed graph instead of self-hosted neo4j. One main concern will be the migration. Here are also a few questions I have:

  1. Is Neptune HA?
  2. What is the read / write throughput for Neptune? Does Square's data store team provide Neptune as a service?
  3. Does Neptune provide a UI similar to Neo4j's for exploring graph relationships?
  4. Do you know if it is possible to use Cypher on Neptune, which could help minimize the migration of metadata? And how do you migrate your existing neo4j metadata to Neptune?
  5. How big (# nodes and # relationships) is the graph at Square that caused the transactional issues? And what is the production graph size at Square as of today?
  6. Let us know if you plan to change any shared interfaces, which we could discuss if so.
  7. Do you implement the ingestion loader through an Airflow DAG?
  8. Looking at the diagram, your approach seems to match what we have with databuilder, except you load directly into Neptune instead of going through the metadata service? Any reason for not leveraging the databuilder framework (let us know if that's your plan)?

Thanks a lot and sorry for the delay in reviewing.

feng-tao commented 4 years ago

cc @jinhyukchang as an expert on the neo4j loader.

feng-tao commented 4 years ago

also, do you use Apache TinkerPop (http://kelvinlawrence.net/book/PracticalGremlin.html)?

alran commented 4 years ago

1. Is Neptune HA? Yes! More info about this in the Neptune docs.

2. What is the read / write throughput for Neptune? Reads and writes are different, and writes are different in bulk loading vs transactional. The bulk interface does something like 10K vertices a second (pretty fast). For transactional, even in a best case with kinesis/lambda we never got up to 1000s a second. We got to high hundreds a second with lots of errors. We've never really tested the reads in this setup, but it must be in the 1000s per second.

3. Does Square's data store team provide Neptune as a service? AWS provides it as a service, so our data store team doesn’t do anything

4. Does Neptune provide a UI similar to Neo4j to explore graph relationships? Neptune has partners that do graph visualizations. We wired up Graphexp (on github)

5. Do you know if it is possible to use Cypher on Neptune, which could help minimize the migration of metadata? And how do you migrate your existing neo4j metadata to Neptune? It’s not possible to use Cypher with Neptune (but it IS possible to use Gremlin with Neo4j!). We were able to get a graphML dump from Neo4j and load it directly into Neptune using the bulk loader.

6. How big (# nodes and # relationships) is the graph at Square that caused the transactional issues? And what is the production graph size at Square as of today? Our graph is currently ~25M vertices and roughly the same number of edges. This fits in a 4XL or 2XL Neptune instance. A few weeks ago we turned it up to 4XL, but watching it over the last week it’s been fairly idle, so we will probably turn it down again. I don’t think the transactional issues were related to the size of the data so much as the write throughput. When we were doing the kinesis + lambda experiment, we exported data (generated the CSV files that go into the bulk loader) and passed the resulting JSON objects into kinesis. Even though we were writing every node and edge exactly once, almost immediately we had concurrency exceptions. There would have been fewer than 100K nodes in the db at that point and we already had those exceptions.

7. Let us know if you plan to change any shared interfaces, which we could discuss if so. It seems like the main shared interface we’d be interested in changing is the way testing is set up in the metadata library. We have a concept of an “abstract_test_proxy” with tests that can run against any proxy (neo4j, gremlin, etc.). This has helped us make sure that our new proxy was returning / doing exactly the same thing as the neo4j proxy. You’d still be able to override a specific proxy test if necessary. Was there anything else you were thinking about for this question? Am I misreading?

8. Do you implement the ingestion loader through an Airflow DAG? No. But we could!

9. Looking at the diagram, your approach seems to match what we have with databuilder, except you load directly into Neptune instead of going through the metadata service? Any reason for not leveraging the databuilder framework (let us know if that's your plan)? We have set this up inside our version of databuilder, so I think it should still work well with what exists in Amundsen! Jobs in our databuilder load the data directly into Neptune (which I think is the main difference).

10. Do you use Apache TinkerPop? Yes! We highly recommend the Practical Gremlin book you linked in your comment!

jornh commented 4 years ago

Regarding:

5. Do you know if it is possible to use cypher on Neptune which could help to minimize the migration of metadata?

There is https://github.com/opencypher/cypher-for-gremlin though, which also mentions Neptune with some test coverage. I haven't tried it, though, and I'm unsure if it's a good idea to wrap too many layers of different protocols "just because you can".

It seems the graph query language war isn't over yet. I noticed https://www.gqlstandards.org/ as a result of https://gql.today/. I assume it could be years before a real standard emerges, let alone wide support.

Anyways, huge thanks for putting ☝️ info up here. This issue https://github.com/lyft/amundsen/issues/526 is becoming a good knowledge resource in and of itself.


Lastly, @alran, with Path Forward in https://github.com/lyft/amundsen/issues/526#issue-645758230, is there any place you see others can contribute now/soonish, or do we just have to wait until it's Q3 for y'all at Square? (We're potentially interested due to support in Azure via https://docs.microsoft.com/en-us/azure/cosmos-db/gremlin-support)

alran commented 4 years ago

We did actually try out cypher-for-gremlin. We found that (1) it didn't generate idiomatic gremlin and (2) even the idiomatic gremlin upserts don't work great with Neptune.
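
(For context, the idiomatic gremlin upsert typically follows the fold()/coalesce()/unfold() pattern; a rough gremlin_python sketch with an example label and properties, assuming an already-connected traversal source g:)

    from gremlin_python.process.graph_traversal import __

    def upsert_table(g, table_key: str, name: str) -> None:
        # Reuse the vertex if it already exists, otherwise create it, then set properties
        (g.V().has('Table', 'key', table_key)
          .fold()
          .coalesce(__.unfold(),
                    __.addV('Table').property('key', table_key))
          .property('name', name)
          .next())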

I'm happy to report that we are currently in Q3! It started at the beginning of July. My main goal here is to make sure that the Amundsen community is happy with our approach so that we can relatively quickly upstream our work, once we begin. We are finalizing timelines for planned work for this quarter within our team on Monday.

alran commented 4 years ago

FYI @friendtocephalopods is going to follow up here in the near future :)

friendtocephalopods commented 4 years ago

Hi @feng-tao and other gremlin-interested friends,

Here's a slightly more detailed plan I've been thinking about:

For the moment I'm mostly wondering if you're interested in the abstract proxy tests for neo4j, using PUT/POST endpoints in the neo4j proxy. It's some additional work, but I'd be happy to do it as an example, useful to non-neptune folks, of what an implementation of the abstract proxy would look like. What do you think?
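
To make the idea concrete, here's a rough sketch of the abstract proxy test pattern (the Column stand-in, proxy methods, and URIs are illustrative, not the actual amundsen interfaces):

    import abc
    import unittest
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Column:  # stand-in for the real amundsen Column model
        name: str
        col_type: str
        sort_order: int
        description: Optional[str] = None

    class AbstractProxyTest(abc.ABC):
        """Roundtrip tests shared by every proxy implementation."""

        @abc.abstractmethod
        def get_proxy(self):
            """Return a proxy wired to a disposable test backend."""

        def test_put_then_get_column(self) -> None:
            proxy = self.get_proxy()
            proxy.put_column(table_uri='hive://gold.test/orders',
                             column=Column(name='id', col_type='bigint', sort_order=0))
            table = proxy.get_table(table_uri='hive://gold.test/orders')
            assert 'id' in [c.name for c in table.columns]

    class Neo4jProxyTest(AbstractProxyTest, unittest.TestCase):
        def get_proxy(self):
            ...  # e.g. a Neo4jProxy pointed at a local test neo4j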

feng-tao commented 4 years ago

  1. I need to think a bit more about whether we need that for neo4j. I think one of the pain points is doing an integration test between all three services (FE, metadata, search). Testing neo4j directly is fine if we support multiple neo4j versions, but currently we haven't had any plan or bandwidth for that. Or could you share your PR (or pseudo-code) of exactly what PUT/POST endpoints you are referring to? On the other point, I don't think we have a plan to refactor the neo4j proxy into its own repo; the proxy itself is a thin layer connecting to neo4j, and we have the atlas proxy as well. At Lyft, we run neo4j as a separate service which just bootstraps neo4j itself with Lyft deployment stuff. Both ETL (Airflow/databuilder) and metadata could connect to the service.

  2. For the gremlin proxy, are you planning to add it to amundsen-gremlin instead of metadata? If so, any reason? If not, I am curious what you plan to put into the repo. For the library or common shareable code, I think we could leverage the amundsencommon repo?

friendtocephalopods commented 4 years ago

Hi Tao, thanks for your quick response!

  1. Agreed, we haven't quite cracked a great integration test between all 3 services internally yet either, but we found adding roundtrip tests in the proxy to be a major help at catching bugs when doing heavy development on our proxy. It helped a little bit that in our case all of our neo4j write code was also in the proxy via PUT/POST endpoints. It's a bit more work when databuilder is doing all the writes; y'all would probably want to do what I plan for the gremlin proxy and pull the write code into a separate, shared package in order to do a roundtrip test in the metadataservice.

Sample excerpt of one put function in our neo4j proxy:

    def put_column(self, *, table_uri: str, column: Column) -> None:
        """
        Update column with user-supplied data
        """
        column_uri: str = f'{table_uri}/{column.name}'

        with LocalTransaction(driver=self._driver, commit_every_run=False) as ltx:
            ltx.upsert(node={'key': column_uri,
                             'name': column.name,
                             'sort_order': column.sort_order,
                             'type': column.col_type},
                       node_type='Column')
            ltx.link(node1={'key': column_uri}, node_type1='Column',
                     node2={'key': table_uri}, node_type2='Table', relation_name='COLUMN',
                     direction=Direction.TWO_TO_ONE)

        # Add the description if present
        if column.description is not None:
            self.put_column_description(table_uri=table_uri, column_name=column.name, description=column.description)

I think if you pulled out the neo4j write code from databuilder into a separate package, it would probably look more like:

    def put_column(self, *, table_uri: str, column: Column) -> None:
        """
        for testing only, put a column
        """
        amundsen_neo4j.put_column(table_uri, column)

It looks like y'all use the csv loader? So maybe put_column creates a csv and calls the neo4j load function? Or maybe it would look more like csv = amundsen_neo4j.transform_column(column); amundsen_neo4j.load(csv)

Certainly the easiest thing for me is to not upload the PUT/POST endpoints we used and leave it as an exercise to others. It does have the unfortunate effect that there's only one reference implementation for the abstract proxy tests and it's still only useful to us for a while, but maybe our PUT/POST endpoints would be more trouble than they're worth for others using databuilder.

  2. I think we'd definitely need to pull it into some shared repo like amundsen-gremlin or amundsen-common because then it can be used in databuilder for writes and metadataservice for roundtrip testing. I think either would be reasonable, but it might be surprising if amundsen-common ended up being mostly gremlin-specific code. Our neptune bulk loader is about 2k lines of code, excluding some bits shared between databuilder and metadataservice we've already pulled into our fork of amundsen-common.

feng-tao commented 4 years ago

sorry for the late reply, I am OOO most of the time this week except a few meetings. Will get back to the comment next week.

bolkedebruin commented 4 years ago

I like a Gremlin proxy, certainly for its flexibility. Just a couple of caveats. In our use of Apache Atlas, which relies on JanusGraph, we found that Gremlin interfacing tends to be slow, especially when searching the graph if that is required. Secondly, a Gremlin endpoint used to be available in Atlas, but was kicked out for security reasons: basically, that endpoint allowed you to circumvent any authorization requirements on the data. I suggest never exposing the Gremlin proxy to the end user; it should only be used behind particular API calls. I also suggest verifying the performance of those endpoints thoroughly and at scale. Finally, just a note to ourselves ;-), this probably warrants a permission model in Amundsen (hasPermission).

friendtocephalopods commented 4 years ago

Interesting context re: Atlas, thanks for sharing!!

So far we haven't seen any significant performance issues on the read side. We've basically switched to using vertex ids everywhere, which likely helps a lot.
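
For illustration, the difference is roughly the following (g is a connected gremlin_python traversal source; using the table key as the vertex id reflects the deterministic-id scheme described earlier in this issue):

    def find_table_by_property(g, table_key: str):
        return g.V().has('Table', 'key', table_key).next()  # property/index lookup

    def find_table_by_id(g, table_key: str):
        return g.V(table_key).next()  # direct lookup by deterministic vertex id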

Definitely we are not planning to expose any user-facing gremlin query endpoints (the very idea gives me indigestion).

ZhijieWang commented 4 years ago

@bolkedebruin when you say Atlas with JanusGraph search is slow, were you searching an external Elasticsearch index? Based on our experience developing something similar on DataStax Enterprise Graph (DSE), you may need to introduce a detailed partitioning mechanism for your graph. The same applies to Azure CosmosDB, or the cost will be high. Also, CosmosDB integration will be a little tricky, as its GraphSON v3 format is slightly different from others.

Definitely don't give users direct access to a gremlin query endpoint. Most graph DBs do not have a detailed permission scheme (DSE has vertex-label-level permissions because each vertex label is a Cassandra table).

From my side, I am happy to help port some gremlin integration once the approach is clear.

friendtocephalopods commented 4 years ago

With the merging of the neptune proxy, I think we can close this ticket! WDYT @feng-tao ?

feng-tao commented 4 years ago

totally! thanks @friendtocephalopods @alran and everyone behind the gremlin feature support!