DataONEorg / slinky

Slinky, the DataONE Graph Store

Engineering problems left over from d1lod work #9

Open amoeba opened 3 years ago

amoeba commented 3 years ago

@ThomasThelen and I touched base today to go over the existing codebase. I wanted to document some of the issues we talked about here for us and others to see. They're all really leftover technical debt from the initial GeoLink work on this:

  1. Co-reference resolution and IDs: The current system tries to re-use identifiers for people and (I think?) organizations when there's a reasonably high chance they refer to the same thing. Since we didn't already have identifiers for them, we minted random, opaque IDs. The problem this created is that when we re-generate the graph from scratch (see 2), the opaque IDs can change, which could be a problem. I can think of a few solutions here (see the first sketch below) but it's a thing to think about. What we do here might interact with our thoughts about other types of co-reference resolution.
  2. Re-generating the graph when we change the code or any triplification patterns: Under the current system, we wipe the entire graph and re-build it whenever we make changes to the codebase that affect triplification patterns. We do have mechanisms in place to use disk-cached metadata records to speed things up, but it's still slow. We also don't have a way to rebuild the graph while still serving requests against the existing graph. I've been thinking that we might maintain a write-ahead log as a way to quickly re-build the graph (see the second sketch below).
  3. Search visibility: We danced around this in our first implementation by only triplifying publicly-visible content. This works well because most public content can be expected to stay public, and the really sensitive stuff is usually inside the data objects, which we weren't triplifying. This is pretty reasonable but could be a lot better. We might be able to handle it if we wrap the SPARQL query engine in an HTTP API and handle object access at that level. Part of the problem is that it's hard to know how to map a single object to the triples we inserted into the graph about it. e.g., if Bryce and Tommy both assert the sky is blue and Bryce later decides he wants to recant his statement, what do we do? (The third sketch below shows one way to track this.)
  4. Logging/observability system: This was always clunkier than I'd like. The whole stack used up way more resources than the service itself and broke often. As we look to migrate forward (#3), I think we should strip all of this out and find a new approach. I'm sure a lot has changed since we built this, and using k8s might mean this isn't really a Slinky concern anymore and is more of a k8s cluster concern.
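
One possible direction for item 1, as a minimal sketch: derive the opaque ID deterministically from whatever key the co-reference resolver already matches on, so a full rebuild mints the same URI. The namespace URL, key fields, and function names below are all hypothetical, not anything in the current codebase:

```python
import uuid

# Hypothetical namespace for Slinky-minted person IDs (not a real project constant).
SLINKY_PERSON_NS = uuid.uuid5(uuid.NAMESPACE_URL, "https://dataone.org/slinky/person")

def normalize_key(full_name, email=None):
    """Build the co-reference key from whatever fields the resolver matches on."""
    parts = [" ".join(full_name.split()).lower()]
    if email:
        parts.append(email.strip().lower())
    return "|".join(parts)

def mint_person_id(full_name, email=None):
    """Deterministic, opaque ID: the same inputs give the same URI across rebuilds."""
    return "urn:uuid:" + str(uuid.uuid5(SLINKY_PERSON_NS, normalize_key(full_name, email)))

# The same person described slightly differently still gets one stable identifier.
assert mint_person_id("Bryce Mecum", "bryce@example.org") == \
       mint_person_id("  Bryce  Mecum", "BRYCE@example.org")
```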
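
A rough sketch of the write-ahead log idea from item 2. It assumes a JSON-lines log keyed by PID and a generic bulk-insert callable, neither of which exists in the codebase today; it's just to show the append/replay shape:

```python
import json
from pathlib import Path
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]

WAL_PATH = Path("slinky-wal.jsonl")  # hypothetical location

def append_to_wal(pid: str, triples: Iterable[Triple]) -> None:
    """Record the triples produced for one object; append-only, so it's cheap."""
    with WAL_PATH.open("a") as f:
        f.write(json.dumps({"pid": pid, "triples": list(triples)}) + "\n")

def replay_wal(insert_fn) -> None:
    """Rebuild a fresh graph by replaying the log into a new store while the old
    one keeps serving; `insert_fn` stands in for the store's bulk-insert call."""
    with WAL_PATH.open() as f:
        for line in f:
            entry = json.loads(line)
            insert_fn(entry["pid"], [tuple(t) for t in entry["triples"]])
```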
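
And a toy illustration of one way to answer the retraction question in item 3: track which source objects asserted each triple and only physically delete a triple when its last asserting source goes away (a reference count, essentially). Names and structure are made up for illustration:

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

class AssertionIndex:
    """Maps each triple to the set of source objects that asserted it."""

    def __init__(self) -> None:
        self._sources: Dict[Triple, Set[str]] = defaultdict(set)

    def assert_triples(self, source_pid: str, triples):
        for t in triples:
            self._sources[t].add(source_pid)

    def retract_source(self, source_pid: str):
        """Return the triples that can now be deleted from the store."""
        deletable = []
        for t, pids in list(self._sources.items()):
            pids.discard(source_pid)
            if not pids:
                deletable.append(t)
                del self._sources[t]
        return deletable

# Bryce and Tommy both assert the same triple; retracting Bryce's record deletes
# nothing, and retracting Tommy's afterwards finally removes the triple.
idx = AssertionIndex()
sky = ("ex:sky", "ex:color", "blue")
idx.assert_triples("urn:bryce", [sky])
idx.assert_triples("urn:tommy", [sky])
assert idx.retract_source("urn:bryce") == []
assert idx.retract_source("urn:tommy") == [sky]
```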

If you read this and have any questions or additional items to add, please do and I'll update this.

ThomasThelen commented 3 years ago

Regarding the fourth point: the ELK stack is one of the most popular logging stacks for Kubernetes. I'm in the middle of getting it up and running (without rq-dashboard) alongside the rest of the core services. Given the new deployment it'll be worthwhile to re-test things like performance, but when it comes to logging stacks on k8s it looks like a tomato/tomahto situation.