NationalSecurityAgency / datawave

DataWave is an ingest/query framework that leverages Apache Accumulo to provide fast, secure data access.
https://code.nsa.gov/datawave
Apache License 2.0

Query microservices #1026

Open ivakegg opened 3 years ago

ivakegg commented 3 years ago

The query executor and the various query logics need to be split into microservices such that we can scale each query logic (or executor) up/down separately. The initial design will look something like this:

- Query State storage: stores the current state of a query such that any instance of a query logic execution service can pick up the query where it left off. This includes the query itself, the list of ranges with query plans, and the last range+result that was placed on the results queue.
- Results Queue: a (maybe persistent) per-query queue onto which results are dropped.
- Query Lock storage: a mechanism by which the query execution services control which instance is actually handling a query.
- Query API Service: accepts the existing create/next/close client calls and stores the request in Query State storage. This service pulls results from the results queue to return pages to the client.
- Query Logic X Execution Service: handles creating query plans and pulling results for a query. Query ranges+plans are stored in Query State storage, and results are placed on a query-specific Results Queue.
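To make the contracts concrete, the storage components above could be sketched roughly like this (interface and method names are purely illustrative, not taken from the DataWave codebase):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical contract for Query State storage: enough state that any
// execution service instance can resume a query where it left off.
interface QueryStateStorage {
    void storeQuery(String queryId, String query);                  // the original query
    void storeRanges(String queryId, List<String> rangesWithPlans); // ranges + query plans
    void checkpoint(String queryId, String lastRangeAndResult);     // last result queued
    Optional<String> lastCheckpoint(String queryId);
}

// Hypothetical contract for the per-query Results Queue.
interface ResultsQueue {
    void publish(String queryId, String result); // execution services drop results here
    Optional<String> poll(String queryId);       // API service pulls pages from here
}

// Minimal in-memory stand-in, just to exercise the contract; a real
// implementation would live in a shared, possibly persistent, store.
class InMemoryQueryState implements QueryStateStorage {
    private final Map<String, String> queries = new ConcurrentHashMap<>();
    private final Map<String, List<String>> ranges = new ConcurrentHashMap<>();
    private final Map<String, String> checkpoints = new ConcurrentHashMap<>();
    public void storeQuery(String id, String q) { queries.put(id, q); }
    public void storeRanges(String id, List<String> r) { ranges.put(id, r); }
    public void checkpoint(String id, String c) { checkpoints.put(id, c); }
    public Optional<String> lastCheckpoint(String id) {
        return Optional.ofNullable(checkpoints.get(id));
    }
}
```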

ivakegg commented 3 years ago

A discussion with @FineAndDandy centered around being able to handle multiple ranges out of the range stream at the same time. So if we have an execution service scaled up enough, then we could be handling the execution of multiple ranges at the same time for one query. This means the lock storage would need to handle non-binary semaphores. We could have a pluggable mechanism that would help determine how many locks are available for a specific query which could be computed based on the range/size of the query and the number of execution services available.
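A non-binary lock of this sort could be sketched with a counting semaphore per query, where the permit count comes from a pluggable policy (all names here are hypothetical; the real mechanism would live in shared lock storage, not in-process):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.function.ToIntFunction;

// Sketch of a lock store handing out N concurrent "range execution" permits
// per query. The permit policy is pluggable, e.g. computed from the
// range/size of the query and the number of execution services available.
class QueryLockStore {
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();
    private final ToIntFunction<String> permitPolicy; // queryId -> max concurrent ranges

    QueryLockStore(ToIntFunction<String> permitPolicy) {
        this.permitPolicy = permitPolicy;
    }

    // Returns true if this execution service instance may work a range now.
    boolean tryAcquire(String queryId) {
        return locks.computeIfAbsent(queryId,
                id -> new Semaphore(permitPolicy.applyAsInt(id))).tryAcquire();
    }

    void release(String queryId) {
        Semaphore s = locks.get(queryId);
        if (s != null) s.release();
    }
}
```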

jwomeara commented 3 years ago

We could create individual entries in our caching structure for each of the ranges associated with the query instead of lumping it all together into a single entry.

jwomeara commented 3 years ago

Thinking about the caching layer a bit this morning. We are already using Hazelcast cache, but there are alternatives. Whatever we use, I have been thinking that we might want the option to persist our cached entries (i.e. the query state) to disk so that we don't lose everything if the caching layer goes down. Were you thinking that as well @ivakegg?

If we go with Hazelcast we will need to configure a MapStore for the key/value pairs that we want persisted (info here: https://docs.hazelcast.org/docs/latest/manual/html-single/#loading-and-storing-persistent-data). This gives us the flexibility to use any backing store we want (Accumulo, HDFS, possibly Kafka), but we will have to code that up. I'm also a little bit worried about the idea of multiple different queries causing backend writes at will. They have a feature which allows you to coalesce entries for a single key into one entry before persisting, but I didn't see anything about persisting entries in bulk.
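The shape of that store code would be roughly as follows. This sketch hand-rolls a minimal stand-in interface mirroring the method shape of Hazelcast's `MapStore` (the real one lives in the Hazelcast jar and has additional bulk/keys methods); the backing map stands in for Accumulo/HDFS/Kafka:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hand-rolled stand-in mirroring the core of Hazelcast's MapStore contract.
// The real interface (com.hazelcast.map.MapStore) also defines storeAll,
// loadAll, deleteAll, and loadAllKeys for bulk operations.
interface MapStore<K, V> {
    void store(K key, V value);
    V load(K key);
    void delete(K key);
}

// Illustrative backing store for serialized query state; a real one would
// write through to Accumulo, HDFS, or possibly Kafka.
class QueryStateMapStore implements MapStore<String, String> {
    private final Map<String, String> backing = new ConcurrentHashMap<>();
    public void store(String key, String value) { backing.put(key, value); }
    public String load(String key) { return backing.get(key); }
    public void delete(String key) { backing.remove(key); }
}
```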

Interested in hearing what you think.

ivakegg commented 3 years ago

Yes, I was thinking that the query state needs to be persistent, or at least have the option to be. Short-running queries may not need it. Long-running event queries (greater than N minutes?) probably need to be persisted. So it would be nice to find a layer that allows the client to define when something needs to be persisted, or that can be configured to persist after some age limit.

The result queues are in the same situation and may not need to be persistent, depending on the type of query. LookupUUID may not need persistent results, for example. Event queries that take 45 minutes per page may need some persistence. I'm less sure about that.
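The age-limit idea could be as simple as a predicate evaluated inside the store logic (threshold and names here are illustrative, not a settled design):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of an age-based persistence policy: short-running queries stay
// in memory only; once a query's state is older than the configured limit,
// writes get pushed through to the backing store.
class PersistencePolicy {
    private final Duration ageLimit;

    PersistencePolicy(Duration ageLimit) {
        this.ageLimit = ageLimit;
    }

    boolean shouldPersist(Instant queryStart, Instant now) {
        return Duration.between(queryStart, now).compareTo(ageLimit) >= 0;
    }
}
```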

jwomeara commented 3 years ago

Yeah, we control the logic in the Hazelcast MapStore, so we can decide when messages need to be persisted. I guess that's one advantage of using a solution which requires you to roll your own store functionality.

Per your other comment, since we will be defining the result queues programmatically we can absolutely control which ones are durable/replicated and which ones aren't. As a side note, I would really like to take a look at Kafka if time permits. I think that could be useful in a few cases.
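Since the queues are declared programmatically, per-logic durability could be a simple configuration lookup along these lines (logic names and the default are illustrative only):

```java
import java.util.Map;

// Sketch: decide queue durability per query logic at queue-declaration time.
// Short-running logics (e.g. LookupUUID) can skip durability; long-running
// event queries get durable/replicated queues.
class QueueDurabilityConfig {
    private final Map<String, Boolean> durableByLogic;

    QueueDurabilityConfig(Map<String, Boolean> durableByLogic) {
        this.durableByLogic = durableByLogic;
    }

    boolean isDurable(String queryLogic) {
        // Assumed default: durable, the safer choice for unknown logics.
        return durableByLogic.getOrDefault(queryLogic, true);
    }
}
```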

ivakegg commented 3 years ago

Please do look at Kafka. We need to do a reasonable trade study here.

dmgall commented 3 years ago

I have a question on this idea. How do we keep a different issue (i.e., a network connection issue) from coalescing with a time limit (assuming that is what we go with) and wreaking havoc on the store? I just want to make sure that we don't create a confounding problem when we are debugging the first issue. I'm not saying I am against it either, just trying to play devil's advocate here.


ivakegg commented 3 years ago

So the scenario is if we have a rule that the query state does not get stored unless the query runs over a configured amount of time, and then we have a networking or other issue that causes all queries to stall. In this case we need to ensure we won't have a storage issue or do something that compounds the system issues.

I expect the short-running query logics (e.g. LookupUUID) will be configured to never use a backing store. For the event and other long-running queries we will have to ensure there is enough storage to handle everything in this scenario. Note that this storage is temporary in that it would be cleaned up after a query completes. Also, this state does not include results and hence should be relatively small.

jwomeara commented 3 years ago

Requests from Whitney: