maxinelasp commented 1 year ago

Description

Create a document describing some of the goals and requirements for a system to track how data moves throughout AWS. This should, for example, be able to indicate when data is ready for processing, when data is completed for processing and ready for archiving, and other pieces like this.

Requirements

System for indicating when multiple data files are ready for processing
The system should be able to trigger processing jobs as soon as all the data is uploaded or processed
This is an initial discussion of how the system would work, so just the basics need to be covered, with an overall design and a few ideas for technologies to use
Set up a meeting for further discussion by the team for refining of the diagram

Nice to have or Goal Requirements

Call out a few different styles of data management using different technologies (eg SQS vs Database vs AWS Datalake)

Additional notes

The end result of the meeting should be an overall design doc for the philosophy and initial ideas for the design of the data manager.

Follow up tickets

This spike is not considered complete until at least one follow up issue is created.

[ ] Placeholder ticket

Below is the template for the response to this ticket. Add as many solutions as needed, but preferably include 2-5 for discussion. The response should be posted as a comment on this issue, or linked in a comment.

Solution 1

Write an overview of the solution here.

Pros:

pro 1

Cons:

con 1

Additional notes:

note

Solution 2

Write an overview of the solution here.

Pros:

pro 1

Cons:

con 1

Additional notes:

note

Summary

Write up a summary of your findings, including your preferred solution.

greglucas commented 1 year ago

@maxinelasp, There is a somewhat detailed document here for some of the designs that we can also use for reference: https://lasp.colorado.edu/galaxy/display/IMAP/SDC+Infrastructure+Detailed+Design Note this also has some information about backups in it as well.

maxinelasp commented 1 year ago

I'm going to write some suggestions for changes we can consider for the design laid out in the design document. I am also trying to call out additional concerns that may or many not need to be addressed.

The goal isn't to have these suggestions compete, but rather to start a discussion about potential design changes or overall philosiphy. I also included some notes on some AWS tools that may be useful that I explored.

Solution 1

Right now, the file ingester serves as the entry point to the system, and also does a lot of work writing to databases and managing the data. I suggest splitting this up more to put less responsibility on the ingester, and make the system more flexible.

The ingester can continue to read the manifest file and do the checks on each file, since it will be easiest to confirm all the data is there in the first step.
Rather than then moving the data, perhaps it could trigger another system for moving files (Glue might be a good match for this.) This system could move each file, perhaps do some data verification or cleanup if needed at this step, and trigger processing one file at a time.

Pros:

This will reduce the responsibilities of the ingester and make it faster for responding to incorrect uploads or manifest files.
It could also provide an additional layer for tracking and logging, without having to add monitoring to the ingester. (The current design would require a fair amount of reporting out)

Cons:

Adds some complication, may not be needed depending on our requirements or how fast the ingester ends up being.

Solution 2

Instead of having the ingester trigger processing, we could have an S3 watcher output new files landing to an SQS and SNS messaging system. This would allow for any number of events to be triggered when new data lands. In particular, we could update the necessary databases (detailed in solution 3, although we could use one solution and not the other), trigger processing via step functions, and output information to CloudWatch for monitoring purposes.

Pros:

Once again, this reduces responsibility for the ingester
It allows for the very easy addition or removal of processing steps or additional databases
Monitoring is made external to the ingester
It would allow us to swap out technologies easily (for example, using glue for metadata instead of opensearch (solution 4))
really easy for batching or reprocessing, batching/scaling is easier,
can store messages if processing fails or stalls

Cons:

The SQS queue might end up doing too much, having seperate queues for each process might make more sense
it's another process that might be covered by a step function, depending on how we set it up

Solution 3

Rather than having the ingester write out to the three tables, we could have one generic database updater which reads from SQS messages to update databases as needed. Right now, there are three databases (or maybe 2? Not sure if the metadata db was replaced by opensearch.) Presumably, there will be more for instrument-specific needs. At the very least, the processing jobs table would benefit from this design, since different processing jobs will be providing updates. Each processing job and the ingestor could send updates to the database to SQS. Then, these messages could be read in order by a database writer, ensuring that there is no conflict, that each message gets updated, and allowing for easy duplication if other processes need to be updated (for example, later steps in the processing pipeline could read events from the same SQS)

Pros:

decouples database from processes
provides a strict order for events so there won't be conflicts if two services are writing to the DB at the same time
any events that are written to the database might be useful to trigger other processes, so those events can be easily duplicated
removes responsibility from the ingestor, and duplicate work from the processing functions

Cons:

Additional point of failure for data loss
DynamoDB can probably handle all the applications writing and reading from it anyway

Solution 4

A full view of the system can be helpful for monitoring and general data management (for example, checking on dependencies, notifying failures, etc) It might be worth creating something to be a single source of truth for this system which can then report out as needed. Probably, this is almost entirely covered by the processing database and the metadata source, but we could add an additional service to act as a data collector, which adds a layer of abstraction between the processes that need that information and the processing database.

We could also consider using a glue crawler to manage the metadata and give a good overview of the data state.

Pros:

Not that complicated with current system
Would allow for additional sources of processing information that we might add in the future
Abstracts and decouples the processing database
Could provide monitoring and replace the process monitor (or augment it, however you want to look at it)

Cons:

If the processing database is all the info we will ever need, it's unnecessary

Solution 5

Amazon Glue for metadata creation and storage rather than manually creating metadata databases. Amazon Glue provides a crawler which can automatically generate metadata stores for data tracking.

Pros:

Runs automatically
Can run on data after it is put on the bucket, so it's seperate from the data processing pipeline

Cons:

Requires a specific Glue data store rather than using dynamoDB etc
It might not contain all the data we would like to store about a file

Potentially useful AWS things

AWS analytics services

SNS and SQS
- Two related systems for messaging
- options for guarenteed once only delivery, strict or non strict ordering, etc
- Can be duplicated and sent to multiple services
- Can have a "dead letter queue" for unprocessed or failed messages
OpenSearch
- metadata store? easy search
Lake formation
- Automates moving, collecting, securing, and cataloging data
- Start with S3 or databases, crawl the data, and set up analytics services
- Allows for tracking of metadata
- pros: helps manage IAM permissions and metadata, cons: doesn't do much else? Not a good fit
AWS Glue quality
- Automatic statistics for data quality
- Can check data quality before and after movement, through an entire pipeline
- Cons: In open preview/beta, data freshness rules of thumb might not work for our datasets, can't evalute lists/nested data sources
- Pros: You can set up your own rules or adjust the existing ones
AWS Glue
- create a central data catalog, and create ETL pipelines for loading data
- Pros: Flexible, serverless, scalable, well-integrated
- Cons: Complicated
- Glue studio: can view data and create a GUI style for managing it
- Might be able to replace or suppliment the ingester

AWS Glue:

3 major categories of features:

Discover/organize
- Search across multple data stores
- Automatically discover data using crawlers -> May be useful for first ingestion step?
- Manage schemas and permissions
transform and clean
- Drag and drop GUI
- Invoke glue jobs on a schedule, on demand, or with an event
build and monitor data pipelines.
- Autmatically scale
- Event-based triggers
- Monitor jobs

Step functions probably win for processing (more powerful, can run in parallel), but glue might be good for the initial data moving around and putting into databases (ie the ingestor process) or for cleaning data.

Crawler can automatically populate metadata tables

It can be used to extract, transform and then load the data. That might be useful for processing steps.

Needs to use a Glue Data Catalog for metadata

tl;dr MICROSERVICES

Visual Aid

Proposed_IMAP_design_changes drawio

Current design

image2022-7-1_12-40-1

maxinelasp commented 1 year ago

Additional info from discussion:

Amazon EventBridge

Events can be generated by AWS environments or by applications, or on a schedule.
- Examples of events
- Most services only have "best effort delivery", but S3 has guaranteed delivery
Pre-designed event formats
Can filter based on different event patterns
Can send one event to multiple targets based on filters and rules
Can use an SQS as a dead letter queue

EventBridge vs SQS and SNS

Comparison from AWS
EventBridge can replay and save events like SQS
It has a deadletter queue like SQS or SNS
It can fan out to multiple targets like SNS
You don't get much control over delivery type (services either have guarenteed or best effort)
Scaling is more difficult than SQS
You can filter and transform events, unlike SQS
In our case, it would replace the SQS + SNS pattern with one thing, but there are some SQS features that are not available
- Rather than a Queue, it acts as an event bus, and is more of a replacement for SNS
EventBridge AMI permissions seem complicated
No ordering of events
Unlike SQS, doesn't require copying of queues, but also doesn't have the same waiting for processing pieces
Nice direct comparison

In my opinion.... either SQS or EventBridge could work. SQS probably needs some additional set up to create messages, whereas EventBridge can automatically generate them. Most of the advantages of EventBridge don't seem like they are needed (eg schema generation) and it more acts as a replacement for SNS. While it might be useful for some aspects of our design, in my opinion SNS isn't really the right tool for the job here. (Basically, SNS is for multiple subscribers to recieve messages, and it basically just fires off into the void. Most of the pieces I propose a messaging system for would benefit from the dead letter queue to indicate if processing succeeded or not, and latency doesn't matter enough to make a difference.)

From reading through it, SQS has some advantages that EventBridge doesn't have, so unless EventBridge is significantly easier to use, I'd prefer SQS.

laspsandoval commented 1 year ago

This is what we have found from comparing EventBridge to using SNS and SQS:

SQS can be ordered, but only using FIFO. And FIFO queues and topics are incompatible with S3 event notifications.
Without FIFO SQS doesn’t reliably deliver batched files. For example if there are 5 new object events that are pushed into the SQS, when the SQS gets polled, it will return a random assortment of those files (at least one, almost never all five).
We have planned (but not implemented) the following: using EventBridge (scheduled, not event driven) and indexing the S3 objects in DynamoDB so that every S3 object created event triggers a Lambda that indexes that file in Dynamo.

tech3371 commented 1 year ago

Very detailed work. Looking forward to discussion tomorrow!

maxinelasp commented 1 year ago

Here, I will attempt to record all the things that came up in our meeting as potential concerns or things to address for future spikes. We decided that the next steps are a more specific design for different pieces of the system (so, for solutions 2, 3, and 4, more specific descriptions of what each part of that system needs to do, and recommendations for technologies to use.)

Overall, the team was interested in keeping the individual pieces of the design fairly small and specialized.
There was asome discussion into what would happen if processing failures occurred. Overall, there are two ideas here: one is to make sure the processing succeeds in as many cases as possible, by filtering out specific failures ahead of time or making sure that the data is very thoroughly checked for preparedness. This also might include having a pre-processing step which can check for data existence ahead of time and fail instead of the whole pipeline. Or, we can have the whole pipeline fail, and try and have short-circuit steps to have it fail quickly. This would mean the processing needs to be more flexible, but there is less infrastructure checking on all the data before attempting processing.
As a team, we should decide on if we should focus more on the input data, the processing data, or the output data. In particular, we should pick one segment to prioritize and design the other two around the first segment. Deciding our priorities early will help us stay focused.
There is also a decision to make between flexibility, reliability, and simplicity. In general, do we prefer solutions to be flexible? or reliable? This will inform the choices in later spikes.
There was some discussion around SQS and eventbridge, which basically boiled down to having a more batch style processing (EventBridge can reliably deliver multiple events) vs having more event driven processing (SQS is less reliable, but can be horizontally scaled.) This will need to be explored further in a later spike, but it seems like eventbridge has some advantages which will better align with the style of processing. So, in the diagram, basically replace all the SQS steps with EventBridge instead. We will still need to decide what the events actually contain (later spike)
It would probably be useful to have some kind of API on the data watcher to allow teams to check on data processing info (or at the very least let the SDC do that)
Finally, there was a philosophical question about when we do the processing. Is it better to wait until the last second, where we know we (probably) have all the data? Or attempt to process quickly, and have intentional failure points while we wait for additional dependencies?

maxinelasp commented 1 year ago

Created tickets #71, #72, #73, and #74 for additional work.

IMAP-Science-Operations-Center / sds-data-manager

SPIKE - Start discussions on Data Tracker #68

Description

Requirements

Nice to have or Goal Requirements

Additional notes

Follow up tickets

Solution 1

Pros:

Cons:

Additional notes:

Solution 2

Pros:

Cons:

Additional notes:

Summary

Solution 1

Pros:

Cons:

Solution 2

Pros:

Cons:

Solution 3

Pros:

Cons:

Solution 4

Pros:

Cons:

Solution 5

Pros:

Cons:

Potentially useful AWS things

AWS Glue:

Visual Aid

Current design

Amazon EventBridge

EventBridge vs SQS and SNS