Closed maxinelasp closed 1 year ago
@maxinelasp, There is a somewhat detailed document here for some of the designs that we can also use for reference: https://lasp.colorado.edu/galaxy/display/IMAP/SDC+Infrastructure+Detailed+Design Note this also has some information about backups in it as well.
I'm going to write some suggestions for changes we can consider for the design laid out in the design document. I am also trying to call out additional concerns that may or may not need to be addressed.
The goal isn't to have these suggestions compete, but rather to start a discussion about potential design changes or overall philosophy. I also included some notes on some AWS tools I explored that may be useful.
Right now, the file ingester serves as the entry point to the system, and also does a lot of work writing to databases and managing the data. I suggest splitting this up more to put less responsibility on the ingester, and make the system more flexible.
Instead of having the ingester trigger processing directly, we could have an S3 watcher publish notifications of newly landed files to an SQS and SNS messaging system. This would allow any number of events to be triggered when new data lands. In particular, we could update the necessary databases (detailed in solution 3, although we could use one solution and not the other), trigger processing via Step Functions, and emit information to CloudWatch for monitoring purposes.
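As a rough sketch of the S3-watcher idea: S3 event notifications arrive as a JSON payload with a `Records` list, which a small Lambda could parse and re-publish to SNS for fan-out to SQS queues. The `file_event` message schema below is hypothetical and would be pinned down in the detailed design.

```python
import json


def parse_s3_event(event: dict) -> list[tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    files = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            files.append((bucket, key))
    return files


def build_new_file_message(bucket: str, key: str) -> str:
    """Build the JSON body a watcher Lambda would publish to SNS.

    The "file_event" schema is a placeholder, not a settled format.
    """
    return json.dumps({"event_type": "file_event", "bucket": bucket, "key": key})
```

In the real system, the watcher would call `sns_client.publish()` with this body; keeping the parsing and message-building as pure functions makes them easy to unit test without AWS.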
Rather than having the ingester write out to the three tables, we could have one generic database updater which reads SQS messages and updates databases as needed. Right now there are three databases (or maybe two? I'm not sure whether the metadata DB was replaced by OpenSearch), and presumably there will be more for instrument-specific needs. At the very least, the processing jobs table would benefit from this design, since different processing jobs will be providing updates. Each processing job, and the ingester, could send its database updates to SQS. These messages could then be read in order by a single database writer, ensuring that there are no conflicts, that every message gets applied, and allowing easy duplication if other processes need the same updates (for example, later steps in the processing pipeline could read events from the same SQS queue).
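To make the single-writer idea concrete, here is a minimal sketch of what the database writer's core loop might apply, with a plain dict standing in for the processing jobs table. The `job_id`/`fields` message shape is an assumption for illustration.

```python
import json


def apply_update(table: dict, message_body: str) -> None:
    """Apply one SQS message body to a stand-in for the processing jobs table.

    A single writer consuming messages in order avoids concurrent-write
    conflicts; keying on job_id keeps replays idempotent, so a message
    redelivered by SQS is safe to apply twice.
    """
    update = json.loads(message_body)       # message shape is hypothetical
    job_id = update["job_id"]
    table.setdefault(job_id, {}).update(update["fields"])
```

In production this function would issue an `UPDATE`/`UPSERT` against the real table instead of mutating a dict, but the ordering and idempotency properties are the point of the design.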
A full view of the system can be helpful for monitoring and general data management (for example, checking on dependencies, notifying of failures, etc.). It might be worth creating something to be a single source of truth for this system which can then report out as needed. This is probably almost entirely covered by the processing database and the metadata source, but we could add an additional service to act as a data collector, adding a layer of abstraction between the processes that need that information and the processing database.
We could also consider using a Glue crawler to manage the metadata and give a good overview of the data state.
AWS Glue for metadata creation and storage, rather than manually creating metadata databases. Glue provides a crawler which can automatically generate metadata stores for data tracking.
3 major categories of features:
Step Functions probably win for processing (more powerful, can run in parallel), but Glue might be good for the initial moving of data around and loading it into databases (i.e., the ingester process) or for cleaning data.
Crawler can automatically populate metadata tables
It can be used to extract, transform, and then load the data, which might be useful for processing steps.
Needs to use a Glue Data Catalog for metadata
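For reference, setting up a crawler boils down to one `glue_client.create_crawler(**kwargs)` call pointed at an S3 path. The sketch below builds those keyword arguments; the crawler name, IAM role ARN, database name, and schedule are all placeholders, not real resources.

```python
def crawler_config(bucket: str) -> dict:
    """Keyword arguments for boto3's glue_client.create_crawler().

    All names below (role ARN, database, schedule) are hypothetical;
    the crawler would populate tables in the named Glue Data Catalog
    database by scanning the given S3 path on the cron schedule.
    """
    return {
        "Name": "sdc-data-crawler",
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "DatabaseName": "sdc_metadata",
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/"}]},
        "Schedule": "cron(0 * * * ? *)",  # hourly, as an example
    }
```

This keeps the metadata tables populated automatically as new files land, instead of us hand-maintaining a metadata database.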
tl;dr MICROSERVICES
Additional info from discussion:
In my opinion, either SQS or EventBridge could work. SQS probably needs some additional setup to create messages, whereas EventBridge can generate them automatically. Most of the advantages of EventBridge don't seem necessary (e.g. schema generation), and it acts more as a replacement for SNS. While EventBridge might be useful for some aspects of our design, in my opinion SNS isn't really the right tool for the job here. (Basically, SNS is for multiple subscribers to receive messages, and it essentially just fires them off into the void. Most of the pieces I propose a messaging system for would benefit from the dead-letter queue to indicate whether processing succeeded or not, and latency doesn't matter enough here to make a difference.)
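The dead-letter-queue behavior mentioned above is configured on the SQS side via a `RedrivePolicy` queue attribute. A minimal sketch, assuming a DLQ already exists:

```python
import json


def redrive_policy(dlq_arn: str, max_receives: int = 3) -> dict:
    """Build the RedrivePolicy attribute for sqs create_queue/set_queue_attributes.

    After max_receives failed receives of a message, SQS moves it to the
    dead-letter queue, giving us visibility into updates that were never
    successfully processed. The max_receives default here is arbitrary.
    """
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receives)}
        )
    }
```

These attributes would be passed as `Attributes=` to `sqs_client.create_queue()`; EventBridge, by contrast, only gets this behavior by targeting an SQS queue anyway.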
From reading through it, SQS has some advantages that EventBridge doesn't have, so unless EventBridge is significantly easier to use, I'd prefer SQS.
This is what we have found from comparing EventBridge to using SNS and SQS:
Very detailed work. Looking forward to discussion tomorrow!
Here, I will attempt to record all the things that came up in our meeting as potential concerns or things to address for future spikes. We decided that the next steps are a more specific design for different pieces of the system (so, for solutions 2, 3, and 4, more specific descriptions of what each part of that system needs to do, and recommendations for technologies to use.)
Created tickets #71, #72, #73, and #74 for additional work.
Description
Create a document describing some of the goals and requirements for a system to track how data moves throughout AWS. This should, for example, be able to indicate when data is ready for processing, when processing is complete and the data is ready for archiving, and other states like these.
Requirements
Nice to have or Goal Requirements
Additional notes
The end result of the meeting should be an overall design doc for the philosophy and initial ideas for the design of the data manager.
Follow-up tickets
This spike is not considered complete until at least one follow-up issue is created.
Below is the template for the response to this ticket. Add as many solutions as needed, but preferably include 2-5 for discussion. The response should be posted as a comment on this issue, or linked in a comment.
Solution 1
Write an overview of the solution here.
Pros:
Cons:
Additional notes:
Solution 2
Write an overview of the solution here.
Pros:
Cons:
Additional notes:
Summary
Write up a summary of your findings, including your preferred solution.