IMAP-Science-Operations-Center / sds-data-manager

MIT License

SPIKE - Infra design 2: process tracker and updater #72

Closed maxinelasp closed 9 months ago

maxinelasp commented 1 year ago

Description

We need some system to watch the processing steps and keep track of data in the system. This is solution 3 in #68. This spike should expand on a few ideas for tracking processing and give a few examples of how that system might be used, such as in an API to track data processing.

Requirements

Nice to have or Goal Requirements

Additional notes

Existing design

The existing design is a simple relational database, with one row corresponding to each file. A row is created when the file lands in S3 and is updated with each step in the processing. It keeps track of where the processing is, and is used for reprocessing if the system hangs on a file. There is an intermediate process updater which organizes updates to the database, and an external "data watcher" system which monitors and reads from the database.

Related tickets

Follow up tickets

This spike is not considered complete until at least one follow up issue is created.

Below is the template for the response to this ticket. Add as many solutions as needed, but preferably include 2-5 for discussion. The response should be posted as a comment on this issue, or linked in a comment.

Solution 1

Write an overview of the solution here.

Pros:

Cons:

Additional notes:

Solution 2

Write an overview of the solution here.

Pros:

Cons:

Additional notes:

Summary

Write up a summary of your findings, including your preferred solution.

tech3371 commented 1 year ago

I can take this too.

maxinelasp commented 11 months ago

Apologies for the wall of text. This was written by Tenzin, Sean, and myself. I will schedule a meeting for further discussion to answer the general requirement questions. Feel free to add any thoughts you have before our meeting as well.

General things to discuss

  1. What do we want out of the data tracker system and database?
  2. What do we want out of the metadata system and database?
  3. What are the similarities and differences between the systems? Can we combine them, or coordinate between them?
  4. What are the access patterns for the data tracker system? internal and external
  5. What are the access patterns for metadata information? internal and external
  6. What kind of statistics do we want for the SDC? What statistics might be useful for external viewers? (EMM statistics page)

Requirements For processing database (By Maxine)

The processing database should store information about the status of processing data. Here are some ways we might use this table:

  1. Processing jobs which depend on other files can access the status of those files via the processing database. This is probably the number one way we will use this database.
  2. It is possible that our triggering mechanism will check this database before triggering a step in the pipeline. This is a very similar use case to 1.
  3. For monitoring data processing, we can use the database to check across multiple search terms (for example, we can query if all L1b processing is completed for all instruments).
  4. For monitoring data processing, we can track the time it takes for each processing step. This will help us fix inefficient steps or check for failing jobs.
  5. (Maybe) tracking file dependencies to allow for reprocessing.

The main access pattern is probably going to be by a specific file, or by a specific time, instrument, and level. (To generate the file name we need that information as well, so dividing keys based on those three attributes may make sense.)
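Since the key fields can be derived from the file name itself, a small sketch of that derivation (the filename convention shown is an assumption for illustration, not the decided SDC convention):

```python
import re

# Hypothetical filename convention, for illustration only; the real SDC
# convention may differ: imap_<instrument>_<level>_<YYYYMMDD>_v<NNN>.cdf
FILENAME_RE = re.compile(
    r"imap_(?P<instrument>[a-z]+)_(?P<level>l\d[a-z]?)_"
    r"(?P<date>\d{8})_v(?P<version>\d{3})\.cdf"
)

def key_fields(filename: str) -> dict:
    """Split a science filename into the instrument/level/time fields
    that would drive the database keys."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    return match.groupdict()

print(key_fields("imap_mag_l1b_20260101_v001.cdf"))
# {'instrument': 'mag', 'level': 'l1b', 'date': '20260101', 'version': '001'}
```

If the keys can always be recovered this way, the same lookup works whether the caller starts from a file name or from an (instrument, level, time) triple.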

Primary key: instrument or level
Secondary key: instrument or level
Secondary key: time

We want the keys to divide the data relatively easily. If there is a significant difference in the number of processing steps or files for different instruments, we should do something else.

Should we delete older processing information from the database? This would mean we would need to check the metadata store in some cases, but most likely the only relevant info from this processing database is going to be very recent.

Why can't we just see if files exist instead of using this database? A database can capture time ranges or multiple instruments, and can track timing and store averages from the past month, etc.

Requirements for metadata database (By Maxine)

The metadata database should include general file information for each file which is uploaded to AWS. This should include things like the upload source, the timestamps, the mission tags, etc.

This database is used for the following things:

  1. To check on specific files to get specific information on them (such as size)
  2. To trigger jobs based on files landing in the metadata DB?
  3. To get general information on averages across all files? (size?)

Can we combine metadata and processing databases into one data file tracking database and system? i.e. database contains: file metadata, processing status, maybe dependencies, file instrument/level/time
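To make the combined-record idea concrete, a sketch of a single record holding both file metadata and processing status (all field names and status values here are illustrative assumptions, not a decided schema):

```python
from dataclasses import dataclass, field, asdict

# Illustrative only: field names and statuses are assumptions, not a decided schema.
@dataclass
class FileRecord:
    # Key fields (also derivable from the file name)
    instrument: str
    level: str
    start_time: str          # e.g. "20260101"
    # File metadata
    s3_key: str = ""
    size_bytes: int = 0
    upload_source: str = ""
    # Processing status
    status: str = "PENDING"  # PENDING / IN_PROGRESS / COMPLETE / FAILED
    dependencies: list = field(default_factory=list)

record = FileRecord(
    instrument="mag", level="l1b", start_time="20260101",
    s3_key="imap/mag/l1b/imap_mag_l1b_20260101_v001.cdf",
    status="COMPLETE",
)
print(asdict(record)["status"])
```

One record per file like this would serve both the "what is this file's status?" and the "what is this file's metadata?" questions from a single table.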

Solution 1 - DynamoDB (written by Tenzin)

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and flexible storage for applications. It offers seamless scalability and high availability while automatically handling the complexities of hardware provisioning, setup, and maintenance. DynamoDB supports both document and key-value data models, making it suitable for a wide range of applications. With features like global tables for cross-region replication, fine-grained access control, and on-demand scaling, DynamoDB empowers developers to build responsive and reliable applications that can handle variable workloads and evolving data requirements. It's an essential tool for modern cloud-based applications that require efficient data storage and retrieval. (ChatGPT)
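As a rough sketch of what a processing-status item and query could look like under the key scheme discussed above (table, attribute names, and the composite sort key are all assumptions), built as the plain payloads that would be handed to boto3:

```python
# Sketch only: attribute names and the composite sort key are assumptions.
# In practice these payloads would be handed to boto3, e.g.
#   table = boto3.resource("dynamodb").Table("processing-status")
#   table.put_item(Item=item) / table.query(**query_kwargs)
item = {
    "instrument": "mag",              # partition key
    "level_time": "l1b#20260101",     # composite sort key: level + start time
    "status": "COMPLETE",
    "s3_key": "imap/mag/l1b/imap_mag_l1b_20260101_v001.cdf",
}

# Query: all L1b items for one instrument, newest first, via a sort-key prefix.
query_kwargs = {
    "KeyConditionExpression": "instrument = :inst AND begins_with(level_time, :lvl)",
    "ExpressionAttributeValues": {":inst": "mag", ":lvl": "l1b#"},
    "ScanIndexForward": False,
}
print(query_kwargs["ExpressionAttributeValues"][":lvl"])
```

Note the limitation this illustrates: DynamoDB queries are fast along the chosen keys, but a question like "all L1b across all instruments" would need a scan or a global secondary index.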

Pros:

Cons:

Additional notes:

Solution 2 - OpenSearch (written by Tenzin and Sean)

OpenSearch is an open-source, distributed search and analytics engine derived from Elasticsearch. It offers powerful full-text search capabilities, real-time data indexing, and advanced analytics features. Designed for scalability and high performance, OpenSearch can process and analyze vast amounts of data across diverse sources. With a rich ecosystem of plugins and integrations, it's adaptable to various use cases, from log and event analysis to application monitoring. OpenSearch provides organizations with the ability to extract valuable insights from their data and build applications that require fast, reliable, and customizable search functionality. (ChatGPT)
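For a feel of the query side, a sketch of an OpenSearch query DSL body (index and field names are assumptions) for a question like "is any L1b processing for this day still incomplete?"; it would be sent with opensearch-py's `client.search(index=..., body=query)`:

```python
# Sketch only: field names are assumptions, not a decided schema.
# This body would be passed to opensearch-py:
#   client.search(index="processing-status", body=query)
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "l1b"}},
                {"range": {"start_time": {"gte": "2026-01-01", "lte": "2026-01-01"}}},
            ],
            # Exclude finished files; any hit is a still-incomplete L1b file.
            "must_not": [{"term": {"status": "COMPLETE"}}],
        }
    },
    "size": 10,
}
print(len(query["query"]["bool"]["filter"]))
```

An empty hit list would mean every L1b file for that day is complete, which maps directly onto monitoring use case 3 above.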

Pros:

Cons:

Additional notes:

Solution 3 - Relational database (RDS or Athena) (Written by Maxine)

A relational database is the classic SQL style of database. It has a heavily structured schema, and it is difficult to update the data structure. However, it provides powerful tools for filtering, sorting, and combining structured data.
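A minimal sketch of the processing table and the "is all L1b done for this day?" query, using an in-memory SQLite database as a stand-in for RDS (schema and column names are assumptions):

```python
import sqlite3

# In-memory SQLite stands in for RDS here; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processing (
        instrument TEXT NOT NULL,
        level      TEXT NOT NULL,
        start_time TEXT NOT NULL,   -- YYYYMMDD
        status     TEXT NOT NULL,   -- PENDING / IN_PROGRESS / COMPLETE / FAILED
        PRIMARY KEY (instrument, level, start_time)
    )
""")
conn.executemany(
    "INSERT INTO processing VALUES (?, ?, ?, ?)",
    [("mag", "l1b", "20260101", "COMPLETE"),
     ("swe", "l1b", "20260101", "IN_PROGRESS")],
)

# How many L1b files for this day are NOT yet complete?
(pending,) = conn.execute(
    "SELECT COUNT(*) FROM processing "
    "WHERE level = 'l1b' AND start_time = '20260101' AND status != 'COMPLETE'"
).fetchone()
print(pending)  # 1
```

Cross-instrument queries like this one fall out naturally in SQL, with no need to pick the access pattern up front the way a DynamoDB key schema requires.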

Pros:

Cons:

Additional notes:

If we aren't sure what the access pattern will look like, or if the data will be accessed in many different ways for different purposes, a relational database can be more flexible than DynamoDB, though it is more structured than OpenSearch.

Solution 4 - DocumentDB (Written by Maxine)

"A document database is a type of nonrelational database that is designed to store and query data as JSON-like documents. Document databases make it easier for developers to store and query data in a database by using the same document-model format they use in their application code. The flexible, semistructured, and hierarchical nature of documents and document databases allows them to evolve with applications’ needs. The document model works well with use cases such as catalogs, user profiles, and content management systems where each document is unique and evolves over time. Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents." - AWS DocumentDB docs

Pros:

Cons:

Summary

Tenzin: I think DynamoDB seems like the right choice for this. It is supported by a large community and is easy to set up and manage. It supports the scalability and flexibility that we need. It's easy to read and write data from DynamoDB, and it is easy to integrate with other tools.

Maxine: I think if we determine the proper requirements and scope for how we want to use these data stores and what the access patterns will look like, it will become obvious which option is the easiest and most correct. These options align with different styles of data organization quite well, so if we determine our access patterns, they will probably only line up with one of these options. It is likely that OpenSearch will be applicable to all the different access patterns because it is so flexible, but it also has some drawbacks. It can be a good option if we really want that flexibility.

greglucas commented 11 months ago

I really appreciate the documentation and summary of all of this! I think this is really good for discussion. Here are some quick thoughts and additional questions I had.

Can we combine metadata and processing databases into one data file tracking database and system? i.e. database contains: file metadata, processing status, maybe dependencies, file instrument/level/time

I think this is a fantastic question and one I have also wondered about. We should think about what metadata means to us and what metadata we need for different tasks: is it file metadata, processing metadata, or something else...?

What questions do you need to ask the database vs. what questions can you ask the s3 bucket/object? Should we store the file size and all of that in a DB table, or can we ask the database what files match our criteria and then go get all that metadata from the s3 objects themselves? I agree that figuring out what questions we plan to ask is critical to the design decision here.

When do we plan on writing/updating to these tables? Do we write to the table when a processing job starts adding rows for the files it is expecting to produce or is there only one row for the processing job itself, maybe a field that could be updated with "products_created" or something like that instead of multiple rows?

What/Who is going to be updating these tables? Do we anticipate the Step Function to be updating the table, the Lambda jobs, an s3 bucket event? I'm not sure this will really matter in the database design, but it might be useful in thinking about what we want to store within the tables based on what information we have within these different services.

maxinelasp commented 9 months ago

Closing this issue, as we determined that this will be handled using RDS and possibly instrument-specific tables, along with CloudWatch #185 #173