IMAP-Science-Operations-Center / sds-data-manager

MIT License

SPIKE - Infra design 2: process tracker and updater #72

Closed maxinelasp closed 9 months ago

maxinelasp commented 1 year ago

Description

We need some system to watch the processing steps and keep track of data in the system. This is solution 3 in #68. This spike should expand on a few ideas for tracking processing and give a few examples of how that system might be used, such as in an API to track data processing.

Requirements

Nice to have or Goal Requirements

Additional notes

Existing design

The existing design is a simple relational database, with one row corresponding to each file. A row is created when the file lands in S3 and is updated with each step in the processing. It keeps track of where the processing is, and is used for reprocessing if the system hangs on a file. There is an intermediate process updater which organizes updates to the database, and an external "data watcher" system which monitors and reads from the database.

Related tickets

Follow up tickets

This spike is not considered complete until at least one follow up issue is created.

Below is the template for the response to this ticket. Add as many solutions as needed, but preferably include 2-5 for discussion. The response should be posted as a comment on this issue, or linked in a comment.

Solution 1

Write an overview of the solution here.

Pros:

Cons:

Additional notes:

Solution 2

Write an overview of the solution here.

Pros:

Cons:

Additional notes:

Summary

Write up a summary of your findings, including your preferred solution.

tech3371 commented 1 year ago

I can take this too.

maxinelasp commented 11 months ago

Apologies for the wall of text. This was written by Tenzin, Sean, and myself. I will schedule a meeting for further discussion to answer the general requirement questions. Feel free to add any thoughts you have before our meeting as well.

General things to discuss

  1. What do we want out of the data tracker system and database?
  2. What do we want out of the metadata system and database?
  3. What are the similarities and differences between the systems? Can we combine them, or coordinate between them?
  4. What are the access patterns for the data tracker system? internal and external
  5. What are the access patterns for metadata information? internal and external
  6. What kind of statistics do we want for the SDC? What statistics might be useful for external viewers? (EMM statistics page)

Requirements For processing database (By Maxine)

The processing database should store information about the status of processing data. Here are some ways we might use this table:

  1. Processing jobs which depend on other files can access the status of those files via the processing database. This is probably the number one way we will use this database.
  2. It is possible that our triggering mechanism will check this database before triggering a step in the pipeline. This is a very similar use case to 1.
  3. For monitoring data processing, we can use the database to check across multiple search terms (for example, we can query if all L1b processing is completed for all instruments).
  4. For monitoring data processing, we can track the time it takes for each processing step. This will help us fix inefficient steps or check for failing jobs.
  5. (Maybe) tracking file dependencies to allow for reprocessing.

The main access pattern is probably going to be by a specific file, or by a specific time, instrument, and level. (To generate the file name we need that information as well, so dividing keys based on those three attributes may make sense.)
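Since the key fields can be derived from the file name itself, a small sketch of that derivation (the filename convention shown is an assumption for illustration, not the decided SDC convention):

```python
import re

# Hypothetical filename convention, for illustration only; the real SDC
# convention may differ: imap_<instrument>_<level>_<YYYYMMDD>_v<NNN>.cdf
FILENAME_RE = re.compile(
    r"imap_(?P<instrument>[a-z]+)_(?P<level>l\d[a-z]?)_"
    r"(?P<date>\d{8})_v(?P<version>\d{3})\.cdf"
)

def key_fields(filename: str) -> dict:
    """Split a science filename into the instrument/level/time fields
    that would drive the database keys."""
    match = FILENAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized filename: {filename}")
    return match.groupdict()

print(key_fields("imap_mag_l1b_20260101_v001.cdf"))
# {'instrument': 'mag', 'level': 'l1b', 'date': '20260101', 'version': '001'}
```

If the keys can always be recovered this way, the same lookup works whether the caller starts from a file name or from an (instrument, level, time) triple.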

Primary key: instrument or level
Secondary key: instrument or level
Secondary key: time

We want the keys to divide the data relatively easily. If there is a significant difference in the number of processing steps or files for different instruments, we should do something else.

Should we delete older processing information from the database? This would mean we would need to check the metadata store in some cases, but most likely the only relevant info from this processing database is going to be very recent.

Why can't we just see if files exist instead of using this database? A database can capture time ranges or multiple instruments, and can track timing and store averages from the past month, etc.

Requirements for metadata database (By Maxine)

The metadata database should include general file information for each file which is uploaded to AWS. This should include things like the upload source, the timestamps, the mission tags, etc.

This database is used for the following things:

  1. To check on specific files to get specific information on them (such as size)
  2. To trigger jobs based on files landing in the metadata DB?
  3. To get general information on averages across all files? (size?)

Can we combine metadata and processing databases into one data file tracking database and system? i.e. database contains: file metadata, processing status, maybe dependencies, file instrument/level/time
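To make the combined-record idea concrete, a sketch of a single record holding both file metadata and processing status (all field names and status values here are illustrative assumptions, not a decided schema):

```python
from dataclasses import dataclass, field, asdict

# Illustrative only: field names and statuses are assumptions, not a decided schema.
@dataclass
class FileRecord:
    # Key fields (also derivable from the file name)
    instrument: str
    level: str
    start_time: str          # e.g. "20260101"
    # File metadata
    s3_key: str = ""
    size_bytes: int = 0
    upload_source: str = ""
    # Processing status
    status: str = "PENDING"  # PENDING / IN_PROGRESS / COMPLETE / FAILED
    dependencies: list = field(default_factory=list)

record = FileRecord(
    instrument="mag", level="l1b", start_time="20260101",
    s3_key="imap/mag/l1b/imap_mag_l1b_20260101_v001.cdf",
    status="COMPLETE",
)
print(asdict(record)["status"])
```

One record per file like this would serve both the "what is this file's status?" and the "what is this file's metadata?" questions from a single table.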

Solution 1 - DynamoDB (written by Tenzin)

Amazon DynamoDB is a fully managed NoSQL database service that provides fast and flexible storage for applications. It offers seamless scalability and high availability while automatically handling the complexities of hardware provisioning, setup, and maintenance. DynamoDB supports both document and key-value data models, making it suitable for a wide range of applications. With features like global tables for cross-region replication, fine-grained access control, and on-demand scaling, DynamoDB empowers developers to build responsive and reliable applications that can handle variable workloads and evolving data requirements. It's an essential tool for modern cloud-based applications that require efficient data storage and retrieval. (ChatGPT)
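As a rough sketch of what a processing-status item and query could look like under the key scheme discussed above (table, attribute names, and the composite sort key are all assumptions), built as the plain payloads that would be handed to boto3:

```python
# Sketch only: attribute names and the composite sort key are assumptions.
# In practice these payloads would be handed to boto3, e.g.
#   table = boto3.resource("dynamodb").Table("processing-status")
#   table.put_item(Item=item) / table.query(**query_kwargs)
item = {
    "instrument": "mag",              # partition key
    "level_time": "l1b#20260101",     # composite sort key: level + start time
    "status": "COMPLETE",
    "s3_key": "imap/mag/l1b/imap_mag_l1b_20260101_v001.cdf",
}

# Query: all L1b items for one instrument, newest first, via a sort-key prefix.
query_kwargs = {
    "KeyConditionExpression": "instrument = :inst AND begins_with(level_time, :lvl)",
    "ExpressionAttributeValues": {":inst": "mag", ":lvl": "l1b#"},
    "ScanIndexForward": False,
}
print(query_kwargs["ExpressionAttributeValues"][":lvl"])
```

Note the limitation this illustrates: DynamoDB queries are fast along the chosen keys, but a question like "all L1b across all instruments" would need a scan or a global secondary index.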

Pros:

Cons:

Additional notes:

Solution 2 - OpenSearch (written by Tenzin and Sean)

OpenSearch is an open-source, distributed search and analytics engine derived from Elasticsearch. It offers powerful full-text search capabilities, real-time data indexing, and advanced analytics features. Designed for scalability and high performance, OpenSearch can process and analyze vast amounts of data across diverse sources. With a rich ecosystem of plugins and integrations, it's adaptable to various use cases, from log and event analysis to application monitoring. OpenSearch provides organizations with the ability to extract valuable insights from their data and build applications that require fast, reliable, and customizable search functionality. (ChatGPT)
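For a feel of the query side, a sketch of an OpenSearch query DSL body (index and field names are assumptions) for a question like "is any L1b processing for this day still incomplete?"; it would be sent with opensearch-py's `client.search(index=..., body=query)`:

```python
# Sketch only: field names are assumptions, not a decided schema.
# This body would be passed to opensearch-py:
#   client.search(index="processing-status", body=query)
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "l1b"}},
                {"range": {"start_time": {"gte": "2026-01-01", "lte": "2026-01-01"}}},
            ],
            # Exclude finished files; any hit is a still-incomplete L1b file.
            "must_not": [{"term": {"status": "COMPLETE"}}],
        }
    },
    "size": 10,
}
print(len(query["query"]["bool"]["filter"]))
```

An empty hit list would mean every L1b file for that day is complete, which maps directly onto monitoring use case 3 above.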

Pros:

Cons:

Additional notes:

Solution 3 - Relational database (RDS or Athena) (Written by Maxine)

A relational database is the classic SQL style of database. It has a heavily structured schema, and it is difficult to update the data structure. However, it provides powerful tools for filtering, sorting, and combining structured data.
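A minimal sketch of the processing table and the "is all L1b done for this day?" query, using an in-memory SQLite database as a stand-in for RDS (schema and column names are assumptions):

```python
import sqlite3

# In-memory SQLite stands in for RDS here; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE processing (
        instrument TEXT NOT NULL,
        level      TEXT NOT NULL,
        start_time TEXT NOT NULL,   -- YYYYMMDD
        status     TEXT NOT NULL,   -- PENDING / IN_PROGRESS / COMPLETE / FAILED
        PRIMARY KEY (instrument, level, start_time)
    )
""")
conn.executemany(
    "INSERT INTO processing VALUES (?, ?, ?, ?)",
    [("mag", "l1b", "20260101", "COMPLETE"),
     ("swe", "l1b", "20260101", "IN_PROGRESS")],
)

# How many L1b files for this day are NOT yet complete?
(pending,) = conn.execute(
    "SELECT COUNT(*) FROM processing "
    "WHERE level = 'l1b' AND start_time = '20260101' AND status != 'COMPLETE'"
).fetchone()
print(pending)  # 1
```

Cross-instrument queries like this one fall out naturally in SQL, with no need to pick the access pattern up front the way a DynamoDB key schema requires.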

Pros:

Cons:

Additional notes:

If we aren't sure what the access pattern will look like, or if the data will be accessed in many different ways for different purposes, a relational database can be more flexible than DynamoDB, though it is more structured than OpenSearch.

Solution 4 - DocumentDB (Written by Maxine)

"A document database is a type of nonrelational database that is designed to store and query data as JSON-like documents. Document databases make it easier for developers to store and query data in a database by using the same document-model format they use in their application code. The flexible, semistructured, and hierarchical nature of documents and document databases allows them to evolve with applications’ needs. The document model works well with use cases such as catalogs, user profiles, and content management systems where each document is unique and evolves over time. Document databases enable flexible indexing, powerful ad hoc queries, and analytics over collections of documents." - AWS DocumentDB docs

Pros:

Cons:

Summary

Tenzin: I think DynamoDB seems like the right choice for this. It is supported by a large community and is easy to set up and manage. It supports the scalability and flexibility that we need. It's easy to read and write data from DynamoDB, and it is easy to integrate with other tools.

Maxine: I think if we determine the proper requirements and scope for how we want to use these data stores and what the access patterns will look like, it will become obvious which option is the easiest and most correct. These options align with different styles of data organization quite well, so if we determine our access patterns, they will probably only line up with one of these options. It is likely that OpenSearch will be applicable to all the different access patterns because it is so flexible, but it also has some drawbacks. It can be a good option if we really want that flexibility.

greglucas commented 11 months ago

I really appreciate the documentation and summary of all of this! I think this is really good for discussion. Here are some quick thoughts and additional questions I had.

Can we combine metadata and processing databases into one data file tracking database and system? i.e. database contains: file metadata, processing status, maybe dependencies, file instrument/level/time

I think this is a fantastic question and one I have also wondered about. We should think about what metadata means to us and what metadata we need for different tasks: is it file metadata, processing metadata, or something else...?

What questions do you need to ask the database vs. what questions can you ask the s3 bucket/object? Should we store the file size and all of that in a DB table, or can we ask the database what files match our criteria and then go get all that metadata from the s3 objects themselves? I agree that figuring out what questions we plan to ask is critical to the design decision here.

When do we plan on writing/updating to these tables? Do we write to the table when a processing job starts adding rows for the files it is expecting to produce or is there only one row for the processing job itself, maybe a field that could be updated with "products_created" or something like that instead of multiple rows?

What/Who is going to be updating these tables? Do we anticipate the Step Function to be updating the table, the Lambda jobs, an s3 bucket event? I'm not sure this will really matter in the database design, but it might be useful in thinking about what we want to store within the tables based on what information we have within these different services.

maxinelasp commented 9 months ago

Closing this issue, as we determined that this will be handled using RDS and possibly instrument-specific tables, along with CloudWatch #185 #173