Re-structure Capella Bucket=>Scope=>Collection configuration

gopa-noaa commented 1 month ago

No change to the bucket. 3 scopes (development, integration, production) and 3 collections under each, currently just METAR, RAOB, and COMMON.

randytpierce commented 1 month ago

This implies that the metadata moves from the METAR collection to the COMMON collection, and that the METAR collection will only hold type "DD" documents (likewise for the RAOB collection). This will require code changes to ingest, the metadata scripts, and the client. randy

On Wed, May 29, 2024 at 11:06 AM Gopa @.***> wrote:

No change to bucket, 3 scopes , development, integration, production, and 2 scopes under each, currently just METAR and COMMON.



gopa-noaa commented 1 month ago

From some quick Googling, it looks like a scope cannot be renamed after it is created. I have sent an email to Couchbase ... Worst case, we can do the following:

Create a new "development" scope
Create a new "METAR" collection under it
Re-configure our XDCR to vxdata=>development=>METAR
Wait for the data sync to complete
Delete the original _default=>METAR

ian-noaa commented 6 days ago

A couple of other questions:

Would it make sense to move more document types out into their own collections? I think we currently have MD (Metadata), DD (Data Document), and JOB/JOB-TEST documents. Are there other document types that would make sense to put in their own collections?
Could our document types be replaced by using collections more? If they are useful, when does it make sense to have a collection vs a document type field? E.g. - if the METAR collection solely contains type=DD documents, I could see dropping the type field unless there are reasons clients need to track that type.
Should the JOB-TEST docs be renamed to JOB and left in a "test" scope?
Does it make sense for the scorecard to be its own scope or does it make sense to be scoped with the rest of vxdata?
Can we XDCR at a scope level instead of a collection?

ian-noaa commented 4 days ago

To summarize the discussion from the dev meeting:

We decided we need to move this issue up and address how best to use collections, scopes, and buckets for our project & application.

We would like to come up with some use cases & whiteboard through how key parts of the application lifecycle would work with different data models. Ideally this would happen during the ingest meeting.

During the meeting we

debated what would go into a common collection. The point was made that common is pretty generic (like default) and it could be better to have explicit & meaningful names to describe the data that collections hold so that we don't end up with a grab bag of data. However, we’re unsure of the performance tradeoffs of multiple collections.
called out that we will need scripts or SDK calls to create our DB schema if it becomes more complicated.
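As a starting point for the schema scripts mentioned above, here is a minimal sketch that only generates the N1QL DDL for the layout proposed earlier in this thread (bucket `vxdata`, three scopes, three collections each). The function name `schema_statements` is made up for illustration, and the `IF NOT EXISTS` clause assumes a server version that supports it; drop it otherwise. Nothing here talks to a cluster, so the output would still need to be run through the query workbench, `cbq`, or an SDK.

```python
"""Sketch: generate N1QL DDL for the proposed scope/collection layout."""

# Names taken from this thread; adjust if the layout changes.
BUCKET = "vxdata"
SCOPES = ["development", "integration", "production"]
COLLECTIONS = ["METAR", "RAOB", "COMMON"]


def schema_statements(bucket, scopes, collections):
    """Return CREATE SCOPE / CREATE COLLECTION statements.

    Each scope is emitted before its collections, because a collection
    cannot be created in a scope that does not exist yet.
    """
    statements = []
    for scope in scopes:
        statements.append(f"CREATE SCOPE `{bucket}`.`{scope}` IF NOT EXISTS")
        for collection in collections:
            statements.append(
                f"CREATE COLLECTION `{bucket}`.`{scope}`.`{collection}` IF NOT EXISTS"
            )
    return statements


if __name__ == "__main__":
    for stmt in schema_statements(BUCKET, SCOPES, COLLECTIONS):
        print(stmt + ";")
```

Keeping the layout in two plain lists means adding a collection later is a one-line change, and the same script could be pointed at each cluster that needs the schema.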

Information needed

What are collections, scopes, and buckets? What are their use cases?
How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?
It'd be useful to get a list of the Types, DocTypes, and Subsets we have in our documents and an idea of how we are using them. @randytpierce and @gopa-noaa may have the best input here.
Can we use collections/scopes/buckets to obviate some of the above fields (type, docType, and subset) in our documents? And do we want to? (I suspect no, to support our archiving & retrieval use case)
What use cases should we explore to ensure we have thought the DB schema through? This is something @ian-noaa, @randytpierce & @gopa-noaa should consider by the vxingest meeting. Off the top of my head I have:
- Ingesting data via cron, for various data types if relevant
- Ingesting data via event, for various data types if relevant
- Expiring data
- Retrieving archived data
- Querying data from MATS
Where does the scorecard fit into this? Should the data be stored in a separate bucket, scope, or collection?

Context

Couchbase Server 7 (released in 2021) introduced Scopes & Collections. Previously it was recommended to put all data in a “Bucket” and distinguish the documents with a type field. It appears scopes are recommended for data isolation (prod/dev environments, introducing schema changes, etc…) and collections are intended as a replacement for the previously recommended “type” field.
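To make the "collection instead of type field" idea concrete, here is a rough sketch of how a data-document query might look under each model. The helper names are hypothetical, the field name `type` and value `DD` come from this thread, and the statements are only built as strings (with a named parameter for the type), not executed.

```python
"""Sketch: the same lookup under the type-field model and the collection model."""


def query_with_type_field(bucket, collection, doc_type):
    """Current style: one keyspace, documents told apart by a type field."""
    statement = (
        f"SELECT d.* FROM `{bucket}`.`_default`.`{collection}` AS d "
        "WHERE d.type = $doc_type"
    )
    return statement, {"doc_type": doc_type}


def query_with_collection(bucket, scope, collection):
    """Collection style: the keyspace path alone selects the documents."""
    return f"SELECT d.* FROM `{bucket}`.`{scope}`.`{collection}` AS d"


if __name__ == "__main__":
    print(query_with_type_field("vxdata", "METAR", "DD"))
    print(query_with_collection("vxdata", "production", "METAR"))
```

In the second form the `WHERE` clause disappears only if the collection really holds a single kind of document, which is the tradeoff we still need to decide on.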

gopa-noaa commented 1 day ago

This link explains Collections and Scope: https://docs.couchbase.com/server/current/learn/data/scopes-and-collections.html

Just noting down some salient points below:

A collection is a data container. Up to 1000 collections can be created per cluster. A collection can be indexed; and it can be dropped. The data in a collection can be replicated, by means of XDCR.

A scope is a mechanism for the grouping of multiple collections. Up to 1000 scopes can be created per cluster. A scope can be dropped. A scope cannot be indexed. The contents of a scope can be replicated, by means of XDCR.

Benefits of Scopes and Collections

The benefits of scopes and collections include:

The logical grouping of similar documents; potentially simplifying operations such as query, XDCR, and backup and restore.

The increased efficiency of indexing, due to the Data Service being able to provide documents from specific collections to the Index Service.

Simplified querying, since query statements are able to easily specify particular subsets of documents.

Easier migration from relational databases to Couchbase Server, since collections can be designed to correspond to pre-existing relational tables.

Secure isolation of different document-types, within a bucket; allowing applications to be specifically authorized to use only their appropriate subsets of data (see Access to Scopes and Collections, below).

This should help give us some guidance in organizing our document hierarchy. Let's plan to discuss further.

ian-noaa commented 1 day ago

Thanks, Gopa! That makes it sound like it would be beneficial to explore using collections more.

2. How do collections, scopes, and buckets interact with XDCR & Time-To-Live fields?

TTL fields

Couchbase can have a default TTL set on buckets and collections but not scopes. You can also use the SDK to set TTL individually for each document. If we went the second route, having the import process be in charge of setting TTL values would seem to make sense.
See Couchbase's Data Expiration docs.
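If the import process ends up owning TTL, a sketch along these lines might work. The retention table and both helper names are hypothetical (we have not agreed on any retention periods), and `collection` is assumed to be a Couchbase Python SDK collection object. The only SDK pieces used are `collection.upsert` and `UpsertOptions(expiry=...)`, imported lazily so the policy function can be exercised without the SDK installed.

```python
"""Sketch: per-document TTL chosen by the ingest code (hypothetical policy)."""
from datetime import timedelta

# Hypothetical retention policy keyed by the document "type" field.
# Types not listed here (e.g. metadata) are written without an expiry.
RETENTION_BY_TYPE = {
    "DD": timedelta(days=365),
}


def expiry_for(doc):
    """Return the TTL for a document, or None if it should never expire."""
    return RETENTION_BY_TYPE.get(doc.get("type"))


def upsert_with_ttl(collection, key, doc):
    """Upsert a document, attaching an expiry only when the policy has one."""
    expiry = expiry_for(doc)
    if expiry is None:
        return collection.upsert(key, doc)
    # Imported here so the policy above stays usable without the SDK.
    from couchbase.options import UpsertOptions

    return collection.upsert(key, doc, UpsertOptions(expiry=expiry))
```

A bucket- or collection-level default TTL would still apply to anything written without an explicit expiry, so the two approaches would need to be reconciled before relying on "no expiry" for metadata.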

XDCR

XDCR is configured at the bucket level. However, filtering can be applied to map data to different collections or exclude collections/documents.
XDCR will not automatically create scopes and collections. Scopes & Collections must be preconfigured on each DB cluster.
See XDCR with Scopes & Collections
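Since XDCR will not create scopes or collections for us, a pre-flight check before turning replication on could be useful. The sketch below is pure Python and assumes each cluster's layout has already been gathered into a dict of scope name to set of collection names; the function name `missing_on_target` is made up for illustration.

```python
"""Sketch: find scope.collection pairs that exist on the source but not the target."""


def missing_on_target(source_layout, target_layout):
    """Return sorted 'scope.collection' names present on source but not on target.

    Both arguments map a scope name to a set of collection names.
    """
    missing = []
    for scope, collections in source_layout.items():
        target_collections = target_layout.get(scope, set())
        for collection in collections:
            if collection not in target_collections:
                missing.append(f"{scope}.{collection}")
    return sorted(missing)


if __name__ == "__main__":
    source = {"development": {"METAR", "RAOB", "COMMON"}}
    target = {"development": {"METAR"}}
    for name in missing_on_target(source, target):
        print("needs to be created on target:", name)
```

I believe the layouts could be built from the SDK's collection manager (`bucket.collections().get_all_scopes()`), but I have not tried that against our clusters.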

NOAA-GSL / VxIngest