Add ability to store metadata for data collections

mjmcclung commented 5 months ago

OP timer

User story

There may be times when we don't want to share table-level information, but it is still appropriate to share collection-level information so that prospective users know something exists. Pay transparency reporting is one example: it may be useful to know that the information exists, even though the specific tables or details are unlikely to be shared with, or visible to, most visitors to the catalogue. Visibility of a data collection MR may be broader than that of the data asset MR.

There are also situations where certain metadata information (e.g. related documents) relates to a data collection as a whole and it would be more appropriate and efficient to link this information at the data collection level, rather than redundantly repeat it at the data asset level. Examples that come to mind include data models and information management plans that cover an entire collection.

There is also value in knowing details at the data collection level (e.g. security classification, IM classification) to drive broader data management assessments, activities and planning.

This feature supports the following requirements:

Managing Government Information Policy:

1.1 Ministries must be aware of, and able to account for, the information in their custody or control. This includes identifying, capturing, documenting, and managing government information in accordance with applicable legislation, policies, standards, and procedures.
1.5 Ministries must make information in their custody or control accessible and discoverable as appropriate.

Data Management Policy:

2.3 Ministries should ensure that data is not already available from a B.C. Government source (e.g., BC Data Catalogue and ministry data catalogues), or from other reliable sources before creating or collecting it.
5.1 For the data in their custody and/or control, ministries must develop plans, structures, metadata, data models, and data flow diagrams to ensure the data is understood and meets current and long-term needs.

Additional context

There is currently an ability to indicate a 'Series' in the catalogue, but it functions more like a label. There is no current way to add metadata to a series.

Proposed solution

Either add an additional Record type called 'Data Collection', or, preferably, add a qualifier on Data record types to indicate it is a collection (e.g. Does this metadata record represent a data collection?).

If the qualifier is true, it could unlock an additional build section of the MR that lets editors specify which MRs are part of the collection ("Collection assets"). This could work like adding assets for lineage, but without creating the lineage itself, since lineage matters more at the asset level than at the collection level.

Specifying assets as part of a collection needs to create a linkage so that when users are on an asset MR page they can see what collection (if any) the asset belongs to and click on it to find out more information about the collection. Likewise, the collection MR should specify what assets are part of a collection and allow the user to explore them if needed (per visibility settings).
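To make the linkage above concrete, here is a minimal sketch of the qualifier, the "Collection assets" list, and the two-way lookup filtered by visibility. It is an illustration only, not the catalogue's actual data model: `MetadataRecord`, `Catalogue`, `visible_to`, and the `public`/`staff` roles are all hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Hypothetical stand-in for a catalogue metadata record (MR)."""
    mr_id: str
    title: str
    visible_to: set = field(default_factory=set)            # roles allowed to view this MR (assumed)
    is_collection: bool = False                             # the proposed qualifier
    collection_assets: list = field(default_factory=list)   # asset MR ids; used only for collections


class Catalogue:
    """In-memory sketch; a real catalogue would persist these links."""

    def __init__(self):
        self.records = {}

    def add(self, record):
        self.records[record.mr_id] = record
        return record

    def link(self, collection_id, asset_id):
        """Record that an asset MR belongs to a collection MR. No lineage is created."""
        collection = self.records[collection_id]
        if not collection.is_collection:
            raise ValueError(f"{collection_id} is not flagged as a collection")
        if asset_id not in self.records:
            raise KeyError(asset_id)
        if asset_id not in collection.collection_assets:
            collection.collection_assets.append(asset_id)

    def collections_for(self, asset_id, role):
        """Collections an asset belongs to, limited to those the role may see."""
        return [r for r in self.records.values()
                if r.is_collection
                and asset_id in r.collection_assets
                and role in r.visible_to]

    def assets_in(self, collection_id, role):
        """Assets in a collection, limited to those the role may see."""
        collection = self.records[collection_id]
        return [self.records[a] for a in collection.collection_assets
                if role in self.records[a].visible_to]


# Example: the collection MR is broadly visible, the asset MR is not.
catalogue = Catalogue()
catalogue.add(MetadataRecord("c1", "Pay transparency reporting",
                             {"public", "staff"}, is_collection=True))
catalogue.add(MetadataRecord("a1", "Pay transparency detail table", {"staff"}))
catalogue.link("c1", "a1")
```

In this sketch a `public` visitor can see that the collection exists but gets an empty asset list from `assets_in`, while a `staff` user can navigate in both directions.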

Collections could also be used as a search facet, similar to how Series is used now.
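As a rough illustration of the facet idea, the sketch below counts assets per collection and filters by a chosen collection. The index contents and the names `ASSET_COLLECTIONS`, `facet_counts`, and `filter_by_collection` are made up for the example; how the catalogue's search actually builds facets is not assumed here.

```python
from collections import Counter

# Hypothetical index: asset MR title -> titles of the collections it belongs to.
ASSET_COLLECTIONS = {
    "Asset A": ["Collection X"],
    "Asset B": ["Collection X", "Collection Y"],
    "Asset C": [],
}


def facet_counts(index):
    """Number of assets per collection, as a facet sidebar might show."""
    return Counter(name for names in index.values() for name in names)


def filter_by_collection(index, collection):
    """Asset titles that belong to the selected collection, sorted for display."""
    return sorted(title for title, names in index.items() if collection in names)
```

Unlike a Series tag, one asset can appear under more than one collection here, which is why each asset maps to a list.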

Estimated level of effort

1 day
2 days
3 days
4 days
5 days

Definition of done (DoD)

[ ] criteria 1
[ ] criteria 2
[ ] criteria 3

Testing

Automated functional tests

[ ] I have written functional tests for this feature
[ ] I have run the functional tests and they pass

Automated site tests

[ ] I have written site tests for this feature
[ ] I have run the site tests and they pass

This feature requires manual testing

first test step …
second test step …
etc …

ChristaBull commented 5 months ago

A solution like this could potentially help solve a problem I've started to notice with our metadata records. It's not uncommon for us to have multiple collections of data on the same topic with very similar names, which could confuse consumers.

For example, with Property Transfer Tax (PTT) we currently have multiple versions of the data in the Finance Data Store. One table concept in a few of those transactions looks like:

Landed: tblBC_RtnPTTDetail
Staging Row-level: PTT_TRANSACTION
Analytical: PTT_TRANSACTION_FACT / PTT_TRANSACTION_DIM

If we had a data collection option, we could clearly identify all of the tables that make up a specific model (e.g. the analytical product), so the relationship is clearer to clients working with them.

Note, a workaround for my team might be to create another metadata record and link to the different parts within its description if we can link to them. This isn't a system-based connection but would have a similar visual impact.

NicoledeGreef commented 5 months ago

Thanks for the input. For enhancements and ideas, please characterize the business problem that needs solving rather than how it should be solved. The "How" should be left to the development team.

Problem Statement: The current Finance Data Catalogue (June 2024) allows metadata authors to describe Data, Form, and Report assets as metadata records (MRs). There is also a need for a means of grouping asset MRs under a banner entity. Related documents defined on the banner entity would be understood to apply to the entire group of asset MRs linked to it.

How is this different from "Series"? As initially implemented, "Series" is a taxonomy-based attribute that can be applied to one or more MRs; the attribute value can be used as a filter by topic. It lacks dimension because it is essentially a tag, and there is no way to expand the tag's relationship to other things. Asset MRs can be tagged and commonality can be established, but that is insufficient. The taxonomy values are mostly program names, and they don't account for different groupings of data assets that may need to be described slightly differently rather than just tagged with the program name.

How is this different from "Assets used"? "Assets used" helps a metadata author define the lineage of their asset (the items on which it is dependent). The "Assets used" value is based off search results of all MR titles; any asset that can be searched and applied to an MR must already be named as an MR in the Catalogue.

The business seeks a solution that will allow a metadata author to describe a banner entity and define relevant attributes such as description, related documents, and relationships to asset MRs.
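Purely to illustrate the behaviour being asked for, and not as a design, the sketch below shows related documents on a banner entity applying to every linked asset MR. `AssetMR`, `BannerEntity`, and `effective_documents` are hypothetical names, and the document titles are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class AssetMR:
    """Hypothetical asset metadata record."""
    title: str
    related_documents: list = field(default_factory=list)


@dataclass
class BannerEntity:
    """Hypothetical banner entity that groups asset MRs."""
    title: str
    description: str
    related_documents: list = field(default_factory=list)
    assets: list = field(default_factory=list)


def effective_documents(asset, banners):
    """The asset's own documents, followed by those of every banner it is linked to."""
    docs = list(asset.related_documents)
    for banner in banners:
        if asset in banner.assets:
            for doc in banner.related_documents:
                if doc not in docs:  # list each document once
                    docs.append(doc)
    return docs


fact = AssetMR("PTT_TRANSACTION_FACT", ["Data dictionary"])
landed = AssetMR("tblBC_RtnPTTDetail")
analytical = BannerEntity(
    title="PTT analytical product",
    description="Placeholder description",
    related_documents=["Data model", "Information management plan"],
    assets=[fact],
)
```

The documents are attached once at the banner level rather than repeated on each asset MR; an asset that is not linked to the banner, such as `landed` here, keeps only its own documents.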

Note: The above problem statement intentionally steers away from the term "collection" as it is used in other contexts when it comes to data, e.g. "data collection" often means the act of collecting data for a purpose.

bcgov / MFIN-Data-Catalogue