STIRData / user-stories

A repository to hold the epics and user stories used in the project-board "Epics and User Stories"

[USER STORY] Signalling the up-to-dateness of the data #29

Open sskagemo opened 3 years ago

sskagemo commented 3 years ago

As a: User interested in facts about specific companies

I wish to: get information about how up to date the information I find in the STIRData platform is

So that I: can assess whether I need to visit a more up-to-date source, for instance the authoritative business registry directly

The up-to-dateness could, for instance, be shown as symbols or colors for categories such as "within seconds", "within minutes", "within hours", "within days" or something similar.

For example, although the Norwegian Business Register has an API that makes it theoretically possible to keep the data in sync in near real time, the data in the API itself has a fifteen-minute delay from the actual register.
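
A minimal sketch of how such categories could be derived if the delay of a source is known; the function name and the thresholds are only illustrative, not part of any specification:

```python
from datetime import timedelta

def freshness_category(delay: timedelta) -> str:
    """Map the delay between the authoritative register and the platform
    to one of the categories suggested above (illustrative thresholds)."""
    if delay < timedelta(minutes=1):
        return "within seconds"
    if delay < timedelta(hours=1):
        return "within minutes"
    if delay < timedelta(days=1):
        return "within hours"
    return "within days"

# Example: the ~15 minute delay of the Norwegian Business Register API.
print(freshness_category(timedelta(minutes=15)))  # -> "within minutes"
```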

Related to Epic: #4

sskagemo commented 3 years ago

Maybe it is difficult to make the up-to-dateness part of the data itself. One way to solve this might be through the entry about the dataset in the data catalog (assuming that the up-to-dateness is the same for all the data in one dataset). In the same way the data catalog can be used to identify distributions that conform to the specification, some sort of delay relative to the actual business register could be added. I'm not sure if this is covered by DCAT, BregDCAT or any existing vocabularies today.

Such a mechanism would also support a dynamic transition to the best source of data. If a business register initially depends on a STIRData partner to transform the data according to the specification, and later adds a new endpoint to the register itself where the data is fresher, and adds it as a new distribution in the data catalog, the consuming service would automatically identify the new service as a better source through the data catalog and switch to using it. This would support "design for change", i.e. #5
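
A rough sketch of what such catalog entries could look like, modelled as two distributions of one dataset as described above. All identifiers are placeholders, and ex:delayFromRegister is a made-up property standing in for whatever DCAT/BregDCAT term (if any) turns out to cover this; dct:conformsTo is used to mark conformance to the specification.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCAT, DCTERMS, RDF, XSD

EX = Namespace("https://example.org/")  # placeholder namespace for the sketch

g = Graph()

dataset = EX["no-business-register"]
g.add((dataset, RDF.type, DCAT.Dataset))

# Distribution produced today by a STIRData partner, with a known delay
# relative to the authoritative register (one month, as an ISO 8601 duration).
partner = EX["no-br-partner-distribution"]
g.add((dataset, DCAT.distribution, partner))
g.add((partner, RDF.type, DCAT.Distribution))
g.add((partner, DCTERMS.conformsTo, EX["stirdata-specification"]))
g.add((partner, EX.delayFromRegister, Literal("P1M", datatype=XSD.duration)))

# A later, fresher distribution offered by the register itself (~15 minutes).
direct = EX["no-br-direct-distribution"]
g.add((dataset, DCAT.distribution, direct))
g.add((direct, RDF.type, DCAT.Distribution))
g.add((direct, DCTERMS.conformsTo, EX["stirdata-specification"]))
g.add((direct, EX.delayFromRegister, Literal("PT15M", datatype=XSD.duration)))

print(g.serialize(format="turtle"))
```

A consuming service could then compare the delay values of the conforming distributions and automatically switch to the one with the smallest delay.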


jakubklimek commented 3 years ago

There are several questions to consider.

  1. Do we need to distinguish freshness in terms of seconds and minutes? E.g. the Czech registry is updated daily (I would need to verify this), and the pipeline that transforms the data to the STIRData format currently takes almost a day to run, as there are no diffs published and the whole dataset always needs to be transformed. Originally, I thought we were aiming at approximately weekly updates, but this, of course, is use-case dependent. Do we have use cases that would require near-real-time freshness?
  2. There might always be a bit of a problem with updating metadata in a catalog in real time, since the metadata is typically harvested through a series of catalogs with varying frequency (e.g. local catalogs harvested daily into national catalogs, which are harvested weekly into the European one, etc.). That is why there is accrualPeriodicity, which says e.g. that the data is updated daily or weekly without specifying the exact time and date of the last update, which is left to the user to discover themselves (see the snippet below). Having this in the catalog would mean coupling the data transformation processes with the data catalogization.
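
To illustrate point 2: a harvested catalog record typically only carries the declared frequency, not the time of the last update. A small illustration (the dataset IRI is a placeholder):

```python
from rdflib import Graph
from rdflib.namespace import DCTERMS

# Metadata as typically harvested into a catalog: only the declared update
# frequency, no exact timestamp of the last update.
record = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<https://example.org/cz-business-register> a dcat:Dataset ;
    dct:accrualPeriodicity <http://publications.europa.eu/resource/authority/frequency/DAILY> .
"""

g = Graph().parse(data=record, format="turtle")
for _, _, freq in g.triples((None, DCTERMS.accrualPeriodicity, None)):
    print(freq)  # .../frequency/DAILY -- when the data was last updated is not stated
```
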
sskagemo commented 3 years ago

Good questions! I believe we have use cases where at least hour-level freshness is relevant, but I'm going to have a meeting with a bank interested in access related to Know Your Customer information, to confirm that. I believe the accrualPeriodicity attribute could be a starting point, and maybe it is also sufficient. I believe this is also related to #5 and how the platform can dynamically evolve and let consumers use the source with the most up-to-date data.

For instance, a BR may not itself offer the data according to the STIRData specification, but the data is transformed by a STIRData partner, as is currently the case for the Norwegian BR. At the moment this means that there is an update to the data every month (see https://docs.google.com/presentation/d/1U199vHq834BJw0v5TzFor7LQJJgvYAtOWpHnsRIXrFc/edit#slide=id.gf5e9ef95dd_1_0 ). This should be expressed in the catalog with accrualPeriodicity. If (when ...) the Norwegian BR starts to offer data in accordance with the STIRData specification, this can be added to the catalog as a new distribution, equal to the existing distribution in everything except the accrualPeriodicity. Then a consumer that queries the catalog as part of a dynamic lookup for data about Norwegian companies can ask for the distribution with the best update frequency.

I don't think it requires any real-time update of the catalog entry itself, as the accrualPeriodicity attribute will typically be constant for a given distribution.
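
A rough sketch of such a dynamic lookup, assuming the consumer ranks the values of the EU frequency vocabulary; the ordering and the candidate IRIs below are my own assumptions, just to illustrate the idea:

```python
FREQ = "http://publications.europa.eu/resource/authority/frequency/"

# Higher number = more frequent updates (assumed ordering, not authoritative).
FREQ_RANK = {
    FREQ + "ANNUAL": 1,
    FREQ + "MONTHLY": 2,
    FREQ + "WEEKLY": 3,
    FREQ + "DAILY": 4,
}

def pick_freshest(candidates: dict[str, str]) -> str:
    """candidates maps a distribution IRI to its accrualPeriodicity IRI;
    returns the distribution with the best declared update frequency."""
    return max(candidates, key=lambda iri: FREQ_RANK.get(candidates[iri], 0))

candidates = {
    "https://example.org/no-br-partner-distribution": FREQ + "MONTHLY",
    "https://example.org/no-br-direct-distribution": FREQ + "DAILY",
}
print(pick_freshest(candidates))  # -> the direct, daily-updated distribution
```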

What do you think, @jakubklimek - does it make sense?

jakubklimek commented 3 years ago

I agree.

One technical detail though - a dataset with a different accrualPeriodicity will be a different dataset, not just a different distribution.

We already have our datasets registered in our catalog (https://stirdata.opendata.cz/datasets?keywords=Business%20Register), so we can also extend the metadata records accordingly once the data updates become regular.
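
For illustration, once accrualPeriodicity is filled in, a consumer could list the registered datasets and their declared frequency roughly like this; the SPARQL endpoint URL is an assumption, not necessarily what the catalog actually exposes:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://stirdata.opendata.cz/sparql"  # assumed endpoint URL

QUERY = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct:  <http://purl.org/dc/terms/>

SELECT ?dataset ?periodicity WHERE {
  ?dataset a dcat:Dataset ;
           dct:accrualPeriodicity ?periodicity .
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for b in results["results"]["bindings"]:
    print(b["dataset"]["value"], b["periodicity"]["value"])
```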