EMMC-ASBL / oteapi-core

OTEAPI core components
https://EMMC-ASBL.github.io/oteapi-core
MIT License

New minimal "semantic" data model for sessions #176

Open CasperWA opened 2 years ago

CasperWA commented 2 years ago

In a recent discussion between @quaat, @jesper-friis and myself, we decided to move in a direction where the session object is not used to transfer data in any way, but rather references semantic objects (like DLite collections or OSP-Core entities) in which the data is stored. These objects can then be referenced and invoked in the individual strategies as needed.

A first step in moving in this direction is to expand the SessionUpdate pydantic model with a minimal set of fields that may not be overwritten in sub-classes and that are information-complete with respect to retrieving the semantic object and understanding which framework to use (DLite, OSP-Core, etc.).

sygout commented 2 years ago

Why is it needed? Isn't it a big change for the framework? Will it destroy compatibility with previous plugins?

CasperWA commented 2 years ago

We're trying not to break compatibility with previous plugins, but only for a while. Our intention is to move to a more minimal session and keep the data in a semantic container, by default a DLite collection. However, any framework could be used for this internally; that depends mainly on the strategies and on the overall service that installs this package and a select group of plugin packages.

jesper-friis commented 1 year ago

See also the description of session_type/session_id suggested in issue #177.

In addition to these, the download strategy needs a standard way to communicate how to retrieve the downloaded content, e.g. the key under which the content is stored in the data cache. Note that the download strategy has no idea about the meaning of the downloaded content, so it makes no sense for it to try to use session_id to store (the reference to) the content within the underlying interoperability platform.

Would introducing standard download_type/download_key fields in the session be sufficient? For example:

download_type="datacache"
download_key="7e9a7074-a72b-4ccd-9580-7cfd7be516c0"  # a hash of the downloaded content used as key in the datacache

To not bother all parse strategies with these details, OTEAPI could provide a utility function get_downloaded_content(session) that returns the downloaded content. That would also make it much easier to change things later.
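
A minimal sketch of what such a utility could look like, assuming the session behaves like a plain dict and the data cache like an in-memory mapping. The `_DATACACHE` dict and the example content are hypothetical stand-ins, not the actual OTEAPI data cache API:

```python
from typing import Any, Dict

# Hypothetical stand-in for the OTEAPI data cache: a plain in-memory dict.
_DATACACHE: Dict[str, bytes] = {}


def get_downloaded_content(session: Dict[str, Any]) -> bytes:
    """Return the content referenced by download_type/download_key."""
    download_type = session.get("download_type")
    download_key = session.get("download_key")
    if download_type is None or download_key is None:
        raise KeyError("session lacks download_type and/or download_key")
    if download_type == "datacache":
        return _DATACACHE[download_key]
    # Other download types would be dispatched here.
    raise ValueError(f"unsupported download_type: {download_type!r}")


# A download strategy stores the content and records where it put it.
_DATACACHE["7e9a7074-a72b-4ccd-9580-7cfd7be516c0"] = b"a,b\n1,2\n"
session = {
    "download_type": "datacache",
    "download_key": "7e9a7074-a72b-4ccd-9580-7cfd7be516c0",
}

# A parse strategy only calls the helper and never touches the cache directly.
content = get_downloaded_content(session)
```

With this shape, supporting a new download_type would only require changing the helper, not every parse strategy.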


Note that if we have several download strategies after each other in one pipeline, this wouldn't work. But that is not how the pipelines are supposed to be used. However, the following should actually work:

pipe1 = download1 >> parse1 >> mapping1
pipe2 = download2 >> parse2 >> mapping2
pipe3 = pipe1 + pipe2 >> mapping3 >> transformation
pipe3.get()

since the get() method of parse1 will see the values of download_type/download_key assigned by download1 while parse2 will see the values of download_type/download_key assigned by download2.
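
The scoping argument above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Pipe` class, the `download`/`parse`/`passthrough` helpers and the merge rule for `+` are all hypothetical and do not reproduce the actual OTEAPI pipeline implementation. It only shows that each parse step sees the download_key set by the download step in its own branch:

```python
from typing import Any, Callable, Dict

Session = Dict[str, Any]


class Pipe:
    """Toy pipeline: `>>` chains steps, `+` merges two branches."""

    def __init__(self, run: Callable[[Session], Session]) -> None:
        self._run = run

    def __rshift__(self, other: "Pipe") -> "Pipe":
        # Sequential: `other` sees the session produced by `self`.
        return Pipe(lambda s: other._run(self._run(s)))

    def __add__(self, other: "Pipe") -> "Pipe":
        # Merge: each branch runs on its own copy of the input session,
        # and the resulting sessions are combined afterwards.
        def run(s: Session) -> Session:
            return {**self._run(dict(s)), **other._run(dict(s))}

        return Pipe(run)

    def get(self) -> Session:
        return self._run({})


def download(key: str) -> Pipe:
    # Records where the (pretend) downloaded content was stored.
    return Pipe(lambda s: {**s, "download_type": "datacache", "download_key": key})


def parse(name: str) -> Pipe:
    # Records which download_key this parse step observed.
    return Pipe(lambda s: {**s, f"{name}_saw": s["download_key"]})


def passthrough() -> Pipe:
    # Stand-in for mapping/transformation steps, which ignore the download fields.
    return Pipe(lambda s: dict(s))


pipe1 = download("key-1") >> parse("parse1") >> passthrough()
pipe2 = download("key-2") >> parse("parse2") >> passthrough()
# `+` binds tighter than `>>` in Python, so this is (pipe1 + pipe2) >> ...
pipe3 = pipe1 + pipe2 >> passthrough() >> passthrough()
result = pipe3.get()
```

In this toy model, `result["parse1_saw"]` is `"key-1"` and `result["parse2_saw"]` is `"key-2"`, while the merged session's own download_key is simply whichever branch was merged last.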

jesper-friis commented 1 year ago

When splitting the dataresource datamodel into download (may be called dataresource) and parse, it was suggested that mediaType should be in the download datamodel. But its value is needed by the parse strategy, so the download strategy also needs some way to communicate it to the parse strategy. Maybe yet another standard download_mediaType field in the session is required for this?

It may be generalised to download_configuration if we expect that the parse strategy may utilise more fields from the download configuration.

jesper-friis commented 1 year ago

In addition to session_type and session_id suggested in issue #177, we may also need a session_configuration field of type dict. For instance, for a "dlite" session on a distributed system, we need to specify which storage to store the collection in for communication between strategies. Such information could go into session_configuration.
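
As a rough sketch of how these three fields might fit together: the example below uses a frozen stdlib dataclass as a stand-in for a pydantic model so that it is self-contained. The class name `SemanticSession`, the `"storage"` key and all example values are hypothetical:

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SemanticSession:
    """Hypothetical minimal session: references a semantic object, holds no data."""

    session_type: str  # which framework to use, e.g. "dlite"
    session_id: str  # identifier of the semantic object within that framework
    # Framework-specific settings, e.g. where a DLite collection is stored.
    session_configuration: Dict[str, Any] = field(default_factory=dict)


session = SemanticSession(
    session_type="dlite",
    session_id="example-collection-id",  # placeholder value
    session_configuration={"storage": "example-storage-location"},  # hypothetical key
)

# The fields cannot be reassigned once the session has been created.
try:
    session.session_type = "other"  # type: ignore[misc]
    reassigned = True
except FrozenInstanceError:
    reassigned = False
```

Freezing the instance is just one way to mimic the "may not be overwritten" requirement here; it does not by itself prevent a sub-class from redefining the fields.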

jesper-friis commented 1 year ago

An important question that has not been answered here is where session_type, session_id and session_configuration should be provided. These fields are unrelated to the data documentation and therefore don't belong in the configuration of the individual strategies.

This is the same question as addressed in issue #211.