paullatzelsperger commented 2 years ago

This Issue deals with creating a specification of the Data Management API based on the design considerations and recommendations stated in the test listed below ~~I didn't want to paste directly here, to avoid walls of text.~~

Core aspects

1. Endpoints vs Use cases

Following the REST principles, every endpoint should modify the state of an object rather than trigger behaviour. For example, to initiate a data transfer the endpoints should be POST http://your-connector.com/api/v1/transfer/transfer-request rather than POST http://your-connector.com/api/v1/transfer/start-transfer Consequently, uses cases (as outlined below) might not have a 1-to-1 translation into API endpoints, they might even involve several consecutive API calls.

2. `delete` is only possible sometimes

objects may only be deleted if they are not involved in ongoing contracts, etc. Updating is (initially) done through "delete-and-recreate"

3. support pagination

every GET endpoint should support pagination, sorting and filtering

4. Desired outcome

We should have a (or several) YAML files containing the specification of each of the endpoints. In addition, we should have Controller classes that implement that spec, even if most of them are just stubs returning dummy data for the immediate future. So, by referencing the data-management-BOM, a controller implicitly exposes the entire API. Note: we will NOT consider the standalone deployment scenario, especially the communication channel, in this issue

5. Module structure

Every domain object should be reflected in the module

extensions/api/data-management/asset                      //includes DataAddress
extensions/api/data-management/contractdefinition
extensions/api/data-management/contractnegotiation
extensions/api/data-management/transferprocess
extensions/api/data-management/policy                      //includes PolicyDefinition

a BOM should exist for the entire API.

Full proposal

Cross-cutting aspects

Terminology

domain object: (or "business object") an object such as Asset, DataAddress, ContractDefinition etc. May contain potentially internal information.
entity: separate database-specific object, that is used to persist a domain object. May differ from business objects and may contain database-specific information, such as IDs, foreign keys. For example, the ContractAgreementEntity would not contain the entire Asset, but merely maintain a foreign key to an AssetEntity. Entities are never to be passed outside the persistence layer.
DTO: public representation of a domain object that is specific to a use case. Should contain only the information that is necessary. Different use cases may warrant different DTOs for the same business object.

Object Lifecycle

Most domain objects go through a lifecycle where transitions might or might not be possible at a given time. Or, certain transitions may trigger subsequent actions, which could potentially be long-running or even involve human interaction. For example, decommissioning an Asset that is involved in a ContractAgreement might cause the contract to expire, or not be possible at all. A similar situation exists for ContractDefinitions.

We'll assume that domain objects have some sort of state associated with them, regardless whether that is implemented using an actual state field, or a computed state, or anything else.

We propose the following definition of states domain objects can make use of:

INITIAL: object is not included in any communication with other dataspace participants. This state is optional, API clients may choose to "publish immediately".
PUBLISHED: object is (potentially) visible to other participants.
DEPRECATED: object is only available to existing relations, e.g. an Asset is not included in new contract offers anymore.
DECOMMISSIONED: object is not in use anymore (i.e. not included in any active contract agreement or offer)
DELETED: object is not visible to clients anymore, just exists for traceability reasons.

Decommissioning: as a starting point let's go with option a) as it is simpler to implement. If necessary, we can change this later.

Authentication

This solely deals with access to the API and is completely separate from the IdentityService and connector-to-connector authentication. There should be a thin authentication-spi extension that contains an AuthenticationService, which could then be backed by an implementation extension LDAP, AAD, Apache Shiro, Keycloak, OAuth2/OIDC, etc. In a first PoC it could be sufficient to have an API-Key-based authentication implementation.

Updating and deleting items

Modifying certain domain objects could potentially have complicated and legally relevant consequences, therefor there is no UPDATE API for those domain objects. There will be DELETE endpoints, which return an error when the item in question cannot be deleted, e.g. when an Asset is already involved in a ContractAgreement.

Consequently, whether or not an item can be "updated" (i.e. deleted and recreated) depends on its state, which could be an explicit field on the domain object, or it could be a computed property, e.g. by checking whether an Asset is involved in a ContractAgreement or ContractNegotiation.

That said, creating new items is always possible, but they will have a different identity.

API layer

In order to decouple the API from the internal domain objects and entities there should be a dedicated API layer, that contains:

domain objects: those are the data objects we have now, e.g. Asset, ContractAgreement, etc. They are not database entities!
DTOs: specific to a use case. should contain as little information as possible (e.g. ID instead of full object). Cannot contain internal information.
transformers: convert DTOs into the corresponding domain objects (Asset, DataAddress, ContractDefinition, etc.) and back.
services: must model a particular use case. For example, there should be an AssetService, that has methods uploadAsset(), publishAsset(), and rescindAsset(). Such a service would then maintain references to the AssetIndex, the ContractDefinitionStore, etc and be responsible for transactional consistency.
Pagination/Sorting: all Read endpoints must support filtering, sorting and pagination. Even though cursor-based pagination is widely used nowadays, it is mostly used for real-time applications (things like Facebook) where data is expected to change constantly. Here, we'll deal with mostly static data, so we can reap the benefits of offset pagination as it is easier to implement. A Pagination object could like this:
```
class PagingSpec {
int getOffset(); // default: 0
int getLimit(); // default: 50
@Nullable String getSortCriteria(); // refers to one particular field of the object
@Nullable String getFilterExpression();
SortOrder getSortOrder(); // can be ASC or DESC
}
```
This will greatly improve usability and performance in UI views such as tables and lists, and will reduce stress on the involved systems.

On a related note, the entities that we have now, e.g. Asset, ContractDefinition will most certainly differ from the actual database entities. For example, a ContractDefinition would not contain the Asset as nested object, but would maintain a foreign key to it. Thus, the way domain objects are modeled may/will differ from how database entities are modeled.

Assets + DataAddress

`Create`

request must contain the Asset and the DataAddress, both stored transactionally. The DataAddress may initially be an empty skeleton.
DataLoader can be used to do that.

`Read`

GET endpoints for Asset and DataAddress should exist.
use the AssetIndex with a SELECT_ALL query
UI implementations could create filter expressions -> use Criterion and add pagination

`Update`

Updating the DataAddress is done through a separate endpoint. Copy/clone is not needed, DataAddresses can just be modified as they are an internal concept and do not modify the characteristics of an Asset.
Assets can not be updated.

`Delete`

Deleting Assets is only possible while they are not included in an ongoing ContractNegotation or ContractAgreement. The request will fail otherwise.
If an Asset is still in this manner "unpublished", deleting it is straight-forward.
again, a Danger Zone could be implemented to blindly delete an Asset without looking left or right.

Danger Zone

As a convenience feature we could implement a danger zone, where deleting assets is possible but no validation is performed, which could put the application's data backend in an invalid or inconsistent state.

Policies + PolicyTemplates

Policies are pre-loaded as so-called PolicyTemplates and are stored in a PolicyTemplateStore. Upon instantiation of a policy e.g. in the ContractDefinition, the template is copied into the ContractDefinition with a new policy ID.

The reason we propose to actually copy the policy, as opposed to reference it, is that copying them would clutter the Read endpoint with a potentially high number of almost identical policies. Assuming that this endpoint will be used to select policies for the creation of ContractDefinition, it is simply a matter of usability. Furthermore, a ContractDefinition is legally binding and should remain in place, even if the backing policy changes at some point.

In order to bring this feature back in line with other domain endpoints, there could be an additional PolicyStore, where all instantiated policies are stored, so as not to persist them embedded in a ContractDefinitions.

`Create`

no special considerations. Copies a PolicyTemplate into the PolicyStore effectively "instantiating" it.

`Update`

PolicyTemplates can simply be updated using copy/clone semantics, without special considerations. Instantiated Policy objects are immutable.

`Delete`

PolicyTemplates can simply be decommissioned (using the lifecycle process), without special considerations.

ContractDefinitions

`Create`

request must contain ID of contractPolicy and accessPolicy (cf. previous section)
Policies must pre-exist in a PolicyTemplateStore and upon creating the ContractDefinition the policy is simply copied.
Those copies have different PolicyId values, so they are technically different policies, there is no way to backreference the PolicyTemplate.

`Read`

the GET endpoint should accept an optional paging object
if we want to re-use the ContractDefinitionStore, the interface needs to be amended to accept those parameters

`Update`

no update, just "create-new"

`Delete`

Deleting a ContractNegotation is only possible as long as there is no ContractAgreement based off of it. There is no way of knowing whether a ContractDefinition is involved in an ongoing ContractNegotation, so deleting it will cause the CN to fail. ``

Lifecycle aspects Decommissioning: The ContractDefinitionStore cannot return decommissioned or deprecated definitions.

ContractNegotiations

`Read`

no special considerations, filtering and pagination must be implemented.

`Update`

not possible through the API.

`Delete`

not possible throught the API.

`Cancel`

There could be a CANCEL endpoint which would effectively move it to a (new) CANCELLING/CANCELLED state.

`Decline`

see Cancel, would move it to the DECLINING/DECLINED state.

Q: Should the ContractNegotiationManager use copy/clone semantics as well?upd

ContractOffers

`Create`/`Read`/`Update`/`Delete`

ContractOffers are not domain objects in the conventional sense, they are generated on the fly for every participant, so we cannot directly "touch" those. They are protocol details rather than domain objects. Rescinding an offer would have to be done by updating the ContractDefinition. To browse other participants' offerings, see use case "Browse Catalog"

TransferProcesses

`Create`/`Update`/`Delete`

not possible

`Read`

no special considerations, filtering and paging must be implemented.

`Cancel`

Since there is no delete operation, the CANCEL endpoint would effectively send a TransferProcess to a CANCELLED state in the next tick-over. The CommandQueue must be used for this.

Note: there will be other APIs pertaining to TransferProcesses where they can be updated. Necessary endpoints would be:

PROVISIONING/DEPROVISIONING done (= state changes, could be done with status checkers that listen on a webhook)
USAGE POLICY expired
DELETION confirmed
DATA HAS CHANGED (sending notifications to EDC, e.g. when an AI model has improved)
The API of backend data services needs to be done as well.

Use Cases

Upload an Asset
Publish Asset
Rescind/unpublish an Asset
Browse Catalog
Request access to an Asset (i.e. start negotiation)
Cancel request (= decline or cancel negotiation)
Transfer an asset, potentially enter data destination on-the-fly
Cancel data transfer, up until it's actually in IN_PROGRESS
Deprovision data transfer

Linked Issues - Preliminary work:

[x] #583
[x] #584
[x] #593
[x] #586

Linked Issues - endpoints

[x] #587
[x] #588
[x] #589
[x] #590
[x] #591

This Issue replaces/supersedes #476 [EDIT 1] kicked "copy/clone" [EDIT 2] corrected some whitespaces [EDIT 3] update description and changed "TL;DR" -> "Core aspects", added sub-issues

jimmarino commented 2 years ago

Good start. One quick comment:

extensions/api/data-management/asset

In my mind doesn't model an asset but an "asset/data address pair". We should come up with a different name for it. You could probably make an argument that "Asset" without the data address is one representation of this "pair" but I think it is a separate thing. For example, the data address of an asset can change without that semantically resulting in a new asset.

paullatzelsperger commented 2 years ago

We've called that an AssetEntry in previous implementations. That's also what the DataLoader utilizes at the moment. WDYT?

florianrusch-zf commented 2 years ago

We've called that an AssetEntry in previous implementations. That's also what the DataLoader utilizes at the moment. WDYT?

Would that mean we have two REST resources?

AssetEntry (asset + data address pair which is modifiable)
Asset (asset object which could only be cloned)

paullatzelsperger commented 2 years ago

Would that mean we have two REST resources?

AssetEntry (asset + data address pair which is modifiable)

Asset (asset object which could only be cloned)

I would even say there are three different resources:

AssetEntry (or whatever we wanna call it) - can only be created
Asset - can be read, updated (cloned) and decommissioned. The latter will delete the associated DataAddress
DataAddress - can be read and modified to your heart's content as it does not impact the Asset directly.

florianrusch-zf commented 2 years ago

To 1: 👍 I'm fully with you!

To 2: How would the copy/clone endpoint look like? Would it be like a "normal REST endpoint" ([PUT/PATCH] /asset) which creates a new object?

To 3:

Regarding the fact, that we have some relationships between the resources/objects, we could also thing about to support HATEOS. Maybe not in the first round, but maybe later on.
Just also a thought, but what is with API versioning?

To 4: Two things:

What does BOM means in this context? If I understand you correctly, you mean that the API is exposed when someone uses the data-management lib/module. The first though of me was that BOM would mean "bill of material", which confused me a little bit 😄
I'm not so comfortable with having dummy APIs on the main branch. Do we have some gaps to fill until we can create the individual endpoints? I would split the work in multiple PRs for each resource endpoints. That way we can work in parallel and also have on the main branch only what is finished.

Regarding the documentation of the APIs in the meantime: We could create a openAPI doc of how the API could look like when we are done with the implementation. This could then be also used as an acceptance criteria for the PRs, and we could also use it to test the API in the CI afterwards. (On my last project, we worked with Spring Cloud Contracts to ensure that the APIs would work as intended during further development after the API was implemented. But I'm not sure how we could do that in this project. WDYT? 🤔)

paullatzelsperger commented 2 years ago

lets not get ahead of ourselves here. This is merely the issue to work out the API spec. Of course that will result in multiple PRs. All I want to achieve here is to have consent, about the requirements, limitations and scope of the data mgmt api.

copy/clone: dont know yet, we should solve this as generic as possible. PUT and PATCH both seem appropriate verbs.

HATEOAS: in my personal view, HATEOAS won't solve any problems. It only makes things more complicated and convoluted.

"BOM": you're right. It refers to the fact that there should be a single parent build file that "gives you" all the API submodules, and exposes them. So what you said is correct :) Take a look at extensions/catalog for an example.

From a workflow perspective it is much easier to create actual java code and generate the YAML spec out of them, than the other way round.

Dummy endpoints are fine, as the entire EDC has not yet reached a stable state and it will give people a chance to implement/generate their client stubs.

DominikPinsel commented 2 years ago

Thanks for your suggestion, although I would have appreciated it if you could have added it to the original Data Management API issue, to not have two duplicated issues about the same topic. Original Issue: https://github.com/eclipse-dataspaceconnector/DataSpaceConnector/issues/476

You mention concepts like version counters, back pointers to ancestors, lineage of domain objects, an AssetSchema, read only properties, a new Policy Template, a new Policy Store and so on. I see here a lot of added complexity in your proposal and don't understand why all this is necessary.

I can see clearly where you are coming from. The manipulating of some these objects might have legal consequences. An important issue to address.

First, I assume there is a little of a misunderstanding in this proposal regarding the ContractAgreement. At the time of writing the ContractAgreement does contain the Asset objects itself (to ensure immutability). This protects the ContractAgreement from unintended side effects. For example when an Asset, ContractDefinition or Policy is updated. I like this solution because it's simple and easily enforceable by preventing updates on the ContractAgreement. What are the benefits of your proposed solution (in comparison to the existing solution)?

Second, I would like to address a new level of complexity that is introduced here. I assume now you suggest replacing the assets of the contract agreement with a pointer, pointing to that asset? And then you would do the same for all the other objects, that contain the Assets, too. For example the ContractOffer or the TransferProcess. I assume for the objects of the new PolicyStore you would probably do something similar, so that the ContractOffer and -Agreement only reference a policy id.

In conclusion this could mean, that the storage that contains the Assets, might be another storage that contains the assets for the ContractAgreement. And there could exist a third storage for the TransferProcess (and so on). This might make correctly executed XA-Transactions mandatory for some operations. A feature that is probably not supported by all storage systems (e.g. I could not find something about XA-Transactions for Azure CosmosDb).

As you can see I see a lot of risks and complexity here. Maybe you can elaborate why or where such complexity might be necessary in the future. From my point of view it looks like the most important object we need to be extra careful about is the ContractAgreement. But at the same time, this is also the only object the user should never be updated (except maybe the DataAddress). What did I miss?

Additionally, some objects might be updated pretty regularly, like the TransferProcess or the ContractNegotation. States could move here back and forth. Where is the risk here in comparison to "just update the object"?

Thanks in advance.

_{Dominik Pinsel dominik.pinsel@daimler.com, Daimler TSS GmbH, legal info/Impressum}

paullatzelsperger commented 2 years ago

You mention concepts like version counters, back pointers to ancestors, lineage of domain objects, an AssetSchema, read only properties, a new Policy Template, a new Policy Store and so on. I see here a lot of added complexity in your proposal and don't understand why all this is necessary.

There are no risks, it is necessary because there are requirements for it. Whether or not it's complex is a moot point.

First, I assume there is a little of a misunderstanding in this proposal regarding the ContractAgreement. At the time of writing the ContractAgreement does contain the Asset objects itself (to ensure immutability).

No misunderstanding. Immutability can be ensured through other avenues than embedding it. From a data modelling perspective it makes no sense to embed (and thus duplicate) the Asset in the ContractAgreement. We'd simply run into consistency problems (modifying the original asset, backreferencing from ContractAgreement to Asset, etc.).

This protects the ContractAgreement from unintended side effects. For example when an Asset, ContractDefinition or Policy is updated. I like this solution because it's simple and easily enforceable by preventing updates on the ContractAgreement. What are the benefits of your proposed solution (in comparison to the existing solution)?

First of all, there are no updates on ContractAgreements. FWIW, there are no updates (in the strict sense) at all. As stated before, every "update" is actually a clone, so the original Asset that belongs to a ContractAgreement remains immutable. The benefits of this are consistent programming models, clear and traceable object relations and lineage and a clean data model.

Second, I would like to address a new level of complexity that is introduced here. I assume now you suggest replacing the assets of the contract agreement with a pointer, pointing to that asset? And then you would do the same for all the other objects, that contain the Assets, too. For example the ContractOffer or the TransferProcess.

I'm not talking about pointers. There is a difference between domain objects having references and database entities having foreign keys. No, I would not do the same. ContractOffers are protocol messages rather than domain objects, and TransferProcess don't contain assets.

I assume for the objects of the new PolicyStore you would probably do something similar, so that the ContractOffer and -Agreement only reference a policy id.

From a client perspective need to get a list of available policies (or rather: policy definitions) and use them in e.g. a create ContractDefinition operation. As of now we have no way of reusing policy definitions. This is a problem and must be addressed with something akin to a PolicyDefinitionStore. There is the notion of "instantiating" a policy, as opposed to "referencing" it.

In conclusion this could mean, that the storage that contains the Assets, might be another storage that contains the assets for the ContractAgreement. And there could exist a third storage for the TransferProcess (and so on). This might make correctly executed XA-Transactions mandatory for some operations. A feature that is probably not supported by all storage systems (e.g. I could not find something about XA-Transactions for Azure CosmosDb).

No. All Assets are in one place. That's the beauty of it. The ContractAgreement in one store would simply reference the Asset in the AssetIndex. Every domain object potentially has its own storage, this design decision has been made long ago. As I have explained before, there really is no way around distributed transactions, even if some systems don't support them fully.

As you can see I see a lot of risks and complexity here. Maybe you can elaborate why or where such complexity might be necessary in the future. From my point of view it looks like the most important object we need to be extra careful about is the ContractAgreement. But at the same time, this is also the only object the user should never be updated (except maybe the DataAddress). What did I miss?

I see no no risks, just complexity. That's due to the fact that we're dealing with a highly distributed and extensible system. Even if we follow your original proposal, we'd end up with the same problems, just at a different location, which - if not addressed - will eventually produce a non-functional system.

Additionally, some objects might be updated pretty regularly, like the TransferProcess or the ContractNegotation. States could move here back and forth. Where is the risk here in comparison to "just update the object"?

There is a huge difference between objects being modified through the API and modified internally, such as TPs or CNs. In the latter case copy/clone is not necessary.

All those points you raised have been discussed, it'll all turn out plausible once my proposal for an API is finished.

MoritzKeppler commented 2 years ago

I would be great to have the whole spec from the attached md included in the first comment and keep it updated based on the comments - similar to how Jim is doing it in #463 We can move it to appropriate places in the repo (docs, javadoc, glossary, ...) as soon as we start to work on it. _{Moritz Keppler moritz.keppler@daimler.com, Daimler TSS GmbH, legal info/Impressum}

paullatzelsperger commented 2 years ago

that'll be a wall of text incoming :) i'll update the original issue descr as there are already comments.

MoritzKeppler commented 2 years ago

every modification triggers the creation of a new domain object.

hm, so in case a DataAddress changes a client needs to call update DataAddress to create a clone with the updated information update Asset to create a clone of the Asset pointing to the new Address update ContractDefinition to create a clone of all ContractDefinitions referencing the old copy of the Asset delete ContractDefinition to remove all now deprecated ContractDefintions?

guess I'm just confused, thx for some enlightenment!

MoritzKeppler commented 2 years ago

Deployment Scenario 2: As a standalone runtime ("API server")

just a standard api gateway placed in another security zone? Or would it include more logic?

MoritzKeppler commented 2 years ago

In some situations there may be requirements to have mutable/immutable

wouldn't it be sufficient to exclude immutable fields from the DTOs?

paullatzelsperger commented 2 years ago

Hey @MoritzKeppler , i'll try and answer all in-inline

hm, so in case a DataAddress changes a client needs to call update DataAddress to create a clone with the updated information update Asset to create a clone ...

The DataAddress can be modified directly without copy/cloning it, as it is an internal concept and existing contracts won't be affected. I should have explicitly stated that, will update accordingly.

just a standard api gateway placed in another security zone? Or would it include more logic?

Exactly. There would be a separate runtime/process, potentially running in another process and/or on a different server. This would require additional logic in so far as that runtime would need a communication channel with the EDC (e.g. rest, websockets). That "API runtime" should not directly modify persisted objects to avoid consistency problems.

wouldn't it be sufficient to exclude immutable fields from the DTOs?

Yes, for the API that would be sufficient. Having that feature directly "on the Asset" however, would enable this on a wider scope. I'm fine with starting with the DTO approach and put the rest off for later.

MoritzKeppler commented 2 years ago

thx! So I am in favor of ignoring the api gw / security zone topic for now and solely concentrate on api definition. Also fine for me to start with the "DTO approach" to protect the immutable fields.

paullatzelsperger commented 2 years ago

@MoritzKeppler updated the description.

paullatzelsperger commented 2 years ago

closed as all subtasks are complete

eclipse-edc / Connector

Specification of the Data Management API #542

Core aspects

1. Endpoints vs Use cases

2. delete is only possible sometimes

3. support pagination

4. Desired outcome

5. Module structure

Full proposal

Cross-cutting aspects

Terminology

Object Lifecycle

Authentication

Updating and deleting items

API layer

Assets + DataAddress

Create

Read

Update

Delete

Danger Zone

Policies + PolicyTemplates

Create

Update

Delete

ContractDefinitions

Create

Read

Update

Delete

ContractNegotiations

Read

Update

Delete

Cancel

Decline

ContractOffers

Create/Read/Update/Delete

TransferProcesses

Create/Update/Delete

Read

Cancel

Use Cases

Linked Issues - Preliminary work:

Linked Issues - endpoints

2. `delete` is only possible sometimes

`Create`

`Read`

`Update`

`Delete`

`Create`

`Update`

`Delete`

`Create`

`Read`

`Update`

`Delete`

`Read`

`Update`

`Delete`

`Cancel`

`Decline`

`Create`/`Read`/`Update`/`Delete`

`Create`/`Update`/`Delete`

`Read`

`Cancel`