fybrik / fybrik

Fybrik
https://fybrik.io
Apache License 2.0

Write Blog on Write Flow #1608

Closed flora177 closed 1 year ago

flora177 commented 2 years ago

Describe the two scenarios: allowing and denying a write. Consider describing two options, WKC and Katalog (check whether previous blogs mentioned Katalog), and leave a placeholder for Open Metadata.

Expected time (without Open Metadata): Aug 10
New estimated time (as of Aug 11): Sep 15

FYI - @simanadler @roytman

revit13 commented 2 years ago

Options for the blog story: @simanadler @shlomitk1 @Mohammad-nassar10

Reference: Add notebook sample to show the case of writing new asset · Issue #1459 · fybrik/fybrik (github.com)

Option 1

Another potential scenario is:

There is an existing catalog entry, which is in geography B
User's workload is in geography A
User requests to write to the existing asset
Deny is received, due to a governance policy that says workloads in geography A cannot write to datasets in geography B
@shlomitk1 @ERES Would this make sense?

comments: Tables in S3 are immutable, so the only addition a user could make to an existing table is to add rows using the "append" mode. I think the story would make sense if Eva (the data scientist) added columns or a new table to save her computations. Right?
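As a rough illustration of the Option 1 rule (workloads in geography A cannot write to datasets in geography B), the sketch below models the deny decision in plain Python. It is not Fybrik's policy language or API: `WriteRequest`, `DENIED_GEOGRAPHY_PAIRS`, and `evaluate_geography_rule` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRequest:
    """Hypothetical description of a write request to an existing catalog entry."""
    workload_geography: str  # where the user's workload runs
    dataset_geography: str   # where the existing dataset resides


# Assumed rule table: (workload geography, dataset geography) pairs that are denied.
DENIED_GEOGRAPHY_PAIRS = {("A", "B")}


def evaluate_geography_rule(request: WriteRequest) -> str:
    """Return "deny" for a forbidden geography pair, otherwise "allow"."""
    pair = (request.workload_geography, request.dataset_geography)
    return "deny" if pair in DENIED_GEOGRAPHY_PAIRS else "allow"
```

With this rule table, a workload in A writing to a dataset in B is denied, while a workload in A writing to a dataset in A is allowed.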

Option 2

Another potential scenario is that the user requests to write to a type of storage (e.g. Dropbox) that is not supported by the enterprise. In that case the governance policy would relate to the connection received from the catalog.

The user creates a catalog entry ahead of time that has a connection to Dropbox
In the FybrikApplication, the user indicates the catalog entry and that he wants to write to it
The governance policy says that if the connection is of type Dropbox, then deny. (If we don’t know how to parse the connection to figure out it’s Dropbox, take a shortcut for now and just put a tag on the catalog entry)
Fybrik should receive deny from the governance engine
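The decision described in these steps can be sketched as follows. This is a minimal Python model under stated assumptions, not Fybrik's governance engine or catalog API: `CatalogEntry`, its `connection_type` and `tags` fields, and `evaluate_write_to_entry` are hypothetical names. It covers both the check on the connection type and the tag shortcut mentioned above.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class CatalogEntry:
    """Hypothetical stand-in for the metadata returned by the data catalog."""
    name: str
    connection_type: str                         # assumed field, e.g. "s3" or "dropbox"
    tags: Set[str] = field(default_factory=set)  # assumed free-form tags on the entry


UNSUPPORTED_STORAGE = "dropbox"


def evaluate_write_to_entry(entry: CatalogEntry) -> str:
    """Return "deny" when the entry points at unsupported storage, else "allow"."""
    # Preferred check: inspect the connection type itself.
    if entry.connection_type.lower() == UNSUPPORTED_STORAGE:
        return "deny"
    # Shortcut from the discussion: rely on a tag when the connection cannot be parsed.
    if UNSUPPORTED_STORAGE in entry.tags:
        return "deny"
    return "allow"
```

The tag check keeps the scenario workable even when the connection details are opaque to the policy.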

comments: I think that in a blog the IT admin creates the connections, and it would not make sense for him to create a connection that is unsupported by the enterprise. Right?

comments/questions (Flora):

  1. Do we expect a data user to write a connection?
  2. Can we label a connection with "read only"?
  3. How do we write a policy regarding a connection?

Option 3

The story begins where the ING and IBM blog ends. There is a multi-cluster setup with one cluster in the Netherlands and one in Turkey. Eva (a data scientist) runs the notebook in the Netherlands. There is one object storage in Turkey and one in the Netherlands. A policy dictates writing data close to the computation. However, the storage in the Netherlands is full according to the storage purchase plan. (??? @simanadler are there other reasons not to use the Netherlands storage in this case? The implicit copy in the blog uses the Netherlands storage in a similar case, which makes it hard to find reasons not to use it.) Thus, when Eva wants the result of her calculation saved in a new table, containing for example the average amount for each type of transaction, she gets an "error" result. She asks Tim, the IT admin, to extend the storage plan in the Netherlands, and afterward she succeeds in writing her data.

comments: the policy is not a governance policy but rather a hard config policy.

Option 4

Unrelated to ING and IBM blog

The scenario you probably want to demonstrate is:

a new asset has a sensitive field
Fybrik is deployed with storage in regions r1 and r2 (whatever their names are...)
A governance policy forbids writing sensitive data to regions r1 and r2

details:

There is a multi-cluster setup with one cluster in the Netherlands and one in Turkey. Eva (a data scientist) runs the notebook with sensitive data in the Netherlands. (??? no details on how the data was ingested?) She wants to save her computations in a new table. Object storage is available in Turkey and in Greece. A governance policy forbids writing sensitive data to the regions Turkey and Greece, and thus her request to write a new table is denied. Eva consults with Tim, the IT admin, who decides to allocate new storage in the Netherlands. Eva reapplies her request, which now succeeds.
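The deny-then-succeed flow in this scenario can be sketched as a region selection step. This is only an illustration in Python, not how Fybrik selects storage: `pick_write_region` and its parameters are hypothetical, and the sketch simply assumes the asset is already known to contain a sensitive field.

```python
from typing import Iterable, Optional, Set


def pick_write_region(
    has_sensitive_field: bool,
    storage_regions: Iterable[str],
    forbidden_for_sensitive: Set[str],
) -> Optional[str]:
    """Return the first region allowed for this asset, or None if the write is denied."""
    for region in storage_regions:
        if has_sensitive_field and region in forbidden_for_sensitive:
            continue  # the policy forbids writing sensitive data to this region
        return region
    return None


forbidden = {"Turkey", "Greece"}
# Before Tim allocates new storage: every available region is forbidden, so the write is denied.
before = pick_write_region(True, ["Turkey", "Greece"], forbidden)
# After storage is allocated in the Netherlands, the same request finds an allowed region.
after = pick_write_region(True, ["Turkey", "Greece", "Netherlands"], forbidden)
```

Here `before` is `None` (deny) and `after` is `"Netherlands"`, mirroring Eva's two attempts.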

shlomitk1 commented 2 years ago

Option 1: The geography constraint seems a little artificial. A new idea: there is a "read-only" dataset that can be updated only by the data owner. Users who don't own the data cannot update this dataset.

shlomitk1 commented 2 years ago

Option 2: The connection is valid, but the enterprise allows only reading from it and not writing to it (because of the concern about giving away confidential information).

shlomitk1 commented 2 years ago

Option 3: The storage in NL is currently unavailable because of some networking issues. Until they are resolved, the relevant storage account is not created.

shlomitk1 commented 2 years ago

Option 4: Eva wants to store data that contains information about underage patients. The governance policy forces Eva to remove the problematic records. Eva's request fails because there is no ability to filter records. The IT admin adds a new algorithm that provides this functionality, and the request is completed successfully.
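A minimal sketch of this variant, in Python and under stated assumptions: the governance decision requires an action (here named `filter_records`), and the write fails until some deployed capability can perform it. The action name, the `age` field, the threshold of 18, and both function names are hypothetical and do not come from Fybrik.

```python
from typing import Callable, Dict, List

Record = Dict[str, int]
Action = Callable[[List[Record]], List[Record]]


def filter_underage(records: List[Record]) -> List[Record]:
    """Hypothetical filtering algorithm: drop records of patients younger than 18."""
    return [r for r in records if r["age"] >= 18]


def write_with_required_action(
    records: List[Record],
    required_action: str,
    available_actions: Dict[str, Action],
) -> List[Record]:
    """Apply the action required by the policy, or fail if nothing can perform it."""
    if required_action not in available_actions:
        raise RuntimeError(f"no deployed capability supports '{required_action}'")
    # Returns the records that would actually be written.
    return available_actions[required_action](records)
```

Calling `write_with_required_action` with an empty `available_actions` raises an error (Eva's first attempt). Once the IT admin registers `filter_underage` under `filter_records`, the same call returns only the adult records.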

simanadler commented 2 years ago

Option 1

Another potential scenario is:

There is an existing catalog entry, which is in geography B
User's workload is in geography A
User requests to write to the existing asset
Deny is received, due to a governance policy that says workloads in geography A cannot write to datasets in geography B
@shlomitk1 @ERES Would this make sense?

comments: Tables in S3 are immutable, so the only addition a user could make to an existing table is to add rows using the "append" mode. I think the story would make sense if Eva (the data scientist) added columns or a new table to save her computations. Right?

The tables may be immutable but a dataset in the catalog doesn't necessarily refer to a single file in COS. I believe the dataset can refer to the bucket, and a new file can be written to the bucket - i.e. updating the dataset.

Option 2


Another potential scenario is that the user requests to write to a type of storage (e.g. Dropbox) that is not supported by the enterprise. In that case the governance policy would relate to the connection received from the catalog.

The user creates a catalog entry ahead of time that has a connection to Dropbox
In the FybrikApplication, the user indicates the catalog entry and that he wants to write to it
The governance policy says that if the connection is of type Dropbox, then deny. (If we don’t know how to parse the connection to figure out it’s Dropbox, take a shortcut for now and just put a tag on the catalog entry)
Fybrik should receive deny from the governance engine

This is a scenario that resonated with many people in the past.



comments: I think that in a blog the IT admin creates the connections, and it would not make sense for him to create a connection that is unsupported by the enterprise. Right?

comments/questions (Flora):

1. Do we expect a data user to write a connection?
@flora177 If the data owner has an existing data asset, or wants to write one to a location of his choice, then he is the one who puts the connection in the data catalog, so yes.

2. Can we label a connection with "read only"?
@flora177 Why?

3. How do we write a policy regarding a connection?
@flora177 It's information that we receive from the data catalog. Why can't we write a policy for it? Or, as @revit13 suggested, to ease implementation, tag the data asset in the data catalog as dropbox.

Option 3

The story begins where the [ING and IBM blog](https://medium.com/fybrik/how-ing-and-ibm-are-collaborating-to-manage-enterprise-data-across-multiple-clouds-2f5d6d48963d) ends. There is a multi-cluster setup with one cluster in the Netherlands and one in Turkey. Eva (a data scientist) runs the notebook in the Netherlands. There is one object storage in Turkey and one in the Netherlands. A policy dictates writing data close to the computation. However, the storage in the Netherlands is full according to the storage purchase plan. (??? @simanadler are there other reasons not to use the Netherlands storage in this case? The implicit copy in the blog uses the Netherlands storage in a similar case, which makes it hard to find reasons not to use it.) Thus, when Eva wants the result of her calculation saved in a new table, containing for example the average amount for each type of transaction, she gets an "error" result. She asks Tim, the IT admin, to extend the storage plan in the Netherlands, and afterward she succeeds in writing her data.

Hmm, interesting idea but it's a bit complicated and requires two steps. I like option #2 better.

comments: the policy is not a governance policy but rather a hard config policy.

Not sure that is a big issue.

Option 4

Unrelated to ING and IBM blog

The scenario you probably want to demonstrate is:

a new asset has a sensitive field
Fybrik is deployed with storage in regions r1 and r2 (whatever their names are...)
A governance policy forbids writing sensitive data to regions r1 and r2

details:

There is a multi-cluster setup with one cluster in the Netherlands and one in Turkey. Eva (a data scientist) runs the notebook with sensitive data in the Netherlands. (??? no details on how the data was ingested?) She wants to save her computations in a new table. Object storage is available in Turkey and in Greece. A governance policy forbids writing sensitive data to the regions Turkey and Greece, and thus her request to write a new table is denied. Eva consults with Tim, the IT admin, who decides to allocate new storage in the Netherlands. Eva reapplies her request, which now succeeds.

The problem with this scenario is that it will cause a big discussion about how we know there is a sensitive field. It's an important scenario in my opinion, but we would need to figure out how to run something that identifies sensitive data before deciding on where to write it. I have thoughts about how we could do that (ex: store in temporary storage, run a module that checks for sensitive data, and based on the results and the policy decisions copies to a final location) but I think we should start with a simpler example first.

In short, I vote for option #2 - the dropbox example.

revit13 commented 1 year ago

A blog on the write flow was published: https://medium.com/fybrik/governing-the-writing-of-data-with-fybrik-d3ba44dbe260