data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Limiting Catalog search results based on Organizations #965

Open sandeephs1 opened 8 months ago

sandeephs1 commented 8 months ago

In the catalog, we want search results to be limited to the organization the user belongs to.

Here is the scenario. The table below lists each team, user, environment, and associated organization:

Team | User | Environment | Organization
-- | -- | -- | --
mgm_usteam | mgm_u1 | mgm-music-us | mgm
mgm_euteam | mgm_eu2 | mgm-sports-eu | mgm
alexa_team | x_u1 | alexa-android | alexa

mgm and alexa are the two organizations. mgm has 2 environments, 8 datasets, and 8 tables; alexa has 2 environments, 6 datasets, and 6 tables. The table below lists the datasets and tables:

Organization | Environment | Dataset | Tables
-- | -- | -- | --
mgm | mgm-music-us | mgm_music_eng | mgm_music_eng_alltimebest
mgm | mgm-music-us | mgm_music_eng | mgm_music_eng_2023
mgm | mgm-music-us | mgm_music_esp | mgm_music_restofesp
mgm | mgm-music-us | mgm_music_esp | mgm_music_esp_best23
mgm | mgm-sports-eu | mgm-sports-cricket | mgm-sports-cric_worldcup
mgm | mgm-sports-eu | mgm-sports-cricket | mgm-sports-cric_ipl
mgm | mgm-sports-eu | mgm-sports-football | mgm-sports-fb_wc
mgm | mgm-sports-eu | mgm-sports-football | mgm-sports-fb_laliga
alexa | alexa-android | alexa-android-jp | alx_anrd_jp_mgm
alexa | alexa-android | alexa-android-jp | alx_anrd_jp_yt
alexa | alexa-android | alexa-android-it | ale_droid_it_mgm
alexa | alexa-android | alexa-android-it | ale_droid_it_yt
alexa | lexa-wear | wear-os-events | events_music
alexa | lexa-wear | wear-os-sensor | sensor_sports

When user 'mgm_u1' searches, the returned results should come only from mgm's objects (8 datasets + 8 tables), narrowed further by the search condition. The user should not be shown the datasets and tables associated with 'alexa'.

Similarly, the alexa user 'x_u1' should not be shown 'mgm' objects.

How can this be achieved?

dlpzx commented 8 months ago

Hi @sandeephs1 this is a cool feature :) Basically you want a filtered version of the catalog based on the user's organizations. What is the motivation behind the feature? Do you need to restrict access to the metadata, or is it more of a usability problem?

sandeephs1 commented 8 months ago

Hi @dlpzx We have multiple products, and our customers can subscribe to any of them.

customers -> Organization in Data.All
product -> Environment in Data.All

Since multiple customers will be onboarded onto Data.All, we need to meet data governance requirements. Currently, Catalog search returns all matching datasets and tables irrespective of the organization the user belongs to, which would amount to a data breach.

We want to avoid this by restricting catalog search results based on the user-organization relationship. In effect, this is a multi-tenancy feature.

dlpzx commented 8 months ago

Hi @sandeephs1 thanks for the quick response! Based on your requirements, these are the high-level changes that we would need to implement:

  1. Add a new field "organizationUri" to the data catalog; this requires modifying the mappings of the OpenSearch index
  2. Add the "organizationUri" information to each item that is added to the catalog
  3. Backfill existing items (we need to investigate the best approach)
  4. Modify the search API calls to filter by the user's organizations (list of "organizationUri")
  5. Introduce a configuration parameter to enable or disable "organization_data_catalog_isolation" and make this feature configurable in the code, probably only the part that limits the search API calls.
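
As a rough illustration of steps 1, 4 and 5, here is a minimal sketch of how the search request body could be restricted to the user's organizations. It only builds plain dictionaries in the OpenSearch query DSL and does not reflect data.all's actual code: the function `build_catalog_query`, its parameters, the `keyword` mapping type, and the use of `multi_match` over all fields are assumptions made for illustration. Only the field name "organizationUri" and the idea of an on/off switch come from the list above.

```python
from typing import Any, Dict, List

# Field name taken from step 1 above.
ORG_FIELD = "organizationUri"

# Step 1 (assumed shape): mapping fragment adding the new field to the index.
# A `keyword` type is assumed here so the URI is matched exactly, not analyzed.
ORG_MAPPING: Dict[str, Any] = {"properties": {ORG_FIELD: {"type": "keyword"}}}


def build_catalog_query(
    search_text: str,
    user_organization_uris: List[str],
    isolation_enabled: bool = True,
) -> Dict[str, Any]:
    """Build a search body, optionally limited to the user's organizations.

    `isolation_enabled` stands in for the proposed
    "organization_data_catalog_isolation" switch (step 5); how that
    parameter would really be read from configuration is not shown.
    """
    bool_query: Dict[str, Any] = {
        # Hypothetical free-text match across all indexed fields.
        "must": [{"multi_match": {"query": search_text, "fields": ["*"]}}]
    }
    if isolation_enabled:
        # Step 4: keep only items whose organizationUri is in the user's list.
        bool_query["filter"] = [{"terms": {ORG_FIELD: user_organization_uris}}]
    return {"query": {"bool": bool_query}}


if __name__ == "__main__":
    # Hypothetical URI value, for illustration only.
    print(build_catalog_query("music", ["org-uri-mgm"]))
```

With isolation enabled, a user with an empty organization list gets a `terms` filter over no values, which matches nothing; that seems like the safer default for a governance feature, but it is a design choice to confirm.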

Do you have the bandwidth to implement this feature? We are happy to provide guidance and coaching throughout the process. We will consider it in our roadmap, but other features might be prioritized

sandeephs1 commented 8 months ago

Thanks @dlpzx, we were thinking along similar lines, and your input is definitely helpful. We will implement this feature and update you.

dlpzx commented 7 months ago

Updates [offline discussion]

@sandeephs1 and his team have a proposed implementation in which:

Remarks from data.all team

First of all, we want to highlight that data.all was designed to work in a single-tenancy scenario. The most secure way to achieve multi-tenancy would be to deploy data.all multiple times, once for each tenant.

But we understand and are interested in the multi-tenancy scenario that you present. We cannot implement this feature in the next couple of weeks, but we can work together on designs and are happy to guide you to contribute back. If the feature is part of the open-source repository we will manage and take ownership of bugs, issues and enhancements.

Here are some remarks that we have pointed out during our internal discussions:

Let's keep on working on this!