dscc-admin-ch / lomas

Lomas: The data oases hidden behind the mist. A platform for confidential analysis of private data.
https://arxiv.org/abs/2406.17087
MIT License
15 stars 4 forks source link

app.py does not check user for get_dataset_metadata #127

Closed 0f4d0335 closed 2 months ago

0f4d0335 commented 3 months ago

Problem

When querying with

from lomas_client import Client
import numpy as np

APP_URL = "http://localhost"
USER_NAME = "LOMASTEST"
DATASET_NAME = "FSO_INCOME_SYNTHETIC"
client = Client(url=APP_URL, user_name = USER_NAME, dataset_name = DATASET_NAME)
print(client)
print("")
penguin_metadata = client.get_dataset_metadata()
print(penguin_metadata)

I'm expecting UnauthorizedAccessException

            f"User {user_name} does not exist. "
            + "Please, verify the client object initialisation.",

But the result is a pass.

mongodb                    | {"t":{"$date":"2024-07-03T19:12:52.323+00:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn6","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-256","speculative":true,"principalName":"user","authenticationDatabase":"defaultdb","remote":"172.18.0.4:56232","extraInfo":{}}}
lomas_server_dev           | INFO:     172.18.0.1:32942 - "POST /get_dataset_metadata HTTP/1.1" 200 OK

When I change the database, however, an InvalidQueryException is caught.

lomas_server_dev           | 2024-07-03 19:17:11,649 - INFO -                 [error_handler.py:87 - invalid_query_exception_handler()] - InvalidQueryException raised: Dataset FSO_INCOME_SYNTHETIC-2 does not exists. Please, verify the client object initialisation.
lomas_server_dev           | INFO:     172.18.0.1:56470 - "POST /get_dataset_metadata HTTP/1.1" 400 Bad Request

After adding some print statements, it appears that dataset_must_exist is called

 @functools.wraps(func)
    def wrapper_decorator(
        self, *args: argparse.Namespace, **kwargs: Dict[str, str]
    ) -> None:
        dataset_name = args[0]
        print("test")
        if not self.does_dataset_exist(dataset_name):
            raise InvalidQueryException(
                f"Dataset {dataset_name} does not exists. "
                + "Please, verify the client object initialisation.",
            )
        return func(self, *args, **kwargs)

    return wrapper_decorator

and

    @dataset_must_exist
    def get_dataset_metadata(self, dataset_name: str) -> dict:
        """Returns the metadata dictionnary of the dataset.

        Wrapped by :py:func:`dataset_must_exist`.

        Args:
            dataset_name (str): name of the dataset to get the metadata

        Returns:
            dict: The metadata dict.
        """
        print("3")
        metadatas = self.db.metadata.find_one(
            {dataset_name: {"$exists": True}}
        )
        return metadatas[dataset_name]  # type: ignore

but user_must_exist is not called with the client query.


When I try the function call get_initial_budget, I get the expected error with the created user and Alice.

lomas_server_dev           | test
lomas_server_dev           | 3
lomas_server_dev           | INFO:     172.18.0.1:33734 - "POST /get_dummy_dataset HTTP/1.1" 200 OK
lomas_server_dev           | test2
lomas_server_dev           | 2024-07-03 20:12:36,655 - INFO -                 [error_handler.py:110 - unauthorized_access_exception_handler()] - UnauthorizedAccessException raised: User LOMASTESTTESTTEST does not exist. Please, verify the client object initialisation.
lomas_server_dev           | INFO:     172.18.0.1:35560 - "POST /get_initial_budget HTTP/1.1" 403 Forbidden
lomas_server_dev           | test2
lomas_server_dev           | 2024-07-03 20:13:07,163 - INFO -                 [error_handler.py:110 - unauthorized_access_exception_handler()] - UnauthorizedAccessException raised: Alice does not have access to FSO_INCOME_SYNTHETIC.
lomas_server_dev           | INFO:     172.18.0.1:46312 - "POST /get_initial_budget HTTP/1.1" 403 Forbidden

Proposed Resolution

When checking name of dataset, also check if user should access in admin_database.py

    @abstractmethod
    @dataset_must_exist
    def get_dataset_field(self, dataset_name: str, key: str) -> str:
        """
        Get dataset field type based on dataset name and key

        Wrapped by :py:func:`dataset_must_exist`.

        Args:
            dataset_name (str): Name of the dataset.
            key (str): Key for the value to get in the dataset dict.

        Returns:
            str: The requested value.
        """
PaulineMauryL commented 3 months ago

Hi @0f4d0335 thank you for the suggestion! It does not check the user because metadata are considered 'public information' (it is not a DP release and does not consume budget) and thus we assume anyone can check the metadata (even if they do not have access to do requests to the private data of the dataset in question). So the dataset must exist (otherwise there would not be any metadata) but the user is not a blocker here.

0f4d0335 commented 3 months ago

Hi @PaulineMauryL, understood. The metadata for the tables has been considered "public information." In the design it's assumed that anyone can query this. As a response, I will consider the GDPR and ISO for international standards for security architecture.

According to GDPR (General Data Protection Regulation) and ISO standards (specifically ISO/IEC 27001 and ISO/IEC 27002), the access to metadata of tables holding Personally Identifiable Information (PII) should be restricted and controlled based on the principle of data minimization and the need-to-know basis.

User check at query require the minimum identify requirement (currently there's no password authentication taken by the server). Therefore we assume shared credential to both the user-checked query and the unchecked query (blocked/unblocked).

Screenshot from 2024-07-04 08-20-58

I wrote out the two threats no password and lack of separation for account access. When deployed in an organization, Lomas relies on general trust that the users will not abuse their access to the server. However, additional controls can ensure organizational compliance to GDPR and ISO through the use of username and password checks. Lomas provides an identity check. Lomas should provide that identity check for all functions, so role-based controls can be enforced by organizations using Lomas.

If app.py could check the username for each query (even the metadata of tables that are considered "public information"), then role-based enforcement controls can be made by administrators to meet compliance and regulation requirements as they change. Today it may be fine for all users of the organization to see metadata, but in the future, it may not. Following the GDPR principle of data minimization and need-to-know separation of duties, it leads me to recommend that users are checked when database is checked from the client.

Internally app.py may need a separate check for metadata that the client cannot see without authorizing. This function should not leak data to the client.

PaulineMauryL commented 3 months ago

Hi @0f4d0335 thank you so much for the feedback and this explanation.

We will implement this additional check for the metadata (and other places where it is relevant).

Please, do not hesitate to tell us if you see anything else! It is highly appreciated.