PrefectHQ / prefect

Prefect is a workflow orchestration framework for building resilient data pipelines in Python.
https://prefect.io
Apache License 2.0
17.63k stars 1.65k forks source link

Feature Request: Masking of SecretStr Parameters in Flow Run Parameters UI #11679

Closed taylor-curran closed 9 months ago

taylor-curran commented 10 months ago

First check

Prefect Version

Version:             2.14.15
API version:         0.8.4
Python version:      3.11.7
Git commit:          e585e980
Built:               Thu, Jan 11, 2024 6:18 PM
OS/Arch:             darwin/arm64
Profile:             default
Server type:         cloud

Describe the current behavior

Currently, in Prefect 2 with the upgrade to pydantic v2, SecretStr parameters are not being masked in the flow run parameters within the UI. This leads to the exposure of sensitive data in the UI, which can be a significant risk for users handling confidential information.

Currently only the subflow secret parameters are obscured.

Describe the proposed behavior

The proposed behavior is to implement a feature where SecretStr parameters are automatically masked in the flow run parameters UI. This masking should apply consistently across both pydantic v1 and v2. The goal is to prevent accidental exposure of sensitive data in the UI, enhancing the security and privacy of flow run parameters.

Reproduction steps to see the change of behavior from Pydantic V1 to V2:

With Pydantic V1

my_flow.py

from pydantic import SecretStr  # pip install 'pydantic<2'

from prefect import flow

@flow
def subflow(secret: SecretStr) -> None:
    print(secret.get_secret_value())

@flow(log_prints=True)
def accepts_secret(secret: SecretStr) -> None:
    subflow(secret=secret)

if __name__ == "__main__":
    # accepts_secret.serve("Accepts Secret with Pydantic V2")
    accepts_secret.serve("Accepts Secret with Pydantic V1")
conda create --name pyd_v1 python=3.11
conda activate pyd_v1
pip install "pydantic<2"
pip install prefect
python my_flow.py

Flow Run URL Parent Flow image Sub Flow ❗ Value Obscured ✅ image

With Pydantic V2

my_flow.py

from pydantic import SecretStr 

from prefect import flow

@flow
def subflow(secret: SecretStr) -> None:
    print(secret.get_secret_value())

@flow(log_prints=True)
def accepts_secret(secret: SecretStr) -> None:
    subflow(secret=secret)

if __name__ == "__main__":
    accepts_secret.serve("Accepts Secret with Pydantic V2")
    # accepts_secret.serve("Accepts Secret with Pydantic V1")
conda create --name pyd_v2 python=3.11
conda activate pyd_v2
pip install prefect
python my_flow.py

Flow Run URL Parent Flow image

Sub Flow ❗ Value NOT obscured image

Example Use

An example use case would be when a user is running a flow that requires sensitive information. In the current behavior, these sensitive details are displayed in plain text in the UI. The proposed change would mask these details, similar to how passwords are masked in form inputs, ensuring that sensitive information is not exposed to anyone viewing the flow run details.

Additional context

This masking feature for SecretStr parameters was observed as an unintended behavior in earlier versions but was found to be beneficial by many users.

Related PR:

11656

jpidugu commented 10 months ago

Would love to see this implemented quickly if possible! Adding an example where this surfaces in case it's helpful:

Our kubernetes-based workflow consists of:

(a) a deployment that spins up a parent_flow in a k8s pod, where (b) the parent_flow represents one batch of work, and kicks off multiple instances of child_flow and waits for them to complete processing a set of entities.

We set things up in this way in order to reach our desired configuration that allows the system to scale performantly while best utilizing resources (we efficiently use the resources allocated to the pods, and don't need to scale the nodes in the cluster an unreasonable amount to avoid kubernetes pod-per-node limits).

We avoided sending sensitive data in the parent_flow parameters (by passing in essentially pointers to the sensitive data), to avoid things going over the wire when triggering the deployment. However, since each parent_flow is responsible for triggering a child_flow (that executes within the same pod), we decided that since the pod is local to our ecosystem, it would be safe to pass in resolved values (which includes sensitive strings) to child_flow. To illustrate:

@flow
async def parent_flow(s3_path, entity_start, entity_end):
    subflows = [
        child_flow(
            entity_id=s3_path[current_entity].get("id") # we do this to minimize s3 lookups
            for current_entity in range(entity_start, entity_end)
    ]

    await asyncio.gather(*subflows)

@flow
def child_flow(entity_id):  # this is now getting visualized in the UI
    do_some_work_with.submit(entity_id)

As shown in @taylor-curran's example above, we used to see entity_id as **********, and now see the sensitive value being unmasked. Changing the implementation here would actually have a bit of a drastic effect, since we would likely pay a significant overhead to use the pointer approach in the subflow.

zhen0 commented 10 months ago

Thanks for the issue @taylor-curran and the extra detail @jpidugu. Our strong recommendation is that any secret values should use a secret block. Although obfuscation may be a convenience, we do not want to give a false sense of security for something that does not offer the same level of security as a secret block does.