dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.14k stars 1.4k forks

"/" in partition key causes unexpected issues #21333

Open daviddemeij opened 4 months ago

daviddemeij commented 4 months ago

Dagster version

dagster, version 1.5.14

What's the issue?

When we tried using a /vsigs/ path (e.g. /vsigs/bucket_name/obj_a) as a partition key, this resulted in an unexpected, hard-to-debug error (shortened):

dagster._core.errors.DagsterExecutionHandleOutputError: Error occurred while handling output "result" of step "asset_a":
...
The above exception was caused by the following exception:
PermissionError: [Errno 13] Permission denied: '/vsigs'
...
FileNotFoundError: [Errno 2] No such file or directory: '/vsigs/bucket_name/'
...
The above exception occurred during handling of the following exception:
FileNotFoundError: [Errno 2] No such file or directory: '/vsigs/bucket_name/objects/'
...
The above exception occurred during handling of the following exception:
FileNotFoundError: [Errno 2] No such file or directory: '/vsigs/bucket_name/objects/a'

What did you expect to happen?

I expected the partitions to be created. Alternatively, a clearer error could be raised to make it obvious that you should not use / in partition key names.
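The clearer error asked for here could look roughly like the sketch below. It is only an illustration: `validate_partition_key` is a hypothetical helper, not part of Dagster's API, and the message wording is made up.

```python
def validate_partition_key(key: str) -> str:
    """Hypothetical upfront check: reject keys that contain a path separator."""
    if "/" in key:
        raise ValueError(
            f"Invalid partition key {key!r}: '/' is not allowed because the key "
            "may end up being used as part of a storage path."
        )
    return key
```

Calling such a check when the partitions are defined would surface the problem before any output is written, instead of as a nested filesystem error during materialization.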

How to reproduce?

Create a partition whose partition key includes a / (e.g. /vsigs/bucket/objects/a).

Deployment type

Local

Deployment details

No response

Additional information

I've solved this by replacing the / with a placeholder.

Ideally, it would not be possible to use a / in a partition name at all. As I understand it, the / is used as a delimiter for multi-level partitions. From my colleague:

The problem seems to be slashes in partition key names, which I think makes sense because multi-partition key defs, for example, are often called as / (i.e. the slash is a delimiter in the partition key representation of Dagster).

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

jamiedemaria commented 4 months ago

hey @daviddemeij sorry for the delayed response here. What I/O manager are you using? it seems like the underlying issue is that the I/O manager doesn't have permission to create directories in whatever filesystem you are trying to store your data. I tried to replicate with the default I/O manager, but was able to successfully materialize the asset

daviddemeij commented 4 months ago

> hey @daviddemeij sorry for the delayed response here. What I/O manager are you using? it seems like the underlying issue is that the I/O manager doesn't have permission to create directories in whatever filesystem you are trying to store your data. I tried to replicate with the default I/O manager, but was able to successfully materialize the asset

I was just using the default I/O manager.

I probably did not explain the issue properly. The issue is that when you add a / to a partition key, the key gets interpreted as a path, whereas I just wanted to use a reference to an external path as the partition key.

The partition key was referring to a /vsigs/ path (a path on GCS), but the IO manager was trying to access this path locally (which it obviously can't).
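One plausible explanation for the local access, assuming the I/O manager builds its output location by joining a storage directory with the partition key (this has not been verified against Dagster's source): in Python's `pathlib`, joining with an absolute path discards everything before it. The base directory and the slash-free key below are made up for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical storage directory for the asset; not taken from the issue.
base = PurePosixPath("/tmp/dagster_storage/asset_a")

absolute_key = "/vsigs/bucket_name/objects/a"
flat_key = "obj_a"

# An absolute right-hand operand replaces the base entirely.
joined_absolute = base / absolute_key
# A key without slashes stays inside the base directory.
joined_flat = base / flat_key

print(joined_absolute)
print(joined_flat)
```

If the path is built this way, the write targets `/vsigs/...` on the local filesystem, which would match the `Permission denied: '/vsigs'` error in the traceback.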

I solved the issue by replacing "/" with "2%F" in the partition keys.
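A reversible version of this workaround can be sketched with the standard library. The comment writes the placeholder as "2%F"; the sketch assumes standard percent-encoding instead, where "/" becomes "%2F", and the helper names are hypothetical. Whether Dagster accepts "%" in partition keys in every context is not confirmed here.

```python
from urllib.parse import quote, unquote


def encode_partition_key(path: str) -> str:
    """Percent-encode a path so the resulting key contains no '/'."""
    # safe="" makes quote() encode '/' as well (it is left alone by default).
    return quote(path, safe="")


def decode_partition_key(key: str) -> str:
    """Recover the original path from an encoded partition key."""
    return unquote(key)


key = encode_partition_key("/vsigs/bucket_name/objects/a")
print(key)
print(decode_partition_key(key))
```

Inside the asset, the original /vsigs/ path can then be recovered by decoding the partition key before reading from GCS.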

jamiedemaria commented 4 months ago

> the IO manager was trying to access this path locally (which it obviously can't)

Is this because your assets are actually being stored somewhere else?

daviddemeij commented 4 months ago

> > the IO manager was trying to access this path locally (which it obviously can't)
>
> Is this because your assets are actually being stored somewhere else?

No. If I use a path that starts with /vsigs/ as a partition key, it doesn't create the asset at all. I get an error like:

OSError: [Errno 30] Read-only file system: '/vsigs'

Another solution I found is to remove the first part of the file path (/vsigs/bucket_name/) from the partition key.
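This prefix-stripping approach might look like the sketch below. The helper names are hypothetical, and the prefix is the one quoted in this thread. Note that the resulting key (e.g. objects/a) still contains a slash; it is merely relative rather than absolute.

```python
PREFIX = "/vsigs/bucket_name/"  # prefix as quoted in the thread


def path_to_partition_key(path: str) -> str:
    """Drop the fixed prefix so the key is no longer an absolute path."""
    if not path.startswith(PREFIX):
        raise ValueError(f"Expected path to start with {PREFIX!r}, got {path!r}")
    return path[len(PREFIX):]


def partition_key_to_path(key: str) -> str:
    """Rebuild the full external path from a partition key."""
    return PREFIX + key


key = path_to_partition_key("/vsigs/bucket_name/objects/a")
print(key)
print(partition_key_to_path(key))
```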