emdgroup / foundry-dev-tools

Foundry DevTools
https://emdgroup.github.io/foundry-dev-tools/
Apache License 2.0
115 stars · 23 forks

[Bug]: FileNotFoundError for existing dataset on windows only #84

Open JWuerfel opened 1 week ago

JWuerfel commented 1 week ago


Description of the bug

Loading a certain dataset fails on Windows. It only fails for a specific dataset, which is not larger than other datasets that can be downloaded (and cached).

The path in the error message exists/is created for the dataset in the cache folder up to spark, but the folder spark is empty. I have tried deleting the folder for the dataset as well as the entire cache but it does not change anything.

The error occurs whether the input path or RID is specified for the input.

Another user has had the same error in the past (also on windows) probably with a different dataset (although I have no more details).

Steps to reproduce this bug.

As it doesn't fail for all datasets and I don't know what makes this particular dataset different, it's difficult to describe, but here is the code that fails for it.

from transforms.api import Input, Output, transform_df

@transform_df(
    Output("output_path"),
    data=Input("input_path"),
)
def compute(data):
    # Identity transform: return the input dataset unchanged.
    return data

if __name__ == "__main__":
    # foundry-dev-tools downloads and caches the input dataset locally,
    # then executes the transform; this is where the error is raised.
    df = compute.compute()
    df.show(100)

Log output

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\username\.foundry-dev-tools\.cache\foundry-dev-tools\dataset-RID\ri.foundry.main.transaction.1234.parquet\spark\part-1234.snappy.parquet'
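One known Windows-specific cause of a FileNotFoundError on a path whose parent folders exist is the legacy 260-character MAX_PATH limit. The path in the log above is redacted ("username", "1234"), so it is short here, but real Foundry dataset RIDs and transaction RIDs contain long UUIDs, which can push the full cache path past the limit. A minimal diagnostic sketch, using the redacted path from the log as a stand-in:

```python
# Diagnostic sketch (assumption, not a confirmed root cause): check whether
# the failing cache path could be hitting Windows' legacy 260-character
# MAX_PATH limit, which raises FileNotFoundError even when the enclosing
# folders exist. The path below is the redacted one from the log output;
# substitute the real path from your own traceback.
from pathlib import Path

cache_path = Path(
    r"C:\Users\username\.foundry-dev-tools\.cache\foundry-dev-tools"
    r"\dataset-RID\ri.foundry.main.transaction.1234.parquet"
    r"\spark\part-1234.snappy.parquet"
)

print(len(str(cache_path)))        # compare against the 260-char limit
print(cache_path.parent.exists())  # does the spark/ folder exist?
print(cache_path.exists())         # does the parquet file itself exist?
```

If the real path length is near or above 260, shortening the cache location (e.g. moving `.foundry-dev-tools` closer to the drive root) or enabling long-path support in Windows would be worth testing.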

Additional context

No response

Operating System

Windows

Your python version

3.10.15

jonas-w commented 5 days ago

Could you elaborate what you mean with "It only fails for a specific dataset which [...] is not larger than other datasets that can be downloaded (and cached)."

JWuerfel commented 3 days ago

The same code works perfectly with other datasets; it's just this one that fails for me. We wondered if the dataset might be too large, but it works fine with bigger ones. This is the dataset I can't use: