emdgroup / foundry-dev-tools

Foundry DevTools
https://emdgroup.github.io/foundry-dev-tools/
Apache License 2.0
115 stars · 23 forks

[Bug]: FileNotFoundError for existing dataset on windows only #84

Open JWuerfel opened 1 week ago

JWuerfel commented 1 week ago


Description of the bug

Loading a certain dataset fails on Windows. It only fails for a specific dataset, which is not larger than other datasets that can be downloaded (and cached).

The path in the error message exists/is created for the dataset in the cache folder up to spark, but the folder spark is empty. I have tried deleting the folder for the dataset as well as the entire cache but it does not change anything.

The error occurs whether the input path or RID is specified for the input.

Another user has had the same error in the past (also on windows) probably with a different dataset (although I have no more details).

Steps to reproduce this bug.

As it doesn't fail for all datasets and I don't know what makes this particular dataset different, it's difficult to describe, but here is the code that fails for it.

from transforms.api import Input, Output, transform_df

@transform_df(
    Output("output_path"),
    data=Input("input_path"),
)
def compute(data):
    # Identity transform: return the input dataset unchanged.
    return data

if __name__ == "__main__":
    # foundry-dev-tools downloads and caches the input dataset locally,
    # then executes the transform; this is where the error is raised.
    df = compute.compute()
    df.show(100)

Log output

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\username\.foundry-dev-tools\.cache\foundry-dev-tools\dataset-RID\ri.foundry.main.transaction.1234.parquet\spark\part-1234.snappy.parquet'
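One known Windows-specific cause of a FileNotFoundError on a path whose parent folders exist is the legacy 260-character MAX_PATH limit. The path in the log above is redacted ("username", "1234"), so it is short here, but real Foundry dataset RIDs and transaction RIDs contain long UUIDs, which can push the full cache path past the limit. A minimal diagnostic sketch, using the redacted path from the log as a stand-in:

```python
# Diagnostic sketch (assumption, not a confirmed root cause): check whether
# the failing cache path could be hitting Windows' legacy 260-character
# MAX_PATH limit, which raises FileNotFoundError even when the enclosing
# folders exist. The path below is the redacted one from the log output;
# substitute the real path from your own traceback.
from pathlib import Path

cache_path = Path(
    r"C:\Users\username\.foundry-dev-tools\.cache\foundry-dev-tools"
    r"\dataset-RID\ri.foundry.main.transaction.1234.parquet"
    r"\spark\part-1234.snappy.parquet"
)

print(len(str(cache_path)))        # compare against the 260-char limit
print(cache_path.parent.exists())  # does the spark/ folder exist?
print(cache_path.exists())         # does the parquet file itself exist?
```

If the real path length is near or above 260, shortening the cache location (e.g. moving `.foundry-dev-tools` closer to the drive root) or enabling long-path support in Windows would be worth testing.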

Additional context

No response

Operating System

Windows

Your python version

3.10.15

jonas-w commented 5 days ago

Could you elaborate what you mean with "It only fails for a specific dataset which [...] is not larger than other datasets that can be downloaded (and cached)."

JWuerfel commented 3 days ago

The same code works perfectly with other datasets; it's just this one that fails for me. We wondered if the dataset might be too large, but it works fine with bigger ones. This is the dataset I can't use: