flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0
5.45k stars 583 forks source link

[BUG] Using HashMethod annotation for numpy arrays fails #5630

Closed demmerichs closed 1 month ago

demmerichs commented 1 month ago

Describe the bug

Following your guide for caching of tasks based on non-literal inputs, but replacing the dataframe in the guide simply with an numpy array, leads to an error:

Error encountered while executing 'wf':
  Failed to convert outputs of task 'foo' at position 0:
  typing.Annotated[numpy.ndarray, <flytekit.core.hash.HashMethod object at 0xFFFFFFFFF>]'s metadata needs to be of type kwtypes.

I used the following hash function:

import hashlib
import numpy as np
def hash_array(arr: np.ndarray) -> str:
    return hashlib.sha1(arr.tobytes()).hexdigest()

and used it as

@task
def foo(  # noqa: F811
    a: int,
    b: str,  # noqa: F821
) -> Annotated[np.ndarray, HashMethod(hash_array)]:  # noqa: F821  # noqa: F821
    arr = np.array([a])  # noqa: F821
    return arr

Expected behavior

Successfully produce a caching of the bar task and a cache hit in subsequent runs.

Additional context to reproduce

I already stepped a bit through the flytekit code with a debugger and found its coming down to numpy arrays having a special transformer type logic in flytekit.

The hashing value is produced successfully internally, but a few code lines down the line hit the to_literal function of NumpyArrayTransformer, which seems to be unable to handle annotated types.

It seems dataframes from pandas don't get this special treatment. Maybe you could also clarify here in this issue for me, why we need or want a special treatment for np arrays when it is still moving the data to disk and only stores a filepath as meta data of the output. Isn't that how all compound data structures are handled by default, e.g. offloading them to the disk and only storing a file pointer?

Screenshots

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

welcome[bot] commented 1 month ago

Thank you for opening your first issue here! šŸ› 

runllm[bot] commented 1 month ago

The issue you're experiencing with using HashMethod for numpy arrays is likely due to the special handling of numpy arrays in Flytekit's type system. According to the Flytekit documentation, the HashMethod annotation is used to control memoization behavior for non-literal inputs like pandas dataframes. However, numpy arrays have a specialized transformer logic in Flytekit, which might not fully support annotated types.

To address this, you might need to implement a custom transformer for numpy arrays that can handle the HashMethod annotation. This involves extending the StructuredDatasetEncoder and StructuredDatasetDecoder classes to include your custom hashing logic.

Would you like more details on how to implement a custom transformer for numpy arrays in Flytekit?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

demmerichs commented 1 month ago

@RunLLM Sadly most of your comment is repeating my arguments. Regarding writing a custom transformer, I think that because flyte is already providing a special transformer for numpy arrays and logic for a hash method, this error should be resolved on fltye's side instead of asking users to fix it themselves with yet another custom transformer. But could you perhaps elaborate, why flyte has a customer transformer for numpy arrays in the first place?

runllm[bot] commented 1 month ago

Flyte has a custom transformer for numpy arrays to efficiently handle their serialization and deserialization, ensuring compatibility with Flyte's type system and storage mechanisms. This transformer, NumpyArrayTransformer, converts numpy arrays to a file format (e.g., .npy) and manages their storage and retrieval from remote locations.

The error you're encountering arises because the NumpyArrayTransformer does not currently support handling annotated types with HashMethod. This is a limitation in the current implementation.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!

demmerichs commented 1 month ago

@RunLLM How does flyte handle data structures with no specialized transformers like pandas dataframes compared to the customized handling of numpy arrays?

runllm[bot] commented 1 month ago

Flyte handles data structures without specialized transformers, like pandas dataframes, by using a more generic approach. For instance, pandas dataframes are typically serialized to a file format (e.g., Parquet) and stored in a remote location, with only the file path being stored as metadata. This is managed by the StructuredDataset class, which provides a flexible way to handle various data formats and storage backends.

In contrast, numpy arrays have a specialized transformer, NumpyArrayTransformer, which directly handles the serialization to .npy files and manages their storage and retrieval. This specialized handling ensures efficient processing and compatibility with Flyte's type system but currently lacks support for annotations like HashMethod.

Would you like more details on how to potentially modify the existing NumpyArrayTransformer to support HashMethod annotations?

Sources:


This is an AI-generated response and your feedback is appreciated! Please leave a šŸ‘ if this is helpful and šŸ‘Ž if it is not.

Mention @runllm to ask follow-up questions. See runllm.com for more!