demmerichs closed 1 month ago
Thank you for opening your first issue here!
The issue you're experiencing with using `HashMethod` for numpy arrays is likely due to the special handling of numpy arrays in Flytekit's type system. According to the Flytekit documentation, the `HashMethod` annotation is used to control memoization behavior for non-literal inputs like pandas dataframes. However, numpy arrays have specialized transformer logic in Flytekit, which might not fully support annotated types.
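For illustration, a content-based hash for a numpy array could look like the sketch below. It is only an assumption about what a reasonable hash function might be, not the function used in this issue: the name `hash_numpy_array` is hypothetical, and the choice to mix dtype and shape into the digest is a design decision made here so that arrays with identical bytes but a different layout do not collide.

```python
import hashlib

import numpy as np


def hash_numpy_array(arr: np.ndarray) -> str:
    """Hypothetical helper: hex digest over dtype, shape, and raw contents."""
    digest = hashlib.sha256()
    # dtype and shape are included so that reshaped or re-typed views of the
    # same buffer produce different cache keys.
    digest.update(str(arr.dtype).encode())
    digest.update(str(arr.shape).encode())
    # tobytes() returns the data in C order, independent of memory layout.
    digest.update(arr.tobytes())
    return digest.hexdigest()


a = np.arange(6, dtype=np.int64).reshape(2, 3)
b = a.copy()           # equal contents, separate buffer
c = a.reshape(3, 2)    # same bytes, different shape

same_content_matches = hash_numpy_array(a) == hash_numpy_array(b)
different_shape_matches = hash_numpy_array(a) == hash_numpy_array(c)
```

A function of this shape is what one would typically pass to an annotation such as `HashMethod`. Note that arrays with `object` dtype would need different handling, since their raw bytes are pointers rather than values.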
To address this, you might need to implement a custom transformer for numpy arrays that can handle the `HashMethod` annotation. This involves extending the `StructuredDatasetEncoder` and `StructuredDatasetDecoder` classes to include your custom hashing logic.
Would you like more details on how to implement a custom transformer for numpy arrays in Flytekit?
This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and a 👎 if it is not.
Mention @runllm to ask follow-up questions. See runllm.com for more!
@RunLLM Sadly, most of your comment repeats my own arguments. Regarding writing a custom transformer: because Flyte already provides a special transformer for numpy arrays and the logic for a hash method, I think this error should be resolved on Flyte's side instead of asking users to fix it themselves with yet another custom transformer. But could you perhaps elaborate on why Flyte has a custom transformer for numpy arrays in the first place?
Flyte has a custom transformer for numpy arrays to efficiently handle their serialization and deserialization, ensuring compatibility with Flyte's type system and storage mechanisms. This transformer, `NumpyArrayTransformer`, converts numpy arrays to a file format (e.g., `.npy`) and manages their storage and retrieval from remote locations.
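The general idea of offloading an array to a `.npy` file and passing around only a path can be sketched as follows. This is a simplified illustration using plain numpy and a local temporary directory, not Flytekit's actual implementation; the helper names `offload_array` and `load_array` are hypothetical.

```python
import os
import tempfile

import numpy as np


def offload_array(arr: np.ndarray, directory: str) -> str:
    """Hypothetical helper: write the array to disk and return its path."""
    path = os.path.join(directory, "array.npy")
    np.save(path, arr)
    return path


def load_array(path: str) -> np.ndarray:
    """Hypothetical helper: read the array back from the stored path."""
    return np.load(path)


with tempfile.TemporaryDirectory() as tmp:
    original = np.linspace(0.0, 1.0, 5)
    pointer = offload_array(original, tmp)   # only this string is passed on
    restored = load_array(pointer)           # fully read before cleanup
```

In a real system the path would point to remote storage rather than a temporary directory, and the pointer is what ends up in the task's output metadata.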
The error you're encountering arises because the `NumpyArrayTransformer` does not currently support handling annotated types with `HashMethod`. This is a limitation in the current implementation.
Would you like more details on how to potentially modify the existing `NumpyArrayTransformer` to support `HashMethod` annotations?
@RunLLM How does Flyte handle data structures with no specialized transformers, like pandas dataframes, compared to the customized handling of numpy arrays?
Flyte handles data structures without specialized transformers, like pandas dataframes, by using a more generic approach. For instance, pandas dataframes are typically serialized to a file format (e.g., Parquet) and stored in a remote location, with only the file path being stored as metadata. This is managed by the `StructuredDataset` class, which provides a flexible way to handle various data formats and storage backends.
In contrast, numpy arrays have a specialized transformer, `NumpyArrayTransformer`, which directly handles the serialization to `.npy` files and manages their storage and retrieval. This specialized handling ensures efficient processing and compatibility with Flyte's type system but currently lacks support for annotations like `HashMethod`.
Would you like more details on how to potentially modify the existing `NumpyArrayTransformer` to support `HashMethod` annotations?
Describe the bug
I followed your guide for caching tasks based on non-literal inputs, but simply replaced the dataframe in the guide with a numpy array. This leads to an error:
I used the following hash function:
and used it as
Expected behavior
The `bar` task is cached successfully, and subsequent runs produce a cache hit.
Additional context to reproduce
I have already stepped through the flytekit code a bit with a debugger and found that it comes down to numpy arrays having special transformer logic in flytekit.
The hash value is produced successfully internally, but a few lines further down the code hits the `to_literal` function of `NumpyArrayTransformer`, which seems to be unable to handle annotated types.
It seems pandas dataframes don't get this special treatment. Maybe you could also clarify in this issue why we need or want special treatment for numpy arrays when it still moves the data to disk and only stores a file path as metadata of the output. Isn't that how all compound data structures are handled by default, i.e. offloading them to disk and only storing a file pointer?
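To make the suspected failure mode concrete, here is a small standard-library sketch of how a type check can miss an `Annotated` type and how unwrapping it first avoids that. This is not Flytekit's code: `HashMarker`, `split_annotation`, `naive_accepts`, and `unwrapping_accepts` are hypothetical names, and `bytes` merely stands in for `np.ndarray` so the example has no third-party dependencies.

```python
from typing import Annotated, Any, Callable, Tuple, get_args, get_origin


class HashMarker:
    """Hypothetical stand-in for an annotation object such as HashMethod."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn


def split_annotation(tp: Any) -> Tuple[Any, tuple]:
    """Return (base type, metadata); metadata is empty for plain types."""
    if get_origin(tp) is Annotated:
        base, *metadata = get_args(tp)
        return base, tuple(metadata)
    return tp, ()


def naive_accepts(tp: Any, supported: type) -> bool:
    # Compares the declared type directly, so Annotated[...] never matches.
    return tp is supported


def unwrapping_accepts(tp: Any, supported: type) -> bool:
    # Strips the Annotated wrapper before comparing.
    base, _ = split_annotation(tp)
    return base is supported


annotated = Annotated[bytes, HashMarker(len)]
naive_result = naive_accepts(annotated, bytes)
unwrapped_result = unwrapping_accepts(annotated, bytes)
base_type, metadata = split_annotation(annotated)
```

Whether the real `to_literal` fails for exactly this reason is an assumption; the sketch only shows why code that expects a bare type may need an explicit unwrapping step.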
Screenshots
No response