argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[BUG-python] `FloatMetadataProperty`: value is not a valid `float` when it is a rounded integer #4570

Closed sdiazlor closed 8 months ago

sdiazlor commented 9 months ago

Describe the bug: Using distilabel's to_argilla method raises the following error regarding metadata and text-descriptives.

ValueError: `FeedbackRecord.metadata` {'length-input': 19, 'length-instructions': 106, 'generation-model': 
'gpt-3.5-turbo', 'input_n_tokens': 2, 'input_n_unique_tokens': 2, 'input_n_sentences': 1, 'input_perplexity': 1, 
'input_entropy': 0, 'input_flesch_reading_ease': -48.99, 'instructions_n_tokens': 18, 
'instructions_n_unique_tokens': 18, 'instructions_n_sentences': 1, 'instructions_perplexity': 1.79, 
'instructions_entropy': 0.58, 'instructions_flesch_reading_ease': 61.67} does not match the expected schema, with 
exception: 2 validation errors for MetadataSchema
input_perplexity
  value is not a valid float (type=type_error.float)
input_entropy
  value is not a valid float (type=type_error.float)

Expected behavior: I would expect this check to pass, given that any integer is a rounded float.
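As a possible client-side workaround while the validation stays as it is, integer values can be cast to float before the records are built. This is only a minimal sketch: `cast_float_metadata` is a hypothetical helper and not part of argilla or distilabel, and the list of float keys is an assumption taken from the error message above.

```python
from typing import Any, Dict, Iterable


def cast_float_metadata(metadata: Dict[str, Any], float_keys: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `metadata` where integer values under `float_keys` become floats.

    Hypothetical helper, not an argilla or distilabel API.
    """
    keys = set(float_keys)
    casted = {}
    for key, value in metadata.items():
        # bool is a subclass of int in Python, so it is excluded explicitly.
        if key in keys and isinstance(value, int) and not isinstance(value, bool):
            casted[key] = float(value)
        else:
            casted[key] = value
    return casted


# Values copied from the error message; the choice of float keys is an assumption.
metadata = {
    "input_n_tokens": 2,
    "input_perplexity": 1,
    "input_entropy": 0,
    "input_flesch_reading_ease": -48.99,
}
fixed = cast_float_metadata(
    metadata,
    ["input_perplexity", "input_entropy", "input_flesch_reading_ease"],
)
print(fixed)
```

Keys that are not listed, such as input_n_tokens, keep their integer type, so integer metadata properties are not affected.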


dosubot[bot] commented 9 months ago

🤖

The error you're encountering is due to the input_perplexity and input_entropy values in FeedbackRecord.metadata not being valid floats. This could be due to a few reasons:

  1. The values are not provided: In the _NumericMetadataPropertySchema class, the min and max values are optional. If these values are not provided, the _value_in_bounds method will raise a ValueError if the provided_value is not within the bounds of min and max.

  2. The values are not of the correct type: The _check_allowed_value_type method checks if the provided value is of type int or float. If it is not, a TypeError is raised.

  3. The values are NaN: The _check_nan method checks if the provided value is NaN (Not a Number). If it is, a ValueError is raised.

To fix this issue, ensure that the input_perplexity and input_entropy values in FeedbackRecord.metadata are valid floats and are within the bounds of min and max if they are provided. Also, make sure that these values are not NaN.

You can find more details in the metadata.py file.

As for the expected data type for input_perplexity and input_entropy in the FeedbackRecord.metadata schema, I wasn't able to find specific information within the repository. It's possible that this information is available elsewhere or I may have missed it. If you have more information about the expected data types or if there's a specific part of the code you're working with that could provide more context, please share it. This will help me provide a more accurate answer.


davidberenstein1957 commented 9 months ago

@frascuchon, this might also be the case for IntMetaDataProperties when they receive floats rounded to zero decimals (1.0, 2.0, etc.).
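The reverse case mentioned above can be sketched in the same way: a float with no fractional part is converted to an int before it reaches an integer property. `to_int_if_whole` is a hypothetical helper for illustration, not an argilla function.

```python
from typing import Any


def to_int_if_whole(value: Any) -> Any:
    """Convert floats such as 1.0 or 2.0 to int; leave every other value untouched.

    Hypothetical helper, not an argilla API.
    """
    # float.is_integer() is False for NaN and infinity, so those pass through unchanged.
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


whole = to_int_if_whole(2.0)
fractional = to_int_if_whole(1.79)
print(type(whole).__name__, type(fractional).__name__)
```

Values with a fractional part are returned unchanged, so they would still fail an integer check, which seems like the desired outcome.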

frascuchon commented 8 months ago

Thanks @sdiazlor and @davidberenstein1957. If this is a client validation error, it is probably related to pydantic's behaviour. Maybe @alvarobartt can take a look at it; he can tackle this better than me.

alvarobartt commented 8 months ago

May I ask how the metadata_properties for text-descriptives are defined? The value is expected to be a float, a Python type to which an integer converts cleanly (float(1) -> 1.0), so I cannot see the issue from the information provided. Could you answer the question above and share any other information that might be useful? For example, is the issue only within the to_argilla method of distilabel, or also in the argilla Python package?

AFAIK @davidberenstein1957 worked on that integration, and I'm not really aware of the text-descriptives details. Could you double-check, @davidberenstein1957? Thanks
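To illustrate why float(1) -> 1.0 and the reported error are not contradictory, here is a plain-Python sketch of the difference between a lenient check that coerces and a strict check that only accepts float instances. Both functions are hypothetical stand-ins; whether the client actually validates strictly is an assumption, not something confirmed in this thread.

```python
def lenient_float(value):
    """Coerce the value, so the integer 1 becomes 1.0."""
    return float(value)


def strict_float(value):
    """Accept only real float instances, so the integer 1 is rejected."""
    if not isinstance(value, float):
        raise TypeError("value is not a valid float")
    return value


print(lenient_float(1))

try:
    strict_float(1)
except TypeError as error:
    print(f"strict check rejected 1: {error}")
```

If the schema behaves like the strict variant, integers such as the input_perplexity and input_entropy values from the report would be rejected even though they convert to float without loss.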

davidberenstein1957 commented 8 months ago

@sdiazlor worked on this integration; I checked. I think the issue originates in the text-descriptives integration, but it is two-fold.