argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

[BUG-python/deployment] `KeyError: 'text'` when running `metric.compute("alpha")` on a `FeedbackDataset` #5123

Closed: mpjuhasz closed this issue 1 month ago

mpjuhasz commented 3 months ago

Describe the bug
I'm trying to run the agreement metrics on a FeedbackDataset. When running metric.compute("alpha"), I get the error below; my dataset has no "text" field. Updating the failing line (argilla/client/feedback/metrics/agreement_metrics.py:112) to

question_text = json.loads(row["metadata"])["q_id"]

fixes the issue (or at least it runs and produces some result). From the context I understood that question_text is used as a question id. My dataset has a question id in its metadata, which I use in the snippet above, but there was no way to instruct Argilla to use it.
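
For reference, this is roughly how my workaround slots in (a sketch only; "q_id" is a metadata key specific to my dataset, not something Argilla provides):

    import json

    def question_id_from_metadata(row: dict) -> str:
        # Workaround sketch: take the question id from the record metadata
        # instead of a hard-coded "text" field; "q_id" is specific to my data.
        return json.loads(row["metadata"])["q_id"]

    # Inside the library loop, question_text = row["text"] then becomes:
    # question_text = question_id_from_metadata(row)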

Stacktrace and Code to create the bug

src/dataset_creation/annotation_dashboard/iaa.py:115: in argilla_iaa
    metrics_report = metric.compute("alpha")
.venv/lib/python3.11/site-packages/argilla/client/feedback/metrics/agreement_metrics.py:232: in compute
    dataset = prepare_dataset_for_annotation_task(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

dataset = FeedbackDataset(
   fields=[TextField(name='question', title='Question / query to the system', required=True, type=<Fi...stionTypes.text: 'text'>, use_markdown=False)]
   guidelines=None)
   metadata_properties=[])
   vectors_settings=[])
)
question_name = 'toxicity', filter_by = None, sort_by = None, max_records = None

    def prepare_dataset_for_annotation_task(
        dataset: Union["FeedbackDataset", "RemoteFeedbackDataset"],
        question_name: str,
        filter_by: Optional[Dict[str, Union["ResponseStatusFilter", List["ResponseStatusFilter"]]]] = None,
        sort_by: Optional[List["SortBy"]] = None,
        max_records: Optional[int] = None,
    ) -> "FormattedResponses":
        """Helper function to prepare the dataset for the nltk's AnnotationTask.

        The AnnotationTask class from nltk expects the data to be formatted as a list
        of tuples, each containing the annotator id, the task id and the label.

        The AnnotationTask is supposed to deal with sets and hashable objects, but
        there are errors transforming the data to sets and using the MASI distance function.
        For the moment what we do with those type of questions is create a string
        with all the values, and use the edit distance function.

        Note:
            We could potentially extend the functionality to a more than a question name. The
            requirement would be that all the questions are of the same type, as that would
            determine the type of distance function to use.

        Args:
            dataset: FeedbackDataset to compute the metrics.
            question_name: Name of the question for which we want to analyse the agreement.
            filter_by: A dict with key the field to filter by, and values the filters to apply.
                Can be one of: draft, pending, submitted, and discarded. If set to None,
                no filter will be applied. Defaults to None (no filter is applied).
            sort_by: A list of `SortBy` objects to sort your dataset by.
                Defaults to None (no filter is applied).
            max_records: The maximum number of records to use for training. Defaults to None.

        Returns:
            formatted_responses: The responses formatted as a list of tuples of (user_id, question_id, value).
        """
        question_type = type(dataset.question_by_name(question_name))
        # Check to assume the remote questions behave just like local ones
        if issubclass(question_type, RemoteSchema):
            question_type = type(dataset.question_by_name(question_name).to_local())

        supported_question_types = list(QUESTION_TO_DISTANCE.keys())
        if question_type not in supported_question_types:
            raise NotImplementedError(
                f"Question '{question_name}' is of type '{question_type}', the supported question types are: {supported_question_types}."
            )

        if filter_by:
            dataset = dataset.filter_by(**filter_by)
        if sort_by:
            dataset = dataset.sort_by(sort_by)
        if max_records:
            dataset = dataset.pull(max_records=max_records)

        hf_dataset = dataset.format_as("datasets")

        formatted_responses: FormattedResponses = []

        for row in hf_dataset:
            responses_ = row[question_name]
>           question_text = row["text"]
E           KeyError: 'text'

.venv/lib/python3.11/site-packages/argilla/client/feedback/metrics/agreement_metrics.py:112: KeyError

Expected behavior
The user is able to provide a unique id for the question at this step of the process, or a unique id is generated from the fields available (not necessarily "text").
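
A possible shape of the second option, sketched only as an illustration (the helper name and the hashing choice are mine, not part of Argilla):

    import hashlib
    from typing import List

    def derive_record_id(row: dict, field_names: List[str]) -> str:
        # Hypothetical: build a stable task id from whatever fields the dataset
        # declares, so no particular field name (such as "text") is required.
        joined = "|".join(str(row.get(name, "")) for name in field_names)
        return hashlib.sha256(joined.encode("utf-8")).hexdigest()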

Environment:

Additional context
It might be relevant (though, based on my understanding, it does not affect the behaviour) that the FeedbackDataset is in fact pulled from the Hugging Face Hub when this occurs.

sdiazlor commented 3 months ago

Hi @mpjuhasz, thank you for reporting. Krippendorff's Alpha is calculated based on the responses, so the fields should not come into play. For instance, this is a practical example of calculating it. What type of question are you using?
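
For context, the basic call pattern is roughly the following (a sketch assuming argilla 1.x's AgreementMetric; "my-dataset" and "my-workspace" are placeholders, and connection setup via rg.init(...) is omitted):

    import argilla as rg
    from argilla.client.feedback.metrics import AgreementMetric

    # The dataset is assumed to live on an Argilla server so that responses
    # carry the user_id required for the agreement computation.
    dataset = rg.FeedbackDataset.from_argilla("my-dataset", workspace="my-workspace")
    metric = AgreementMetric(dataset=dataset, question_name="toxicity")
    metrics_report = metric.compute("alpha")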

mpjuhasz commented 3 months ago

@sdiazlor, thanks for the quick response. Yes, that's what I was expecting too, yet it seems to look for this "text" field. I'm using label_selection type questions. Here's an example question from the argilla.yaml:

- description: null
  id: <uuid>
  labels:
    DONT_KNOW: Don't know
    'NO': 'No'
    'YES': 'Yes'
  name: toxicity
  required: true
  title: Does the response contain any toxic, harmful, or inappropriate content?
  type: label_selection
  visible_labels: null
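
For completeness, the same question declared through the Python client would look roughly like this (a sketch assuming argilla 1.x's LabelQuestion, which maps to the label_selection type above):

    import argilla as rg

    toxicity_question = rg.LabelQuestion(
        name="toxicity",
        title="Does the response contain any toxic, harmful, or inappropriate content?",
        labels={"YES": "Yes", "NO": "No", "DONT_KNOW": "Don't know"},
        required=True,
    )
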
sdiazlor commented 3 months ago

@mpjuhasz Thank you, I'll check it.

davidberenstein1957 commented 2 months ago

@nataliaElv, potential bug fix that needs to happen

frascuchon commented 2 months ago

@plaguss any ideas on what's happening?

sdiazlor commented 2 months ago

@frascuchon @plaguss As far as I could check, when the dataset is converted to a Dataset, the columns take the field names. If no field is named "text", the lookup won't match, raising the error:

    hf_dataset = dataset.format_as("datasets")

    formatted_responses: FormattedResponses = []

    for row in hf_dataset:
        responses_ = row[question_name]
        question_text = row["text"]
        for response in responses_:
            user_id = response["user_id"]
            if user_id is None:
                raise ValueError(
                    "Please push your dataset to argilla to have the user_id necessary for this computation."
                )
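
To make the mismatch concrete, a minimal local repro of this diagnosis could look like the following (field names and record contents are made up for illustration):

    import argilla as rg

    dataset = rg.FeedbackDataset(
        fields=[rg.TextField(name="question", required=True)],
        questions=[rg.LabelQuestion(name="toxicity", labels={"YES": "Yes", "NO": "No"})],
    )
    dataset.add_records([rg.FeedbackRecord(fields={"question": "Is this response toxic?"})])

    hf_dataset = dataset.format_as("datasets")
    # The column names follow the declared field and question names, so there is
    # no "text" column here and row["text"] raises a KeyError.
    print(hf_dataset.column_names)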

I'll work on this one.

plaguss commented 2 months ago

Hi @frascuchon @sdiazlor, I was reviewing the function; it is supposed to generate a data structure to be ingested by nltk's AnnotationTask class. The "question_text" corresponds to the "question_id" to be annotated in that task. I reviewed the tests from the corresponding PR (https://github.com/argilla-io/argilla/pull/4271/files#diff-50ab5090f045fe6dbc539fc2d511c315b7d0d248006198d0257ccc6718e5663cR88), and apparently the key "text" was used because that was the field name in the sample dataset used by the tests:

@pytest.fixture
def feedback_dataset_fields() -> List["AllowedFieldTypes"]:
    return [
        TextField(name="text", required=True),
        TextField(name="label", required=True),
    ]

I cannot really find a reason why "text" is hardcoded there. It could be an optional argument of the function with that name as the default, or, as Sara mentions, if the field is not actually used for the alpha case, it could be filled with a placeholder.
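
One possible shape of that first option, sketched here only as an illustration (the helper name and the fallback behaviour are mine, not the actual implementation):

    from typing import Optional

    def question_task_id(row: dict, idx: int, id_field: Optional[str] = "text") -> str:
        # Hypothetical helper: use a configurable field as the task id and fall
        # back to the record index when that field is missing from the dataset.
        value = row.get(id_field) if id_field else None
        return str(value) if value is not None else f"record-{idx}"

Inside the loop, question_text = row["text"] would then become question_text = question_task_id(row, idx).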