argilla-io / argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
https://docs.argilla.io
Apache License 2.0

Cleanlab 2.2.0 fails on the `find_label_errors` function #1982

Closed. frascuchon closed this 1 year ago

frascuchon commented 1 year ago

Describe the bug
Using the new cleanlab version (2.2.0), the `find_label_errors` function fails. It still works with cleanlab 2.0.0:

src/argilla/labeling/text_classification/label_errors.py:108: in find_label_errors
    indices = find_label_issues(s, psx, n_jobs=n_jobs, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

labels = array([list([0]), list([0, 1]), list([0]), list([0, 1]), list([0]),
       list([0, 1]), list([0]), list([0, 1]), list([0]), list([0, 1]),
       list([0]), list([0, 1])], dtype=object)
pred_probs = array([[0.1 , 0.9 ],
       [0.01, 0.9 ],
       [0.1 , 0.9 ],
       [0.01, 0.9 ],
       [0.1 , 0.9 ],
       [0.01,...[0.1 , 0.9 ],
       [0.01, 0.9 ],
       [0.1 , 0.9 ],
       [0.01, 0.9 ],
       [0.1 , 0.9 ],
       [0.01, 0.9 ]])

    def find_label_issues(
        labels: LabelLike,
        pred_probs: np.ndarray,
        *,
        return_indices_ranked_by: Optional[str] = None,
        rank_by_kwargs: Optional[Dict[str, Any]] = None,
        filter_by: str = "prune_by_noise_rate",
        multi_label: bool = False,
        frac_noise: float = 1.0,
        num_to_remove_per_class: Optional[int] = None,
        min_examples_per_class=1,
        confident_joint: Optional[np.ndarray] = None,
        n_jobs: Optional[int] = None,
        verbose: bool = False,
    ) -> np.ndarray:
        """
        Identifies potentially bad labels in a classification dataset using confident learning.

        Returns a boolean mask for the entire dataset where ``True`` represents
        an example identified with a label issue and ``False`` represents an example that seems correctly labeled.

        Instead of a mask, you can obtain indices of the examples with label issues in your dataset
        (sorted by issue severity) by specifying the `return_indices_ranked_by` argument.
        This determines which label quality score is used to quantify severity,
        and is useful to view only the top-`J` most severe issues in your dataset.

        The number of indices returned as issues is controlled by `frac_noise`: reduce its
        value to identify fewer label issues. If you aren't sure, leave this set to 1.0.

        Tip: if you encounter the error "pred_probs is not defined", try setting
        ``n_jobs=1``.

        Parameters
        ----------
        labels : np.ndarray or list
          A discrete vector of noisy labels for a classification dataset, i.e. some labels may be erroneous.
          *Format requirements*: for dataset with K classes, each label must be integer in 0, 1, ..., K-1.
          For a standard (multi-class) classification dataset where each example is labeled with one class,
          `labels` should be 1D array of shape ``(N,)``, for example: ``labels = [1,0,2,1,1,0...]``.
          For a multi-label classification dataset where each example can belong to multiple (or no) classes,
          `labels` should be an iterable of iterables (e.g. ``List[List[int]]``) whose i-th element corresponds to list of classes that i-th example belongs to (e.g. ``labels = [[1,2],[1],[0],[],...]``).

        pred_probs : np.ndarray, optional
          An array of shape ``(N, K)`` of model-predicted class probabilities,
          ``P(label=k|x)``. Each row of this matrix corresponds
          to an example `x` and contains the model-predicted probabilities that
          `x` belongs to each possible class, for each of the K classes. The
          columns must be ordered such that these probabilities correspond to
          class 0, 1, ..., K-1.

          **Note**: Returned label issues are most accurate when they are computed based on out-of-sample `pred_probs` from your model.
          To obtain out-of-sample predicted probabilities for every datapoint in your dataset, you can use :ref:`cross-validation <pred_probs_cross_val>`.
          This is encouraged to get better results.

        return_indices_ranked_by : {None, 'self_confidence', 'normalized_margin', 'confidence_weighted_entropy'}, default=None
          Determines what is returned by this method: either a boolean mask or an np.ndarray of indices.
          If ``None``, this function returns a boolean mask (``True`` if example at index is label error).
          If not ``None``, this function returns a sorted array of indices of examples with label issues
          (instead of a boolean mask). Indices are sorted by label quality score which can be one of:

          - ``'normalized_margin'``: ``normalized margin (p(label = k) - max(p(label != k)))``
          - ``'self_confidence'``: ``[pred_probs[i][labels[i]] for i in label_issues_idx]``
          - ``'confidence_weighted_entropy'``: ``entropy(pred_probs) / self_confidence``

        rank_by_kwargs : dict, optional
          Optional keyword arguments to pass into scoring functions for ranking by
          label quality score (see :py:func:`rank.get_label_quality_scores
          <cleanlab.rank.get_label_quality_scores>`).

        filter_by : {'prune_by_class', 'prune_by_noise_rate', 'both', 'confident_learning', 'predicted_neq_given'}, default='prune_by_noise_rate'
          Method to determine which examples are flagged as having label issue, so you can filter/prune them from the dataset. Options:

          - ``'prune_by_noise_rate'``: filters examples with *high probability* of being mislabeled for every non-diagonal in the confident joint (see `prune_counts_matrix` in `filter.py`). These are the examples where (with high confidence) the given label is unlikely to match the predicted label for the example.
          - ``'prune_by_class'``: filters the examples with *smallest probability* of belonging to their given class label for every class.
          - ``'both'``: filters only those examples that would be filtered by both ``'prune_by_noise_rate'`` and ``'prune_by_class'``.
          - ``'confident_learning'``: filters the examples counted as part of the off-diagonals of the confident joint. These are the examples that are confidently predicted to be a different label than their given label.
          - ``'predicted_neq_given'``: filters examples for which the predicted class (i.e. argmax of the predicted probabilities) does not match the given label.

        multi_label : bool, optional
          If ``True``, labels should be an iterable (e.g. list) of iterables, containing a
          list of class labels for each example, instead of just a single label.
          The multi-label setting supports classification tasks where an example can belong to more than 1 class or none of the classes (rather than exactly one class as in standard multi-class classification).
          Example of a multi-labeled `labels` input: ``[[0,1], [1], [0,2], [0,1,2], [0], [1], [], ...]``. This says the first example in dataset belongs to both class 0 and class 1, according to its given label.
          Each row of `pred_probs` no longer needs to sum to 1 in multi-label settings, since one example can now belong to multiple classes simultaneously.

        frac_noise : float, default=1.0
          Used to only return the "top" ``frac_noise * num_label_issues``. The choice of which "top"
          label issues to return is dependent on the `filter_by` method used. It works by reducing the
          size of the off-diagonals of the `joint` distribution of given labels and true labels
          proportionally by `frac_noise` prior to estimating label issues with each method.
          This parameter only applies for `filter_by=both`, `filter_by=prune_by_class`, and
          `filter_by=prune_by_noise_rate` methods and currently is unused by other methods.
          When ``frac_noise=1.0``, return all "confident" estimated noise indices (recommended).

          Reducing it flags roughly ``frac_noise * number_of_mislabeled_examples_in_class_k`` examples per class k.
          Note: specifying `frac_noise` is not yet supported if `multi_label` is True.

        num_to_remove_per_class : array_like
          An iterable of length K, the number of classes.
          E.g. if K = 3, ``num_to_remove_per_class=[5, 0, 1]`` would return
          the indices of the 5 most likely mislabeled examples in class 0,
          and the most likely mislabeled example in class 2.

          Note
          ----
          Only set this parameter if ``filter_by='prune_by_class'``.
          You may use with ``filter_by='prune_by_noise_rate'``, but
          if ``num_to_remove_per_class=k``, then either k-1, k, or k+1
          examples may be removed for any class due to rounding error. If you need
          exactly 'k' examples removed from every class, you should use
          ``filter_by='prune_by_class'``.

        min_examples_per_class : int, default=1
          Minimum number of examples per class to avoid flagging as label issues.
          This is useful to avoid deleting too much data from one class
          when pruning noisy examples in datasets with rare classes.

        confident_joint : np.ndarray, optional
          An array of shape ``(K, K)`` representing the confident joint, the matrix used for identifying label issues, which
          estimates a confident subset of the joint distribution of the noisy and true labels, ``P_{noisy label, true label}``.
          Entry ``(j, k)`` in the matrix is the number of examples confidently counted into the pair of ``(noisy label=j, true label=k)`` classes.
          The `confident_joint` can be computed using :py:func:`count.compute_confident_joint <cleanlab.count.compute_confident_joint>`.
          If not provided, it is computed from the given (noisy) `labels` and `pred_probs`.
          If `multi_label` is True, `confident_joint` should instead be a one-vs-rest array with shape ``(K, 2, 2)`` as returned by :py:func:`count.compute_confident_joint <cleanlab.count.compute_confident_joint>` function.

        n_jobs : optional
          Number of processing threads used by multiprocessing. Default ``None``
          sets to the number of cores on your CPU.
          Set this to 1 to *disable* parallel processing (if it's causing issues).
          Windows users may see a speed-up with ``n_jobs=1``.

        verbose : optional
          If ``True``, prints when multiprocessing happens.

        Returns
        -------
        label_issues : np.ndarray
          If `return_indices_ranked_by` left unspecified, returns a boolean **mask** for the entire dataset
          where ``True`` represents a label issue and ``False`` represents an example that is
          accurately labeled with high confidence.
          If `return_indices_ranked_by` is specified, returns a shorter array of **indices** of examples identified to have
          label issues (i.e. those indices where the mask would be ``True``), sorted by likelihood that the corresponding label is correct.

          Note
          ----
          Obtain the *indices* of examples with label issues in your dataset by setting `return_indices_ranked_by`.
        """
        if not rank_by_kwargs:
            rank_by_kwargs = {}

        assert filter_by in [
            "prune_by_noise_rate",
            "prune_by_class",
            "both",
            "confident_learning",
            "predicted_neq_given",
        ]  # TODO: change default to confident_learning ?
        allow_one_class = False
        if isinstance(labels, np.ndarray) or all(isinstance(lab, int) for lab in labels):
>           if set(labels) == {0}:  # occurs with missing classes in multi-label settings
E           TypeError: unhashable type: 'list'
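Looking at the pre-check shown above, the failure seems to come from the `labels` sanity check: when the multi-label annotations arrive as an object array of lists, `isinstance(labels, np.ndarray)` is true, so cleanlab evaluates `set(labels)` and Python cannot hash the inner lists. A tiny illustration of just that mechanism (toy values, not the actual test data):

```python
import numpy as np

# Multi-label annotations in the same shape as the `labels` value in the
# traceback above: a 1-D object array whose elements are lists of class ids.
labels = np.array([[0], [0, 1], [0]], dtype=object)

print(isinstance(labels, np.ndarray))  # True, so the `set(labels) == {0}` check runs
set(labels)                            # raises TypeError: unhashable type: 'list'
```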

To Reproduce
Steps to reproduce the behavior:

  1. Install cleanlab 2.2.0
  2. Launch the test tests/labeling/text_classification/test_label_errors.py::test_multi_label_warning
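For reference, a minimal standalone sketch that appears to hit the same error outside the argilla test suite. The data below is made up, and passing `multi_label=True` and `n_jobs=1` explicitly is an assumption about how the wrapper invokes cleanlab, not a copy of the actual call in `label_errors.py`:

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Toy multi-label data shaped like the `labels`/`pred_probs` values in the
# traceback above (hypothetical values).
labels = np.array([[0], [0, 1], [0], [0, 1]], dtype=object)
pred_probs = np.array([[0.1, 0.9], [0.01, 0.9], [0.1, 0.9], [0.01, 0.9]])

# On cleanlab 2.2.0 this raises "TypeError: unhashable type: 'list'" in the
# input check shown above; per the report, the argilla call path works on 2.0.0.
find_label_issues(labels, pred_probs, multi_label=True, n_jobs=1)
```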

Expected behavior
`find_label_errors` should run without errors on cleanlab 2.2.0, the same way it does on cleanlab 2.0.0.


ufukhurriyetoglu commented 1 year ago

@frascuchon the merged PR closes this issue, right?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 90 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 30 days since being marked as stale.