MaartenGr / Concept

Concept Modeling: Topic Modeling on Images and Text
https://maartengr.github.io/Concept/
MIT License

Pandas key error during model fitting #14

Closed amrakm closed 1 year ago

amrakm commented 1 year ago

I tried the demo code and it worked for a small sample. When I fed it more images, I got this error: KeyError: '[-1] not found in axis'

Dependencies: concept==0.2.1, pandas==1.4.0

/home/<username>/anaconda3/envs/rd38/lib/python3.8/site-packages/torchvision/transforms/transforms.py:332: UserWarning: Argument 'interpolation' of type int is deprecated since 0.13 and will be removed in 0.15. Please use InterpolationMode enum.
  warnings.warn(
100%|███████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:21<00:00,  1.06s/it]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [30], in <cell line: 3>()
      1 from concept import ConceptModel
      2 concept_model = ConceptModel()
----> 3 concepts = concept_model.fit_transform(img_names[3500:6000])

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/concept/_model.py:124, in ConceptModel.fit_transform(self, images, docs, image_names, image_embeddings)
    122 # Reduce dimensionality and cluster images into concepts
    123 reduced_embeddings = self._reduce_dimensionality(image_embeddings)
--> 124 predictions = self._cluster_embeddings(reduced_embeddings)
    126 # Extract representative images through exemplars
    127 representative_images = self._extract_exemplars(image_names)

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/concept/_model.py:261, in ConceptModel._cluster_embeddings(self, embeddings)
    257 self.cluster_labels = sorted(list(set(self.hdbscan_model.labels_)))
    258 predicted_clusters = list(self.hdbscan_model.labels_)
    260 self.frequency = (
--> 261     pd.DataFrame({"Cluster": predicted_clusters, "Count": predicted_clusters})
    262       .groupby("Cluster")
    263       .count()
    264       .drop(-1)
    265       .sort_values("Count", ascending=False)
    266 )
    267 return predicted_clusters

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/pandas/core/frame.py:4956, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4808 @deprecate_nonkeyword_arguments(version=None, allowed_args=["self", "labels"])
   4809 def drop(
   4810     self,
   (...)
   4817     errors: str = "raise",
   4818 ):
   4819     """
   4820     Drop specified labels from rows or columns.
   4821 
   (...)
   4954             weight  1.0     0.8
   4955     """
-> 4956     return super().drop(
   4957         labels=labels,
   4958         axis=axis,
   4959         index=index,
   4960         columns=columns,
   4961         level=level,
   4962         inplace=inplace,
   4963         errors=errors,
   4964     )

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/pandas/core/generic.py:4279, in NDFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   4277 for axis, labels in axes.items():
   4278     if labels is not None:
-> 4279         obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4281 if inplace:
   4282     self._update_inplace(obj)

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/pandas/core/generic.py:4323, in NDFrame._drop_axis(self, labels, axis, level, errors, consolidate, only_slice)
   4321         new_axis = axis.drop(labels, level=level, errors=errors)
   4322     else:
-> 4323         new_axis = axis.drop(labels, errors=errors)
   4324     indexer = axis.get_indexer(new_axis)
   4326 # Case for non-unique axis
   4327 else:

File ~/anaconda3/envs/rd38/lib/python3.8/site-packages/pandas/core/indexes/base.py:6644, in Index.drop(self, labels, errors)
   6642 if mask.any():
   6643     if errors != "ignore":
-> 6644         raise KeyError(f"{list(labels[mask])} not found in axis")
   6645     indexer = indexer[~mask]
   6646 return self.delete(indexer)

KeyError: '[-1] not found in axis'

MaartenGr commented 1 year ago

Apologies for the late reply! It seems that there were no outliers found, which happens very rarely. I'll make sure that it gets fixed!
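
A minimal sketch of the failure and one possible guard, assuming the frequency table is built as in the traceback above. HDBSCAN labels outliers as -1; when no point gets that label, the grouped frame has no -1 row and .drop(-1) raises the KeyError. The helper name cluster_frequency is hypothetical, and errors="ignore" is only one way to handle this, not necessarily the fix that was pushed to the main branch.

```python
import pandas as pd


def cluster_frequency(labels):
    """Count members per cluster, skipping the outlier label -1 if it exists.

    Hypothetical helper mirroring the pipeline shown in the traceback.
    """
    return (
        pd.DataFrame({"Cluster": labels, "Count": labels})
        .groupby("Cluster")
        .count()
        # errors="ignore" makes the drop a no-op when -1 is absent
        .drop(-1, errors="ignore")
        .sort_values("Count", ascending=False)
    )


# Labels with outliers: the -1 row is removed.
with_outliers = cluster_frequency([-1, 0, 0, 1, 1, 1])

# Labels without outliers: the unguarded .drop(-1) would raise KeyError here.
without_outliers = cluster_frequency([0, 0, 1, 1, 1])
```

Both calls return the same two-row table, with cluster 1 first because it has the most members.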

MaartenGr commented 1 year ago

I just pushed a fix to the main branch which should hopefully solve your issue!

bakachan19 commented 1 year ago

Hi @MaartenGr. I tried to use concept in Google Colab. I installed it with pip, and the concept version is 0.2.1. I still get the KeyError: '[-1] not found in axis' error when I use a particular dataset. Any ideas on what might be the issue?

Thank you for your time and help.

MaartenGr commented 1 year ago

@bakachan19 If you install it through the main branch, it should have the fix for the error you are getting.

bakachan19 commented 1 year ago

Oh, I see. Thanks a lot @MaartenGr.

bakachan19 commented 1 year ago

Hi @MaartenGr. I apologize for bothering you again. I installed the concept package from the main branch and made sure that the scikit-learn version is compatible. I no longer get the previous error, but I get several different ones depending on the minimum concept size. I am using the default concept model configuration and only change min_concept_size.

ValueError: attempt to get argmax of an empty sequence

- with min_concept_size = 10, I get this one:

/usr/local/lib/python3.9/dist-packages/concept/_model.py in (.0)
    353
    354
--> 355 selected_exemplars = {cluster: mmr(self.cluster_embeddings[cluster],
    356                                    exemplar_embeddings[cluster],
    357                                    representative_images[cluster]["Indices"],

IndexError: list index out of range

Thank you for your time and help.

MaartenGr commented 1 year ago

@bakachan19 Strange, I am not entirely sure what is happening. Could you share your full code and the versions of the packages in your environment? I will look into this. In the meantime, there is an option to use images with BERTopic that should provide similar, albeit not identical, functionality.

bakachan19 commented 1 year ago

@MaartenGr I managed to make it work with a different UMAP configuration: by changing n_neighbors from 15 to a smaller number like 5, I was able to run the code with min_concept_size = 10. I think my data is unusual, and with some configurations the model does not find any clusters, or maybe it clusters everything together... For the environment setup, I use Google Colab with the following installation steps:

pip install scikit-learn==0.24.2
pip install git+https://github.com/MaartenGr/Concept.git

and then I just used the code provided in the tutorial:

from concept import ConceptModel
from umap import UMAP

umap_model = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric="cosine", random_state=5, low_memory=False)
concept_model = ConceptModel(min_concept_size=10, umap_model=umap_model)

concepts = concept_model.fit_transform(images_name, docs=all_nouns)

Thank you for your time! Have a great day.
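
One way to test the guess above (no clusters found, or everything clustered together) is to look at the label distribution. This is a small sketch, assuming the labels are still available on concept_model.hdbscan_model.labels_ after a failed fit, as the attribute name in the first traceback suggests. summarize_labels is a hypothetical helper and not part of concept.

```python
from collections import Counter


def summarize_labels(labels):
    """Summarize cluster labels, treating -1 as the outlier label.

    Hypothetical helper; `labels` can be any iterable of integer labels,
    for example the `labels_` array of a fitted HDBSCAN model.
    """
    counts = Counter(int(label) for label in labels)
    n_outliers = counts.pop(-1, 0)  # remove outliers from the cluster counts
    return {
        "n_clusters": len(counts),
        "n_outliers": n_outliers,
        "sizes": dict(counts),
    }


# Everything flagged as an outlier: no clusters were found.
all_outliers = summarize_labels([-1, -1, -1])

# Everything in one cluster and no outliers at all.
single_cluster = summarize_labels([0, 0, 0, 0])
```

A result with n_clusters equal to 0, or with a single cluster holding every point, would be consistent with the explanation above and would suggest trying other n_neighbors or min_concept_size values.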

MaartenGr commented 1 year ago

Glad to hear that you solved the issue and thanks for sharing your solution. This will definitely help others having the same issue.