MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
6.16k stars 764 forks

Error while loading custom data for BERTopic_evaluation repo #1211

Closed Pratik--Patel closed 1 year ago

Pratik--Patel commented 1 year ago

I am trying to use https://github.com/MaartenGr/BERTopic_evaluation to run the OCTIS topic evaluation on my custom data. The datasets that can be fetched through OCTIS, for example NewsGroup20, work fine. But when I try to load custom data by following the instructions, I run into the following error.

!pip install bertopic==0.14.1
!pip install octis

from evaluation import Trainer, DataLoader
from evaluation import DataLoader

my_docs = [ ["apple", "banana", "oragnge"], ["google", "facebook", "orkut"]  ]
DataLoader(dataset="my_docs").prepare_docs(save="my_docs.txt", docs=my_docs).preprocess_octis(output_folder="my_docs")

This returns the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-3-53a122409303> in <module>
      3 
      4 my_docs = [ ["apple", "banana", "oragnge"], ["google", "facebook", "orkut"]  ]
----> 5 DataLoader(dataset="my_docs").prepare_docs(save="my_docs.txt", docs=my_docs).preprocess_octis(output_folder="my_docs")

~\Anaconda3\lib\site-packages\evaluation\data.py in preprocess_octis(self, preprocessor, documents_path, output_folder)
    178         if not documents_path:
    179             documents_path = self.doc_path
--> 180         dataset = preprocessor.preprocess_dataset(documents_path=documents_path)
    181         dataset.save(output_folder)
    182 

~\Anaconda3\lib\site-packages\octis\preprocessing\preprocessing.py in preprocess_dataset(self, documents_path, labels_path, multilabel)
    135         :return octis.dataset.dataset.Dataset
    136         """
--> 137         docs = [line.strip() for line in open(documents_path, 'r').readlines()]
    138         if self.num_processes is not None:
    139             # with Pool(self.num_processes) as p:

TypeError: expected str, bytes or os.PathLike object, not NoneType

Any idea why this happens? Does DataLoader expect the custom data to be in a different format?

Pratik--Patel commented 1 year ago

One thing I noticed is that the file my_docs.txt is never generated when I run the following:

DataLoader(dataset="my_docs").prepare_docs(save="my_docs.txt", docs=my_docs)

After this call, the file my_docs.txt should have been created, right?

MaartenGr commented 1 year ago

That repository was not meant to be used with custom datasets but only for the evaluation that was done in the paper. You might have to adjust the underlying code to make it usable for your own datasets.
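One possible adjustment, as a minimal sketch rather than the repository's own method: the traceback shows that preprocess_octis falls back to self.doc_path when no documents_path is given, and that OCTIS then reads the file one document per line. Writing the custom documents to such a file yourself avoids the None path. The helper name write_docs_one_per_line is hypothetical, and the commented-out call at the end is an untested assumption based only on the signature visible in the traceback.

```python
from pathlib import Path


def write_docs_one_per_line(docs, path):
    """Write each document on its own line, the layout OCTIS reads in the traceback."""
    lines = []
    for doc in docs:
        # Join pre-tokenized documents into one string; pass plain strings through.
        text = doc if isinstance(doc, str) else " ".join(doc)
        # A newline inside a document would split it in two, so flatten whitespace.
        lines.append(" ".join(text.split()))
    path = Path(path)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


my_docs = [["apple", "banana", "orange"], ["google", "facebook", "orkut"]]
doc_path = write_docs_one_per_line(my_docs, "my_docs.txt")

# Fail early with a clear message instead of the TypeError raised on a None path.
if not doc_path.is_file():
    raise FileNotFoundError(f"{doc_path} was not written")

# Assumption, not verified against the repository: pass the path explicitly so
# that preprocess_octis does not fall back to self.doc_path.
# from evaluation import DataLoader
# DataLoader(dataset="my_docs").preprocess_octis(
#     documents_path=str(doc_path), output_folder="my_docs"
# )
```

Whether the explicit documents_path is enough for the rest of the evaluation pipeline depends on the repository's code, which may still need changes for custom datasets.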

MaartenGr commented 1 year ago

Closing this due to inactivity. Let me know if I need to re-open the issue!