MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

Error loading custom dataset #90

Open tkap243 opened 1 year ago

tkap243 commented 1 year ago

Description

Hello,

I am having trouble loading my custom dataset. I followed the guide in the main README and am getting the errors below.

What I Did

from octis.dataset.dataset import Dataset
import pandas as pd

df = pd.read_csv("/mnt/mydata/notebooks/data.csv")

df.to_csv('corpus.tsv', sep="\t", header=False, columns=['documents'])
dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:330: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  final_df = df[df[1] == 'train'].append(df[df[1] == 'val'])
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py:331: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  final_df = final_df.append(df[df[1] == 'test'])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
    335 
--> 336                 self.__corpus = [d.split() for d in final_df[0].tolist()]
    337                 if len(final_df.keys()) > 2:

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in <listcomp>(.0)
    335 
--> 336                 self.__corpus = [d.split() for d in final_df[0].tolist()]
    337                 if len(final_df.keys()) > 2:

AttributeError: 'int' object has no attribute 'split'

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
<ipython-input-16-28e6bd2fc3cd> in <module>
      1 dataset = Dataset()
----> 2 dataset.load_custom_dataset_from_folder("/mnt/mydata/notebooks")

/opt/conda/lib/python3.8/site-packages/octis/dataset/dataset.py in load_custom_dataset_from_folder(self, path, multilabel)
    356                 self._load_document_indexes(self.dataset_path + "/indexes.txt")
    357         except:
--> 358             raise Exception("error in loading the dataset:" + self.dataset_path)
    359 
    360     def fetch_dataset(self, dataset_name, data_home=None, download_if_missing=True):

Exception: error in loading the dataset:/mnt/mydata/notebooks
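
One plausible reading of the AttributeError above: `DataFrame.to_csv` writes the integer row index as the first column unless `index=False` is passed, so column 0 of corpus.tsv would hold integers rather than document text. The sketch below is a guess at a fix, not a confirmed one. The sample documents, the temporary output folder, and the `partition` column (assumed here to be the second column OCTIS looks for, with values such as "train") are all illustrative assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the data loaded from data.csv in the post.
df = pd.DataFrame({"documents": ["first example document",
                                 "second example document"]})

# Assumption: document text goes in column 0 and a partition label
# ("train", "val" or "test") in column 1.
df["partition"] = "train"

# Throwaway folder for the demo; in practice this would be the dataset folder.
corpus_path = Path(tempfile.mkdtemp()) / "corpus.tsv"

# index=False keeps pandas from writing the integer row index as column 0.
df.to_csv(corpus_path, sep="\t", header=False, index=False,
          columns=["documents", "partition"])

# Read the file back without a header row and confirm column 0 is text.
check = pd.read_csv(corpus_path, sep="\t", header=None)
first_column_is_text = all(isinstance(d, str) for d in check[0].tolist())
print(first_column_is_text)
```

If the check prints `False`, the first column still contains non-string values and `d.split()` would fail in the same way.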

SaraAmd commented 1 year ago

The [Load a Custom Dataset] section says that the dataset should include a vocabulary file, but my dataset is just a CSV file. How can we generate this vocab file? Does the pipeline generate it automatically?

tkap243 commented 1 year ago

Per the README, the custom dataset is a TSV file, which is what we converted our CSV into. I'm not sure what the vocab file should contain.

silviatti commented 1 year ago

Hi, the vocabulary file is just the list of words contained in the documents. You can see #92 on how to generate it from the tsv file.
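
As a rough illustration of "the list of words contained in the documents", the sketch below collects the unique whitespace-separated tokens from the first column of a TSV file and writes them one per line. It is not the approach from #92, which is not reproduced here. The helper name `build_vocabulary`, the output file name `vocabulary.txt`, the one-word-per-line layout, and the sample corpus are assumptions made for the example.

```python
import csv
import tempfile
from pathlib import Path


def build_vocabulary(corpus_path, vocab_path):
    """Write the sorted unique tokens of column 0 to vocab_path, one per line."""
    vocab = set()
    with open(corpus_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if row:  # skip empty lines
                vocab.update(row[0].split())
    words = sorted(vocab)
    Path(vocab_path).write_text("\n".join(words) + "\n", encoding="utf-8")
    return words


# Hypothetical two-document corpus in a throwaway folder.
folder = Path(tempfile.mkdtemp())
(folder / "corpus.tsv").write_text(
    "the cat sat\ttrain\nthe dog ran\ttrain\n", encoding="utf-8"
)

words = build_vocabulary(folder / "corpus.tsv", folder / "vocabulary.txt")
print(words)  # ['cat', 'dog', 'ran', 'sat', 'the']
```

Any preprocessing applied to the documents (lowercasing, stop-word removal) would need to happen before this step so that the vocabulary matches the tokens actually present in the corpus.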