MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

num_samples should be a positive integer value, but got num_samples=0 #30

Open marianafdz465 opened 3 years ago

marianafdz465 commented 3 years ago

Description

I am not sure why, when I try to run the optimize function, I get this error: "num_samples should be a positive integer value, but got num_samples=0".

What I Did

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM
from octis.evaluation_metrics.coherence_metrics import Coherence
from octis.optimization.optimizer import Optimizer
from skopt.space.space import Real, Categorical

dataset = Dataset()
dataset.load_custom_dataset_from_folder("mydata")

model = CTM(num_topics=10,
            num_epochs=30,
            inference_type='zeroshot',
            bert_model="distiluse-base-multilingual-cased")

npmi = Coherence(texts=dataset.get_corpus())

search_space = {"num_layers": Categorical({1, 2, 3}),
                "num_neurons": Categorical({100, 200, 300}),
                "activation": Categorical({'relu', 'softplus'}),
                "dropout": Real(0.0, 0.95)
                }
optimization_runs=30
model_runs=1

optimizer=Optimizer()
optimization_result = optimizer.optimize(
    model, dataset, npmi, search_space, number_of_call=optimization_runs,
    model_runs=model_runs, save_models=True,
    extra_metrics=None, # to keep track of other metrics
    plot_best_seen=True, plot_model=True, plot_name="B0_plot",
    save_path='results2/test_ctm//')

I can't find where to set this variable "num_samples".

silviatti commented 3 years ago

Hi Mariana! Thanks for reporting this issue. I tried to reproduce the error using your code and some other data, but the error doesn't occur. Can you please share your data (by email if you like)? Can you also tell me the version of the library, your python version and your operating system?

Thank you,

Silvia

A11en0 commented 3 years ago

Same problem. How did you solve it?

silviatti commented 3 years ago

Hi A11en0, can you please share your code, version of the library, your python version, and your operating system?

I'd be happy to help solve the issue.

alyrazik commented 2 years ago

Hello, I have the same problem. I am using Colab and received this error: "ValueError: num_samples should be a positive integer value, but got num_samples=0". OCTIS version: 1.10.3

My code is below (data_sample is a pandas DataFrame with a text column containing articles in Arabic, not English):

data_sample['partition'] = 'train'
data_sample['partition'][0:100] = 'validation'
data_sample['partition'][100:200] = 'test'
columns_titles = ['text' ,'partition', 'targe']
data_sample=data_sample.reindex(columns=columns_titles)
data_sample.to_csv('/content/drive/MyDrive/Dataset/OCTIS/corpus.tsv', sep='\t', index=False, header=False)
doc = ['']
for text in data_sample['text']:
  doc = doc + [text]

doc = ' '.join(doc)
doc = list(set(doc.split()))
with open('/content/drive/MyDrive/Dataset/OCTIS/vocabulary.txt', 'w') as output_file:
    for token in doc:
        output_file.write(token + '\n')
from octis.dataset.dataset import Dataset
dataset = Dataset()
dataset.load_custom_dataset_from_folder("/content/drive/MyDrive/Dataset/OCTIS/")
from octis.models.CTM import CTM
model = CTM(num_topics=10)
model_output = model.train_model(dataset) # Train the model

Thanks for your help.

silviatti commented 2 years ago

Hello @alyrazik, could you send me the dataset (if possible) by email? I would really like to replicate this error but it has never happened with my data. So I wonder if it's something related to the data. Can you check if some documents are empty? Can you also share the full error stack?

Thanks a lot,

Silvia
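
For anyone who wants to run the empty-document check suggested above, here is a minimal sketch. It assumes the corpus is a tab-separated file with the document text in the first column, as in the corpus.tsv written in the code earlier in this thread. The function name find_empty_documents is hypothetical and is not part of OCTIS.

```python
def find_empty_documents(corpus_path, encoding="utf-8"):
    """Return the 0-based row indices whose document text is empty or whitespace.

    Assumption: one document per line, tab-separated, text in the first column.
    """
    empty_rows = []
    with open(corpus_path, encoding=encoding) as handle:
        for index, line in enumerate(handle):
            # The first tab-separated field is taken to be the document text.
            text = line.rstrip("\n").split("\t")[0]
            if not text.strip():
                empty_rows.append(index)
    return empty_rows
```

Calling find_empty_documents("corpus.tsv") returns a list of row indices; an empty list means no blank documents were found. This plain split does not handle quoted fields that contain tabs or newlines, so treat the result as a quick sanity check only.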

alyrazik commented 2 years ago

Hello @silviatti ,

Thank you. The full error is below. I sent you the dataset and link to my Colab code via email. Thanks.

ValueError                                Traceback (most recent call last)
<ipython-input-37-f0307d819d49> in <module>()
      5 #             bert_model="distiluse-base-multilingual-cased")
      6 model = CTM(num_topics=10)
----> 7 model_output = model.train_model(dataset) # Train the model
      8 cv = Coherence(texts=dataset.get_corpus(),topk=10, measure='c_npmi')
      9 topic_diversity = TopicDiversity(topk=10)

3 frames
/usr/local/lib/python3.7/dist-packages/octis/models/CTM.py in train_model(self, dataset, hyperparameters, top_words)
    113                                  reduce_on_plateau=self.hyperparameters['reduce_on_plateau'],
    114                                  topic_prior_variance=self.hyperparameters["prior_variance"])
--> 115             self.model.fit(x_train, x_valid, verbose=False)
    116             result = self.inference(x_test)
    117             return result

/usr/local/lib/python3.7/dist-packages/octis/models/contextualized_topic_models/models/ctm.py in fit(self, train_dataset, validation_dataset, save_dir, verbose)
    277                 validation_loader = DataLoader(
    278                     self.validation_data, batch_size=self.batch_size, shuffle=True,
--> 279                     num_workers=self.num_data_loader_workers)
    280                 # train epoch
    281                 s = datetime.datetime.now()

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers)
    266             else:  # map-style
    267                 if shuffle:
--> 268                     sampler = RandomSampler(dataset, generator=generator)
    269                 else:
    270                     sampler = SequentialSampler(dataset)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py in __init__(self, data_source, replacement, num_samples, generator)
    101         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
    102             raise ValueError("num_samples should be a positive integer "
--> 103                              "value, but got num_samples={}".format(self.num_samples))
    104 
    105     @property

ValueError: num_samples should be a positive integer value, but got num_samples=0

alyrazik commented 2 years ago

Hello @silviatti, some findings:

  1. The validation partition in the dataset has to be named 'val'. I was using 'validation' instead, which made the partitioning code exclude all rows with that value (hence 0 was seen as the number of samples). Also, after renaming the value to 'val', I had to go to the project folder and manually remove the _val.pkl file, which was no longer valid.
  2. The code that reads the .tsv and .txt files decodes them as the Windows encoding CP-1252 (not sure why), which is fine for English but not for Arabic. For Arabic, the data is saved as UTF-16, so the reading code should also accept an optional encoding argument, e.g. encoding='utf-16'.
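
The first finding can be checked before loading the dataset. Below is a minimal sketch that counts the partition labels in a corpus file and reports any label outside 'train', 'val' and 'test'. It assumes those three are the labels the loader recognizes, as the findings above suggest, and that the partition is the second tab-separated column. The names summarize_partitions and EXPECTED_PARTITIONS are hypothetical and are not part of OCTIS.

```python
from collections import Counter

# Assumption: these are the partition labels the loader recognizes,
# based on the findings reported in this thread.
EXPECTED_PARTITIONS = {"train", "val", "test"}


def summarize_partitions(corpus_path, encoding="utf-8"):
    """Count partition labels and list any that are not in EXPECTED_PARTITIONS."""
    counts = Counter()
    with open(corpus_path, encoding=encoding) as handle:
        for line in handle:
            columns = line.rstrip("\n").split("\t")
            # Rows without a second column are reported under a placeholder label.
            label = columns[1] if len(columns) > 1 else "<missing>"
            counts[label] += 1
    unexpected = sorted(set(counts) - EXPECTED_PARTITIONS)
    return counts, unexpected
```

If unexpected is non-empty (for example ['validation']) or counts['val'] is 0, the validation split would likely end up empty, which matches the num_samples=0 error in the traceback. The encoding parameter is passed straight to open(), so a file saved in another encoding can be checked by passing it explicitly.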

DaryaZareM commented 1 year ago

Hi, I faced the same problem. How can I solve it?

silviatti commented 1 year ago

@DaryaZareM could you provide more information? Thanks,

Silvia