MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License

problems partitioning custom dataset #101

Open afriedman412 opened 1 year ago

afriedman412 commented 1 year ago

Description

Trying to run optimization on a custom dataset, following this tutorial, raises:

File /usr/local/lib/python3.9/dist-packages/octis/models/pytorchavitm/AVITM.py:77, in AVITM.train_model(self, dataset, hyperparameters, top_words)
     74 self.set_params(hyperparameters)
     76 if self.use_partitions:
---> 77     train, validation, test = dataset.get_partitioned_corpus(use_validation=True)
     79     data_corpus_train = [' '.join(i) for i in train]
     80     data_corpus_test = [' '.join(i) for i in test]

TypeError: cannot unpack non-iterable NoneType object
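The error itself is just Python failing to unpack a `None`: `get_partitioned_corpus` evidently returns `None` when the dataset carries no partition information, so the tuple assignment on line 77 fails. A minimal stand-in (not actual OCTIS code) reproducing the mechanism:

```python
def get_partitioned_corpus(use_validation=True):
    # stand-in for Dataset.get_partitioned_corpus, which apparently
    # returns None when no train/val/test partitions were loaded
    return None

try:
    train, validation, test = get_partitioned_corpus(use_validation=True)
except TypeError as e:
    print(e)  # e.g. "cannot unpack non-iterable NoneType object"
```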

What I Did

Here's the code for creating the custom dataset from a list of strings...

# docs is a list of strings
import pandas as pd
from nltk.tokenize import word_tokenize
from tqdm import tqdm

# collect tokens from every document
tokens = []
for d in tqdm(docs):
    tokens += word_tokenize(d.lower())

# write vocab file
with open("octis_dataset/vocabulary.txt", "w+") as f:
    for s in tqdm(set(tokens)):
        f.write(s + "\n")

# create corpus tsv
df = pd.DataFrame(docs)

# partition
tr_data = df.sample(48500, random_state=420)
te_data = df.query("index not in @tr_data.index").sample(12900, random_state=420)
val_data = df.query("index not in @tr_data.index and index not in @te_data.index")

df = pd.concat([tr_data, te_data, val_data])

# write tsv
df.to_csv("octis_dataset/corpus.tsv", sep="\t", header=None)
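A likely cause (my assumption, based on OCTIS's documented corpus.tsv layout): the file written above has no partition column and keeps the pandas index, so the loader finds no train/val/test splits to return. A minimal sketch with toy documents and hypothetical split sizes, mirroring the sampling above:

```python
import os

import pandas as pd

# Toy stand-in for `docs`; the real list has ~61k documents.
docs = [f"document number {i}" for i in range(10)]
df = pd.DataFrame({"text": docs})

# Tag each split with a partition label ("train"/"val"/"test") -- the
# second column OCTIS appears to expect in corpus.tsv.
tr_data = df.sample(6, random_state=420).assign(partition="train")
te_data = df.drop(tr_data.index).sample(2, random_state=420).assign(partition="test")
val_data = df.drop(tr_data.index).drop(te_data.index).assign(partition="val")

out = pd.concat([tr_data, val_data, te_data])

# header=False and index=False: OCTIS reads column 1 as the document
# text, so a leading index column would shift everything over.
os.makedirs("octis_dataset", exist_ok=True)
out.to_csv("octis_dataset/corpus.tsv", sep="\t", header=False, index=False)
```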

And here is the code to optimize the model...

import time

from octis.optimization.optimizer import Optimizer

optimizer = Optimizer()

start = time.time()
optimization_result = optimizer.optimize(
    model, dataset, coherence, search_space, number_of_call=optimization_runs, 
    model_runs=model_runs, save_models=True, 
    extra_metrics=None, # to keep track of other metrics
    save_path='results/test_neuralLDA/'
)
end = time.time()
duration = end - start
optimization_result.save_to_csv("results_neuralLDA.csv")
print('Optimizing model took: ' + str(round(duration)) + ' seconds.')

And this raises the error.