MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)
MIT License
705 stars, 98 forks

vocabulary: a .txt for custom dataset #92

Closed (SaraAmd closed this issue 1 year ago)

SaraAmd commented 1 year ago

How can we generate the vocabulary file from our CSV/TSV dataset?

silviatti commented 1 year ago

Hi, you can load the TSV file, split each document on whitespace, and save only the unique words. Like this:

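import pandas as pd

dataset_path = "path/to/dataset"  # placeholder: folder that contains corpus.tsv

# The first column of corpus.tsv holds the document text.
df = pd.read_csv(dataset_path + "/corpus.tsv", sep='\t', header=None)

# Collect every unique whitespace-separated word.
vocabulary = set()
for document in df[0].tolist():
    for word in document.split():
        vocabulary.add(word)

# Write one word per line; without the newline all words run together.
with open(dataset_path + "/vocabulary.txt", 'w') as fw:
    for word in sorted(vocabulary):
        fw.write(word + "\n")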
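
Since the question mentions both CSV and TSV, the same idea can be sketched with only the standard library, with the delimiter as a parameter. This is a minimal sketch, not part of OCTIS: the function name `build_vocabulary` and its parameters are hypothetical, and it assumes the document text is in the first column and that tokens are separated by whitespace.

```python
import csv


def build_vocabulary(corpus_path, vocabulary_path, delimiter="\t", text_column=0):
    """Collect the unique whitespace-separated words of one column and
    write them to vocabulary_path, one word per line (hypothetical helper)."""
    vocabulary = set()
    with open(corpus_path, newline="", encoding="utf-8") as fr:
        for row in csv.reader(fr, delimiter=delimiter):
            # Skip empty or short rows that lack the text column.
            if len(row) > text_column:
                vocabulary.update(row[text_column].split())
    with open(vocabulary_path, "w", encoding="utf-8") as fw:
        # Sorting only makes the output file reproducible between runs.
        for word in sorted(vocabulary):
            fw.write(word + "\n")
    return vocabulary
```

For a comma-separated file, pass `delimiter=","`. If your corpus was preprocessed (lowercasing, stop-word removal), run the same preprocessing before building the vocabulary so the two files stay consistent.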

Best,

Silvia