Closed SaraAmd closed 1 year ago
Hi, you can load the TSV file, split each document on whitespace, and keep only the unique words. Like this:
import pandas as pd

# dataset_path is the folder that contains your corpus.tsv
df = pd.read_csv(dataset_path + "/corpus.tsv", sep='\t', header=None)

# Column 0 is assumed to hold the document text
vocabulary = set()
for document in df[0].tolist():
    for word in document.split():
        vocabulary.add(word)

# Write one word per line
with open(dataset_path + "/vocabulary.txt", 'w') as fw:
    for word in vocabulary:
        fw.write(word + "\n")
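If you would rather not depend on pandas, or your file is comma-separated, the same idea can be sketched with the standard library `csv` module. This is only a rough variant, not a fixed recipe: the function name `build_vocabulary` and its parameters (`delimiter`, `text_column`) are made up for illustration, and it assumes the text sits in a single column and the file has no header row.

```python
import csv


def build_vocabulary(corpus_path, vocab_path, delimiter="\t", text_column=0):
    """Collect the unique whitespace-separated words of one column
    and write them to vocab_path, one word per line.

    Hypothetical helper: name and parameters are illustrative only.
    """
    vocabulary = set()
    with open(corpus_path, newline="", encoding="utf-8") as fr:
        for row in csv.reader(fr, delimiter=delimiter):
            # Skip blank rows and rows too short to contain the text column.
            if len(row) <= text_column:
                continue
            vocabulary.update(row[text_column].split())

    # Sorting is optional; it only makes the output file reproducible.
    with open(vocab_path, "w", encoding="utf-8") as fw:
        for word in sorted(vocabulary):
            fw.write(word + "\n")
    return vocabulary
```

For a TSV corpus you would call `build_vocabulary(dataset_path + "/corpus.tsv", dataset_path + "/vocabulary.txt")`, and for a CSV file pass `delimiter=","`. If your file does have a header row, its words will end up in the vocabulary unless you skip that row first.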
Best,
Silvia
How to generate a vocabulary file from our csv / tsv dataset?