Can you tell how the datasets are preprocessed?

MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A Python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

MIT License


Can you tell how the datasets are preprocessed? #93

Closed ERijck closed 1 year ago

ERijck commented 1 year ago

Hi, is it possible to share the preprocessing settings for each dataset? For example, what was the threshold for removing frequent/infrequent words?

silviatti commented 1 year ago

Hi, each dataset folder contains a "metadata.json" file with the preprocessing details (see the field preprocessing-info); the BBC news dataset's folder is one example. As far as I remember, we chose those values by iteratively trying different settings and inspecting the resulting topics.
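For anyone who wants to check these settings programmatically, here is a minimal sketch that reads the preprocessing-info field from a dataset folder's metadata.json. The helper name `load_preprocessing_info` and the example path are hypothetical and not part of OCTIS; the only things assumed from this thread are the file name and the field name, and the layout of the field's contents may differ between datasets.

```python
import json
from pathlib import Path


def load_preprocessing_info(dataset_dir):
    """Return the "preprocessing-info" entry of a dataset's metadata.json.

    Hypothetical helper, not an OCTIS function. Returns None when the
    field is absent from the file.
    """
    metadata_path = Path(dataset_dir) / "metadata.json"
    with metadata_path.open(encoding="utf-8") as f:
        metadata = json.load(f)
    return metadata.get("preprocessing-info")


# Example usage (the path is an assumption; point it at your local copy
# of a dataset folder):
# info = load_preprocessing_info("path/to/dataset_folder")
# print(json.dumps(info, indent=2))
```

Printing the result with `json.dumps(..., indent=2)` should make it easy to compare settings such as frequency thresholds across datasets, if they are recorded there.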

ERijck commented 1 year ago

Thanks, Silviatti!