csv file that contains something like ID, text, labels. This script serves that purpose.

Basically, just run:
python main.py path_to_dir
where path_to_dir is the absolute path to the directory containing the file rcv1.tar.xz. The script writes 2 csv files to path_to_dir:
- rcv1_v2.csv: the main data you are interested in
- rcv1_v2_topics_desc.csv: descriptions of the topics

The content of the text column is raw text in XML format. It can be parsed easily with xml.etree.ElementTree.XML(text).
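
As a rough illustration of that parsing step, here is a minimal sketch that reads rows from a csv file and flattens the XML in the text column into plain text. The helper names (xml_to_plain_text, iter_plain_text) and the sample row are made up for this example, and the header ID,text,labels is only assumed from the description above; check the real header of rcv1_v2.csv before relying on it.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_text):
    """Parse one XML string and return its text content with whitespace collapsed."""
    root = ET.XML(xml_text)
    # itertext() walks every element, so no particular tag names are assumed.
    return " ".join(" ".join(root.itertext()).split())


def iter_plain_text(csv_file, text_column="text"):
    """Yield the plain text of each row read from an open csv file object."""
    for row in csv.DictReader(csv_file):
        yield xml_to_plain_text(row[text_column])


# Made-up sample standing in for rcv1_v2.csv; the real columns and XML layout may differ.
sample_csv = io.StringIO(
    'ID,text,labels\n'
    '1,"<doc><title>Hello</title><p>First paragraph.</p></doc>",C15\n'
)
print(list(iter_plain_text(sample_csv)))
```

For the real output, the same function should work on a file opened with open(path, newline=""), where path points at rcv1_v2.csv inside path_to_dir. If some rows are very long, csv.field_size_limit may need to be raised first.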