ariddell / tatom

Quantitative Text Analysis for the Digital Humanities
https://de.dariah.eu/tatom/

IndexError - Topic Modeling with Mallet Tutorial #16

Closed malletnewbie closed 7 years ago

malletnewbie commented 7 years ago

Hello there,

I am trying to use the code from the tutorial on topic modeling and I am facing a problem I cannot solve on my own:

Traceback (most recent call last):
  File "/.../mallet_python.py", line 47, in <module>
    doctopic[row_num, topic] = share
IndexError: index 14 is out of bounds for axis 1 with size 6

I have copied the code, stored it in a .py file and adjusted the path to the doc-topic-file:

import os
import numpy as np
import itertools
import operator

def grouper(n, iterable, fillvalue=None):
    #Collect data into fixed-length chunks or blocks
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)

doctopic_triples = []

mallet_docnames = []

with open("doc-topics.txt") as f:
    f.readline()            # read one line in order to skip the header
    for line in f:
        docnum, docname, *values = line.rstrip().split('\t')
        mallet_docnames.append(docname)
        for topic, share in grouper(2, values):
            triple = (docname, int(topic), float(share))
            doctopic_triples.append(triple)

#sort the triples
#triple is (docname, topicnum, share) so sort(key=operator.itemgetter(0,1))
#sorts on (docname, topicnum) which is what we want
doctopic_triples = sorted(doctopic_triples, key=operator.itemgetter(0,1))

#sort the document names rather than relying on MALLET's ordering
mallet_docnames = sorted(mallet_docnames)

#collect into a document-topic matrix
num_docs = len(mallet_docnames)

num_topics = len(doctopic_triples) // len(mallet_docnames)

#the following works because we know that the triples are in sequential order
doctopic = np.zeros((num_docs, num_topics))

for triple in doctopic_triples:
    docname, topic, share = triple
    row_num = mallet_docnames.index(docname)
    doctopic[row_num, topic] = share            # error

My doc-topic file has the following structure:

doc_number filename topic share ....

Altogether there are 6 columns with topics and 6 with shares, which makes 12; adding doc_number and filename makes 14. I guess that is what the error is about, but I don't know what I am doing wrong.

Thanks in advance!

ariddell commented 7 years ago

Python uses zero-based numbering, so although you have 14 columns, the last one has index 13.
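
A minimal sketch of the zero-based indexing rule, using a small NumPy array whose second axis has size 6, as in the traceback. The array shape and the helper name `try_assign` are illustrative assumptions and are not part of the tutorial code.

```python
import numpy as np

# Hypothetical stand-in for the doctopic matrix: 2 documents, 6 columns.
doctopic = np.zeros((2, 6))


def try_assign(matrix, row, col, value):
    """Assign value at [row, col]; return the error text if the index is invalid."""
    try:
        matrix[row, col] = value
        return None
    except IndexError as exc:
        return str(exc)


# Columns are numbered 0..5, so 5 is the last valid column index.
ok = try_assign(doctopic, 0, 5, 0.25)

# Index 14 does not exist on an axis of size 6, so NumPy raises IndexError.
err = try_assign(doctopic, 0, 14, 0.25)
print(err)
```

In the script posted above, the column index is the value of `topic` and the axis size is `num_topics`, so the message says that a topic number of 14 was encountered while `num_topics` had been computed as 6.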