biolab / orange3-text

🍊 :page_facing_up: Text Mining add-on for Orange3

Feature request: way to use term-document matrix as input. #512

Closed: jkaupanger closed this issue 4 years ago

jkaupanger commented 4 years ago

I have a csv with tokens as columns and documents as rows. Because of the nature of the data, that's how it is starting; in other words, I don't really have access to the "untokenized" data because it doesn't exist.

As far as I can tell, there isn't a way to input a term-document matrix into Orange. Am I wrong?

Specifically, what I'm trying to do is conduct some topic modelling (HDP and LDA).

ajdapretnar commented 4 years ago

So if I understand correctly, you already have a bag of words representation of your corpus with tokens as columns?

First, no, there's no term-document matrix in Orange. Orange wouldn't know how to work with it anyway (in the current version).

Second, topic modelling works with a document-term matrix or, if none exists, it builds its own. The bag of words matrix is based on tokens.

Your workaround could be using the following Python Script directly after Corpus. It takes existing columns and maps them into tokens and sets the column attribute to bow-feature.

import numpy as np

# build tokens
tokens = []
for doc in in_data.X:
    temp = []
    for i, token in enumerate(doc):
        if not np.isnan(token):
            temp.append(in_data.domain.attributes[i].name)
    tokens.append(temp)

out = in_data.copy()

# tag variables as bow features
for var in out.domain.attributes:
    var.attributes.update({'bow-feature': True})

out.store_tokens(tokens)
out_data = out

I am sure the script could be much nicer, but I was in a rush and haven't had my coffee yet. :)
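For readers who want to try the token-building step outside Orange, here is a minimal sketch of the same idea using plain NumPy. The function name `tokens_from_matrix` and the toy data are hypothetical, made up for illustration; only the logic (a column name becomes a token wherever the cell is not NaN) comes from the script above.

```python
import numpy as np


def tokens_from_matrix(X, names):
    """For each row of X, return the names of the columns whose cell is not NaN.

    Hypothetical helper: mirrors the nested loop in the script above,
    but is not part of Orange.
    """
    X = np.asarray(X, dtype=float)
    names = np.asarray(names, dtype=object)
    present = ~np.isnan(X)  # True where the document has a value for the token
    return [names[row].tolist() for row in present]


# Toy matrix: two documents (rows), three tokens (columns).
X = np.array([[1.0, np.nan, 2.0],
              [np.nan, np.nan, 3.0]])
print(tokens_from_matrix(X, ["apple", "pear", "plum"]))
# [['apple', 'plum'], ['plum']]
```

Inside the Python Script widget, the inputs would presumably be `in_data.X` and `[a.name for a in in_data.domain.attributes]`. Like the original loop, this lists each token once per document, regardless of the count stored in the cell.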

jkaupanger commented 4 years ago

I can never remember which is supposed to come first in the title (document or term), but either way it's all just in a csv file, so it's easy enough for me to just copy + paste to make rows columns and vice versa.

Would doing so make the process easier, or is it roughly the same either way?

ajdapretnar commented 4 years ago

Orange needs the token information. That's how it's built. The above script creates tokens from your terms, so you need it either way.

We could make a widget that would accept an already constructed term matrix, but it would likely be in the Prototypes. If Python Script works for you atm, please go with it. We will discuss if we intend to support this or not. Thanks!
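On the rows-versus-columns question: the script above iterates over the rows of `in_data.X` as documents and reads token names from the columns, so a matrix stored the other way round would need transposing before it is loaded. A minimal sketch with NumPy, using made-up toy data and hypothetical variable names:

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents.
term_doc = np.array([[2.0, np.nan],
                     [np.nan, 1.0],
                     [1.0, 1.0]])
terms = ["apple", "pear", "plum"]   # row labels of term_doc
docs = ["doc_a", "doc_b"]           # column labels of term_doc

# Transposing gives one row per document and one column per term,
# which is the orientation the script above expects.
doc_term = term_doc.T
print(doc_term.shape)  # (2, 3)
```

After transposing, `terms` become the column names and `docs` the row labels.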

jkaupanger commented 4 years ago

Sorry, I wasn't trying to get around using the Python script; I was just wondering if it would make life easier if the tokens were rows or columns.

Two questions: do the blank spots need to have zeroes, or can they be empty? Also, presumably the Bag of Words widget would be the one that would follow the Python Script widget, right?

ajdapretnar commented 4 years ago

Got it. So Orange works with tokens in columns, but that alone is not enough. Most widgets need some hidden information, a dictionary of tokens. This is not visible from the structure of the spreadsheet; it is created behind the scenes. That is why Orange will not work if you just put tokens in columns and feed it to Topic Modeling. It needs that extra information and that script provides it.

Blank spots can be empty. You don't need Bag of Words; you already have the bag of words structure originally.
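A small illustration of why blanks and zeroes are not interchangeable for the script above: its test is `np.isnan`, so a zero-filled cell is not skipped. The sketch below assumes that empty cells arrive as NaN in `in_data.X` (which the script's use of `np.isnan` suggests, but which is not confirmed here); the names and data are made up.

```python
import numpy as np

names = ["apple", "pear", "plum"]
blank_row = np.array([1.0, np.nan, 2.0])  # middle cell left empty (assumed to load as NaN)
zero_row = np.array([1.0, 0.0, 2.0])      # middle cell filled with 0


def row_tokens(row):
    # Same test as the script above: keep the column name unless the cell is NaN.
    return [names[i] for i, value in enumerate(row) if not np.isnan(value)]


def row_tokens_positive(row):
    # Hypothetical variant: treat both NaN and 0 as "token absent".
    return [names[i] for i, value in enumerate(row) if value > 0]


print(row_tokens(blank_row))          # ['apple', 'plum']
print(row_tokens(zero_row))           # ['apple', 'pear', 'plum']
print(row_tokens_positive(zero_row))  # ['apple', 'plum']
```

So with the script as written, a file that uses zeroes for absent tokens would end up with every token in every document; a condition like the one in `row_tokens_positive` is one possible adjustment.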

jkaupanger commented 4 years ago

Okay; thanks for the note about skipping the BoW widget. When I was using it, the Topic Modelling widget would sit there and do nothing.

Now that I've left it out, however, it's... not doing much more: [image] The little status circle around the Topic Modelling widget looks like it's started (which is more than it did before), but it's just stuck at 0% with no indication of how long it's going to take or even if it's working at all.

You probably didn't respond to this post to help some rando troubleshoot their Orange 3 workflow, lol, but any insight that you could provide would be GREATLY appreciated.

ajdapretnar commented 4 years ago

Are you using HDP? This method had some problems, which were fixed in #511. The new version of Text should definitely work.

jkaupanger commented 4 years ago

Okay, I updated to 3.25.0, I'll try again.

In the meantime, I have all of my columns (tokens) identified as Text variables and my row labels (document...labels, I guess) are ignored. Does that make sense, or should I identify the row labels as meta variables or something?

jkaupanger commented 4 years ago

So, I tried it again with it updated: now the progress bar spins around the widget but still doesn't progress past 0%.

ajdapretnar commented 4 years ago

Is there a chance for you to post a small sample data set which fails to work for your case?

jkaupanger commented 4 years ago

sample_data.xlsx (attached). The data is somewhat proprietary; however, in this spreadsheet the document and token names have been replaced with unique IDs, so it behaves exactly like the data I'm working with.

ajdapretnar commented 4 years ago

Yes, that'll do. I'll have a look tomorrow. Thanks!

ajdapretnar commented 4 years ago

I tried with your example and it worked for me. Yes, it did take a long time - your data is big.

You can try shaving off a fraction of the time by casting the matrix to sparse:

import numpy as np
from scipy.sparse import csr_matrix

# build tokens
tokens = []
for doc in in_data.X:
    temp = []
    for i, token in enumerate(doc):
        if not np.isnan(token):
            temp.append(in_data.domain.attributes[i].name)
    tokens.append(temp)

out = in_data.copy()

# tag variables as bow features
for var in out.domain.attributes:
    var.attributes.update({'bow-feature': True})

out.store_tokens(tokens)
out.X = csr_matrix(out.X)
out_data = out

But it is what it is.
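One caveat worth checking when casting to sparse: `csr_matrix` only drops cells that are exactly zero, so blanks that were loaded as NaN are stored explicitly. The sketch below, on made-up toy data, compares the number of stored entries with and without replacing NaN by 0 first. Whether Orange's widgets treat zero-filled cells the same as NaN cells is an assumption that has not been verified here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix: blanks are NaN, as in the script above.
X = np.array([[1.0, np.nan, 2.0],
              [np.nan, np.nan, 3.0]])

as_is = csr_matrix(X)                     # NaN cells are kept as stored entries
zero_filled = csr_matrix(np.nan_to_num(X))  # NaN replaced by 0.0 before casting

print(as_is.nnz, zero_filled.nnz)
```

If the zero-filled form is used, the tokens would still need to be built from the original matrix first, since the script relies on NaN to detect absent tokens.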

ajdapretnar commented 4 years ago

This is a wontfix. We discussed it and it is simply not the right fit for Orange. A script is available in https://github.com/biolab/orange-scripts/pull/3 to bypass the lack of functionality.

jkaupanger commented 4 years ago

Sounds good. Thanks for taking a look!

I don't know if you want me to open another issue for these, but I did have two more questions: