cemoody / lda2vec


LDA2Vec doesn't work at all; does anyone have the correct code for python 3? #84

Open haebichan opened 6 years ago

haebichan commented 6 years ago

LDA2Vec doesn't seem to work at all at this current stage. The Gensim code is outdated, the codebase runs on Python 2.7, and people seem to be having problems with Chainer and other dependencies.

I tried to port the code to Python 3, but I'm hitting walls here and there, especially since I don't know exactly how every function works. Did anyone solve these general issues? Did it actually work for anyone recently?

bosulliv commented 6 years ago

It is quite broken, even on Python 2. I spun up a virtualenv and spent an hour trying to wrestle the latest spaCy API into the code. The problems for me are in preprocess.py: I've updated the model loading to nlp = spacy.load('en') and converted the document attribute arrays to 64-bit integers instead of the 32-bit ones that were overflowing, but it is still producing negative values in the matrix, which fail the assertion. I can't tell if another hour will solve it, so I'm going to carry on improving my LDA, NMF and LSA topic models instead.
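For what it's worth, those negative values are most likely spaCy's 64-bit hash IDs wrapping around when cast to a signed dtype. A minimal illustration (the hash value below is just one of the IDs seen later in this thread):

import numpy as np

# spaCy stores lexeme attributes such as LOWER as 64-bit hash IDs. Many of
# them exceed the signed int64 maximum, so casting to a signed dtype wraps
# them to negative numbers, which is what trips the assertion in preprocess.py.
hashes = np.array([12521213015474045184], dtype=np.uint64)
print(hashes.astype(np.int64))   # prints a large negative value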

haebichan commented 5 years ago

Hey, thanks for responding and confirming it. nlp = spacy.load('en') shouldn't work anyway, since that shortcut is deprecated and was changed to nlp = spacy.load('en_core_web_sm'). But there are so many other problems, I'm not sure it's worth trying to fix everything.
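For reference, a minimal sanity check with a recent spaCy release (assuming the en_core_web_sm package has already been downloaded) looks like this:

import spacy

# The 'en' shortcut link is gone; load the packaged model by its full name.
# Install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')
doc = nlp("Topic models need clean tokens.")
print([token.lower_ for token in doc])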

aleksandra-datacamp commented 5 years ago

If you use np.uint64 as dtype, it works. Preprocess becomes:

# Imports needed for the function below; note that phrase.merge()/ent.merge()
# only exist in spaCy 2.x (they were removed in spaCy 3 in favor of Doc.retokenize()).
import numpy as np
import spacy
from spacy.attrs import LOWER, LIKE_URL, LIKE_EMAIL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    if nlp is None:
        nlp = spacy.load('en_core_web_md')
    data = np.zeros((len(texts), max_length), dtype='uint64')
    skip = np.uint64(skip)
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_URL, LIKE_EMAIL])
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens with the skip marker:
            # select the indices of tokens that are URLs or emails
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat = dat.astype('int64')
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
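For anyone trying this, a hedged usage sketch (the example texts and the batch_size keyword are made up here; en_core_web_md has to be downloaded first, and merge=True only stands a chance on spaCy 2.x since Span.merge() is gone in spaCy 3):

# Hypothetical call, just to show the expected inputs and outputs
texts = ["I like solid state physics.",
         "Send questions to someone@example.com"]
tokens, vocab = tokenize(texts, max_length=10, merge=False, batch_size=2)
print(tokens.shape, tokens.dtype)    # (2, 10) uint64
print(list(vocab.items())[:5])       # 64-bit hash id -> lowercased string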

ghost commented 5 years ago

I can't even successfully execute "python setup.py install". A lot of errors occur in the C++ code: https://github.com/cemoody/lda2vec/issues/86

GregSilverman commented 5 years ago

Here's a port to TensorFlow that allegedly works with Python 3: lda2vec-tf. Here's also a port to PyTorch: lda2vec-pytorch. NB: the pytorch readme says

"Warning: I, personally, believe that it is quite hard to make lda2vec algorithm work. Sometimes it finds a couple of topics, sometimes not. Usually a lot of found topics are a total mess. The algorithm is prone to poor local minima. It greatly depends on values of initial topic assignments."

Not very encouraging, which is kind of disappointing.

MChrys commented 5 years ago

Hello Greg, it's my first post on GitHub, and I was unable to import the original repository (no module lda2vec). I would like to do it with the tensorflow repo, but there is no documentation or example. Could you post the code you used for your own test? It would be awesome!

GregSilverman commented 5 years ago

I haven't actually done anything with it! I was hoping someone else had. ^_^

MChrys commented 5 years ago

ok :) thank you for your answer

nateraw commented 5 years ago

I also have my own tensorflow implementation up, adapted from the one @MChrys linked to. Again, it works, but it is very finicky.

khan04 commented 5 years ago

Hello all,

I was struggling to set up, and also run, some of the functions with Python 3.7. I got it installed, but I'm facing a lot of issues. I could visualize the 20newsgroup data since I have the generated file available. I'm trying to create the file in .npz format, but no luck yet.

Question to Chris: just wondering if you have a working (most recent) version that we can try out? I'm also facing a lot of issues with the CuPy install. Can we run without GPU functionality?
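For what it's worth, Chainer-era code often guards the CuPy import so it can fall back to NumPy on CPU-only machines; a rough sketch of that pattern (not necessarily how this repo is wired) is:

# Optional-GPU import pattern: use CuPy when available, otherwise plain NumPy
try:
    import cupy as xp
    gpu_available = True
except ImportError:
    import numpy as xp
    gpu_available = False

print("running on", "GPU" if gpu_available else "CPU")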

Thank you!

Ahmed Khan

whcjimmy commented 5 years ago

try my fork: https://github.com/whcjimmy/lda2vec.

I've tested the twenty_newsgroups example.

khan04 commented 5 years ago

I will try yours, thank you so much Jimmy.

Just two questions:

1) In the doc in your GitHub, it says the word vector for 'German' is -0.6; any idea how to get that number? Also, on the RHS the document vector is -0.7; how do you get that one as well?

2) I was getting these errors when compiling the preprocess file under the example dir:

File "C:\Users\Administrator\Anaconda3\lib\site-packages\lda2vec\corpus.py", line 159, in finalize
    self.specials_to_compact = {s: self.loose_to_compact[i] for s, i in self.specials.items()}
File "C:\Users\Administrator\Anaconda3\lib\site-packages\lda2vec\corpus.py", line 159, in <dictcomp>
    self.specials_to_compact = {s: self.loose_to_compact[i] for s, i in self.specials.items()}
KeyError: -1

Did you get similar errors as well?

Thanks, AK


whcjimmy commented 5 years ago

My doc follows this repo and I didn't change any details, but I can try to answer your questions.

1) It doesn't mean "German" is -0.6. The whole 1x5 word vector is used to represent the word "German". Maybe the word vector comes from a pre-trained word2vec model, GoogleNews-vectors-negative300.bin; I am not sure. The document vector comes from the document proportions multiplied by the topic matrix, so 0.41 * (-1.9) + 0.26 * 0.96 + 0.34 * (-0.7) = -0.7674, which is close to -0.7 (a small numeric check of this is sketched below).

2) I didn't get this error. However, in corpus.py you can see that the only key less than 0 is -2, which means special tokens (line 140). Maybe you can check why the key -1 is generated.

Hope these answers help you!
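As a minimal numeric sketch of the document-vector arithmetic in point 1 (the proportions and topic values are the ones quoted from the diagram; the single topic dimension is only an illustration):

import numpy as np

# Document proportions over three topics, and one dimension of each topic vector
doc_proportions = np.array([0.41, 0.26, 0.34])
topic_values = np.array([-1.9, 0.96, -0.7])

# This dimension of the document vector = proportions dotted with topic values
print(round(float(doc_proportions @ topic_values), 4))   # -0.7674, i.e. roughly -0.7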

JennieGerhardt commented 4 years ago

When using the file 'preprocess.py', the vocab output looks bad: 12521213015474045184: u"max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'avpvt_%n2ijl8ymd9#oq", 6474950898978915842: u'160k', 13196128760786322950: u'liberty',

lordtt13 commented 4 years ago

If you use np.uint64 as dtype, it works. Preprocess becomes: (quoting the full tokenize workaround posted above)

Tried this out, doesn't work

lordtt13 commented 4 years ago

Basically I have tried everything out in porting it to python 3, and I'm not even able to get the preprocess functions working. Saw this issue and tried out everything here too. Going to use gensim LDA.

duaaalkhafaje commented 3 years ago

Basically I have tried everything out in porting it to python 3, and I'm not even able to get the preprocess functions working. Saw this issue and tried out everything here too. Going to use gensim LDA.

Hello from 2021. I wonder if you have completed the work on LDA2Vec or not, because frankly I have worked on it a lot, but I still face many problems.