haebichan opened this issue 6 years ago
It is quite broken, even on Python 2. I spun up a virtualenv and spent an hour trying to wrestle the latest spaCy API into the code. The problems for me are in preprocess.py: I've updated spaCy to nlp = spacy.load('en') and also converted the document attribute arrays to 64-bit integers instead of the 32-bit ones that were overflowing, but it is still producing negative values in the matrix, which fail the assertion. I can't tell if another hour will solve it, so I'm going to carry on improving my LDA, NMF and LSA topic models instead.
Hey, thanks for responding and confirming it. nlp = spacy.load('en') shouldn't work, since that shortcut is deprecated and changed to nlp = spacy.load('en_core_web_sm'). But there are so many other problems that I'm not sure it's worth trying to fix everything.
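For anyone else hitting the negative-value assertion: in spaCy 2.x, doc.to_array returns 64-bit unsigned hash IDs rather than small indices, so casting them to a signed 32-bit (or even 64-bit) integer can wrap to negative values. A minimal sketch of that, assuming spaCy 2.x and that en_core_web_sm is installed:

import numpy as np
import spacy
from spacy.attrs import LOWER

nlp = spacy.load('en_core_web_sm')   # the old 'en' shortcut is deprecated
doc = nlp("Topic models are fun")
dat = doc.to_array([LOWER])
print(dat.dtype)                     # uint64: 64-bit hash IDs, not small indices
print(dat.astype('int32').min())     # typically wraps to negative values in int32
print(dat.astype('int64').min())     # can still be negative for hashes above 2**63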
If you use np.uint64 as dtype, it works. Preprocess becomes:
# imports needed if you run this standalone (they should already sit at the top of preprocess.py)
import numpy as np
import spacy
from spacy.attrs import LOWER, LIKE_EMAIL, LIKE_URL


def tokenize(texts, max_length, skip=-2, attr=LOWER, merge=False, nlp=None,
             **kwargs):
    if nlp is None:
        nlp = spacy.load('en_core_web_md')
    data = np.zeros((len(texts), max_length), dtype='uint64')
    skip = np.uint64(skip)
    data[:] = skip
    bad_deps = ('amod', 'compound')
    for row, doc in enumerate(nlp.pipe(texts, **kwargs)):
        if merge:
            # from the spaCy blog, an example on how to merge
            # noun phrases into single tokens
            for phrase in doc.noun_chunks:
                # Only keep adjectives and nouns, e.g. "good ideas"
                while len(phrase) > 1 and phrase[0].dep_ not in bad_deps:
                    phrase = phrase[1:]
                if len(phrase) > 1:
                    # Merge the tokens, e.g. good_ideas
                    phrase.merge(phrase.root.tag_, phrase.text,
                                 phrase.root.ent_type_)
                # Iterate over named entities
                for ent in doc.ents:
                    if len(ent) > 1:
                        # Merge them into single tokens
                        ent.merge(ent.root.tag_, ent.text, ent.label_)
        dat = doc.to_array([attr, LIKE_URL, LIKE_EMAIL])
        if len(dat) > 0:
            msg = "Negative indices reserved for special tokens"
            assert dat.min() >= 0, msg
            # Replace email and URL tokens
            # select the indices of tokens that are URLs or Emails
            idx = (dat[:, 1] > 0) | (dat[:, 2] > 0)
            dat = dat.astype('int64')
            dat[idx] = skip
            length = min(len(dat), max_length)
            data[row, :length] = dat[:length, 0].ravel()
    uniques = np.unique(data)
    vocab = {v: nlp.vocab[v].lower_ for v in uniques if v != skip}
    vocab[skip] = '<SKIP>'
    return data, vocab
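For what it's worth, a minimal usage sketch of the patched function above; the example texts and max_length are just made up, and en_core_web_md (or whichever model you load) is assumed to be installed:

texts = ["topic models are neat", "lda2vec combines word2vec and LDA"]
data, vocab = tokenize(texts, max_length=10)
print(data.shape, data.dtype)      # (2, 10) uint64
print(sorted(vocab.values())[:5])  # a few lower-cased tokens, plus '<SKIP>'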
I can't even successfully execute "python setup.py install". A lot of errors occur in the C++ code: https://github.com/cemoody/lda2vec/issues/86
Here's a port to TensorFlow that allegedly works with Python 3: lda2vec-tf. Here's also a port to PyTorch: lda2vec-pytorch. (NB: the PyTorch readme says:
"Warning: I, personally, believe that it is quite hard to make lda2vec algorithm work. Sometimes it finds a couple of topics, sometimes not. Usually a lot of found topics are a total mess. The algorithm is prone to poor local minima. It greatly depends on values of initial topic assignments.")
Not very encouraging, which is kind of disappointing.
Hello Greg, it's my first post on GitHub, and I was unable to import the original repository (no module Lda2vec). I would like to do it with the TensorFlow repo, but there is no documentation or example. Could you post the code you used with your own test? It would be awesome!
I haven't actually done anything with it! I was hoping someone else had. ^_^
ok :) thank you for your answer
I also have my own TensorFlow implementation up, adapted from the one @MChrys linked to. Again, it works, but it is very finicky.
Hello all,
I was struggling to set up the package, and also to run some of the functions with Python 3.7. I got it installed, but I'm facing a lot of issues. I could visualize the 20newsgroups data since I have the generated file available; trying to create the file in .npz format, no luck yet.
Question to Chris: just wondering if you have a working (most recent) version that we can try out? I'm also facing a lot of issues with the CuPy install. Can we run without GPU functionality?
Thank you!
Ahmed Khan
try my fork: https://github.com/whcjimmy/lda2vec.
I've tested the twenty_newsgroups example.
I will try yours, thank you so much Jimmy.
Just two questions:
1) In the doc in your GitHub, it says the word vector for 'German' is -0.6; any idea how to get that number? Also on the RHS, the document vector is -0.7; how do you get that one as well?
2) I was getting these errors when running the preprocess file under the examples dir:
  File "C:\Users\Administrator\Anaconda3\lib\site-packages\lda2vec\corpus.py", line 159, in finalize
    self.specials_to_compact = {s: self.loose_to_compact[i] for s, i in self.specials.items()}
  File "C:\Users\Administrator\Anaconda3\lib\site-packages\lda2vec\corpus.py", line 159, in <dictcomp>
KeyError: -1
Did you get similar errors as well?
Thanks, AK
My doc follows this repo and I didn't change any details, but I can try to answer your questions.
1) It doesn't mean "German" is -0.6. The whole 1 × 5 word vector is used to represent the word "German". Maybe the word vector comes from a pre-trained word2vec model, GoogleNews-vectors-negative300.bin, I am not that sure. The document vector comes from the document proportions multiplied by the topic matrix. So 0.41 × (-1.9) + 0.26 × 0.96 + 0.34 × (-0.7) = -0.7674, which is close to -0.7 (see the small numpy sketch below).
2) I didn't get this error. However, in corpus.py you can find that the only key number less than 0 is -2, which means special tokens (in line 140). Maybe you can check why the key -1 is generated.
Hope these answers help you!
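A quick numpy sketch of that arithmetic, using the proportions and the first component of each topic vector quoted above (illustrative numbers only, taken from the doc's example):

import numpy as np

proportions = np.array([0.41, 0.26, 0.34])        # document's topic proportions
topic_component = np.array([-1.9, 0.96, -0.7])    # first component of each topic vector
doc_component = proportions @ topic_component     # weighted sum over topics
print(round(doc_component, 4))                    # -0.7674, i.e. roughly -0.7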
When using the file 'preprocess.py', the vocab that comes out is bad: 12521213015474045184: u"max>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'ax>'avpvt_%n2ijl8ymd9#oq", 6474950898978915842: u'160k', 13196128760786322950: u'liberty',
If you use np.uint64 as dtype, it works. Preprocess becomes: (same code as posted above)
Tried this out; it doesn't work.
Basically I have tried everything to port it to Python 3, and I'm not even able to get the preprocess functions working. Saw this issue and tried out everything here too. Going to use gensim LDA.
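In case it helps anyone else falling back to gensim, here is a minimal LDA sketch; the toy documents and parameters are placeholders, not anything from lda2vec:

from gensim import corpora, models

# toy tokenized corpus; replace with your own preprocessed documents
docs = [["topic", "models", "are", "fun"],
        ["lda2vec", "mixes", "word2vec", "and", "lda"],
        ["gensim", "lda", "topic", "models"]]

dictionary = corpora.Dictionary(docs)                      # map tokens to ids
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]     # bag-of-words vectors
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)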
Hello from 2021. I wonder if you have completed the work on lda2vec or not, because frankly I have worked on it a lot and I still face many problems.
LDA2Vec doesn't seem to work at all at this current stage. Gensim code is outdated, the general code runs on Python 2.7, and people seem to be having problems with Chainer and other stuff.
I tried to revise the code to Python 3, but I'm hitting walls here and there, especially since I don't know exactly how every function works. Did anyone solve these general issues? Did it actually work for anyone recently?