Closed: shimonShouei closed this issue 1 year ago

Hi, is it possible to transform new docs with the BERTopic model inside the TMT object? I didn't succeed; when I run the transform line it returns the topic list for all the docs from the fit...
Thanks for giving TMT a try. I think the answer is "not easily". However, I wonder if I understand your question. Could you post the code you are referring to above?
Given a fitted TMT object 'tmt2' and the BERTopic model 'bt1' from tmt2, I want to call bt1.transform([new_doc]). I saw it required the embeddings, so I set 'tmt2.embeddings = None', ran 'tmt2.createEmbeddings([new_doc])', and then called 'bt1.transform(tmt2.docs, tmt2.embeddings)', but it returns predictions for all the docs from the training...
So the BERTopic model that is returned should be a fully functioning, 100% BERTopic model, and the issue you are asking about has to do with BERTopic, not TMT. [It would be helpful if you could provide as close to actual code or edited pseudocode as possible, not a summary; it makes it easier for me to read and also to catch logical errors, which I'm wondering about in this case.]
You are using TMT to create embeddings for the new docs and then passing them in. I don't really see why that would be a problem, but to make the issue simpler I suggest you use TMT to get the params you want and then switch entirely over to BERTopic. Something like this:
tmtModel = TMT()
tmtModel.createEmbeddings(OriginalDocs)    # embed the original corpus
tmtModel.reduce()                          # reduce the embeddings with UMAP
... figure out the right parameters for your model using searches ...
btModel = tmtModel.getBERTopicModel()      # BERTopic model built from the tuned parameters
btModel.fit_transform(OriginalDocs)        # fit on the original docs
btModel.transform(NewDocs)                 # predict topics for new, unseen docs
In the above, btModel.transform(NewDocs) will generate the embeddings, run UMAP, and then call HDBSCAN's approximate_predict(), which will give you an approximation of where the NewDocs would have been clustered in the original clustering of OriginalDocs. I don't think you are gaining anything by using TMT to create the embeddings and then passing them to the BERTopic model. BERTopic will run the embedding on the NewDocs if you omit the embeddings in the call to transform, so there is no particular advantage to doing this within TMT and then passing the embeddings in (which should work, but I'm not seeing where the error is yet).
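To make that concrete, here is a minimal sketch (not from the original thread; it assumes btModel is a fitted BERTopic model, NewDocs is a list of strings, and new_doc_embeddings is a hypothetical precomputed embedding array):

# Option 1: let BERTopic embed the new docs itself
topics, probs = btModel.transform(NewDocs)
# Option 2: pass precomputed embeddings, which should give the same result
# topics, probs = btModel.transform(NewDocs, embeddings=new_doc_embeddings)
# Either way there should be exactly one prediction per new document
assert len(topics) == len(NewDocs)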
Let me know if this works. If you have additional questions, providing as much of the code or pseudocode as possible will help.
Thanks, I tried your solution, and it's still the same. This is the code:
btModel = tmt.getBERTopicModel(165, 16)
btModel.fit_transform(df.text.values)
btModel.transform([new_doc])
It returns the predictions for the original docs for some reason
Hmmm... it could be an issue with BERTopic. Why don't you run with just BERTopic (don't worry about the tuning) and see what happens? I can't think of anything in TMT that would cause this off the top of my head, but if it works in BERTopic without TMT then I can take a look.
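A minimal sketch of that sanity check (assuming the same df and new_doc used in the code above):

from bertopic import BERTopic

bt = BERTopic()                            # plain BERTopic, default settings, no tuning
bt.fit_transform(df.text.values)           # fit on the original corpus
topics, probs = bt.transform([new_doc])    # should return a single prediction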
I did it. It works
Can I see your code and data?
This is the colab: https://colab.research.google.com/drive/1vJMC_KCjL8Sv_t8s3X1iC6xt0-1mavrp?usp=sharing and this is the data: annotated_target_topic_data.csv. By the way, TopicTuner doesn't check the coherence of the model, right?
Thank you very much for the responsiveness!!
So I couldn't access your colab, but I was able to reproduce the problem. It is because the UMAP facade class that gets created in TMT fixes its embedding output, so BERTopic never sees the new reduced embeddings; it just gets the old reduced embeddings. I'm finishing up a new version of TMT which does away with this (flawed) method.
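As a simplified illustration of that failure mode (this is not TMT's actual code, just a sketch of how a facade that freezes its output breaks transform on new docs):

class FrozenReducerFacade:
    def __init__(self, reduced_embeddings):
        # the reduced embeddings from the original run are stored once
        self.reduced_embeddings = reduced_embeddings
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # ignores X and always returns the stored reduced embeddings,
        # so BERTopic ends up clustering the original docs again
        return self.reduced_embeddings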
There is a workaround -
1) Tune your tmt model
2) Get a BERTopic model
newBTModel = tmt.getBERTopicModel(<YOUR MIN_CLUSTER_SIZE>, <YOUR MIN_SAMPLES>)
3) Replace the umap facade with the tmt reducer_model
newBTModel.umap_model = tmt.reducer_model
4) run transform
newpreds = newBTModel.transform(newtext)
I tested this and newpreds[1] should have the right number of predictions.
Sorry for the trouble. If it sounds OK to you, I'm going to leave this alone in this version, because there has been a refactoring and a substantial change in the way the BERTopic model is created, which eliminates this problem. So rather than fix it here, I prefer to focus on the new release, where this shouldn't be an issue (I'll add it to the new test cases :) ).
Thanks, it works!! I am so sorry that the colab wasn't authorized! By the way, I needed to fit the newBTModel before transforming the new doc.
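Putting the workaround together with that extra fit step, a minimal end-to-end sketch (the 165/16 values are just the placeholders used earlier in the thread; tmt, df, and new_doc are assumed to already exist):

newBTModel = tmt.getBERTopicModel(165, 16)       # your tuned min_cluster_size / min_samples
newBTModel.umap_model = tmt.reducer_model        # bypass the facade, use the reducer directly
newBTModel.fit_transform(df.text.values)         # the new model still needs one fit on the original corpus
topics, probs = newBTModel.transform([new_doc])  # now returns one prediction per new doc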
Great. Glad that worked.
I'm closing this now as it is fixed in the new release.