drob-xx / TopicTuner

HDBSCAN Tuning for BERTopic Models
GNU General Public License v3.0
42 stars · 1 fork

transform new docs #5

Closed shimonShouei closed 1 year ago

shimonShouei commented 1 year ago

Hi, is it possible to transform new docs with the BERTopic model inside the tmt object? I didn't succeed; when I run the transform line it returns the topic list for all of the docs from the fit...

drob-xx commented 1 year ago

Thanks for giving TMT a try. I think the answer is "not easily". However, I wonder if I understand your question. Could you post the code you are referring to above?

shimonShouei commented 1 year ago

Given a fitted tmt object `tmt2` and the BERTopic model `bt1` taken from `tmt2`, I want to call `bt1.transform([new_doc])`. I saw it required the embeddings, so I set `tmt2.embeddings = None`, ran `tmt2.createEmbeddings([new_doc])`, and then called `bt1.transform(tmt2.docs, tmt2.embeddings)`, but it returns predictions for all the docs from the training...

drob-xx commented 1 year ago

So the BERTopic model that is returned should be a fully functioning, 100% BERTopic model, and the issue you are asking about has to do with BERTopic, not TMT. [It would be helpful if you could provide actual code, or close-to-actual pseudocode, rather than a summary; it makes it easier for me to read and also to catch logical errors, which I'm wondering about in this case.]

You are using TMT to create embeddings for the new docs and then passing them in. I don't really see why that would be a problem but to make the issue simpler I suggest you simply use TMT to get the params you want, then switch entirely over to BERTopic. What I suggest is that you do something like this:

```python
tmtModel = TMT()
tmtModel.createEmbeddings(OriginalDocs)
tmtModel.reduce()

# ... figure out the right parameters for your model using searches ...

btModel = tmtModel.getBERTopicModel()
btModel.fit_transform(OriginalDocs)
btModel.transform(NewDocs)
```

In the above, `btModel.transform(NewDocs)` will generate the embeddings, run UMAP, and then call HDBSCAN's `approximate_predict()`, which gives you an approximation of where the NewDocs would have been clustered in the original clustering of OriginalDocs. I don't think you gain anything by using TMT to create the embeddings and then passing them in to the BERTopic model. BERTopic will embed the NewDocs itself if you omit the embeddings in the call to `transform`; there is no particular advantage to creating them in TMT and passing them in (which should work, but I'm not seeing where the error is yet).
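To make the flow above concrete, here is a toy sketch of what `transform` conceptually does with new docs. None of this is BERTopic's actual code: the embedder, reducer, and nearest-centroid stand-in for HDBSCAN's `approximate_predict()` are all hypothetical simplifications, just to show that the prediction count should match the number of *new* docs.

```python
import numpy as np

def embed(docs):
    """Hypothetical embedder: deterministic pseudo-vectors per doc."""
    vectors = []
    for doc in docs:
        rng = np.random.default_rng(abs(hash(doc)) % (2 ** 32))
        vectors.append(rng.normal(size=8))
    return np.vstack(vectors)

class ToyReducer:
    """Stand-in for a fitted UMAP model: a fixed random projection."""
    def __init__(self, dim_in=8, dim_out=2, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.normal(size=(dim_in, dim_out))

    def transform(self, X):
        return X @ self.proj

def approximate_predict_toy(centroids, points):
    """Stand-in for hdbscan.approximate_predict: nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

# "Fit": reduce the original docs and compute cluster centroids.
original_docs = ["doc a", "doc b", "doc c", "doc d"]
reducer = ToyReducer()
reduced = reducer.transform(embed(original_docs))
labels = np.array([0, 0, 1, 1])  # pretend clustering result
centroids = np.vstack([reduced[labels == k].mean(axis=0) for k in (0, 1)])

# "Transform" on new docs: one prediction per NEW doc, not per original doc.
new_docs = ["something new"]
preds = approximate_predict_toy(centroids, reducer.transform(embed(new_docs)))
print(len(preds))  # 1 — one prediction, matching the number of new docs
```

The point of the sketch is the last line: a correctly wired pipeline reduces the *new* docs through the fitted reducer, so the output length equals `len(new_docs)`, not `len(original_docs)`.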

Let me know if this works. If you have additional questions, providing as much of the code or pseudocode as possible will help.

shimonShouei commented 1 year ago

Thanks, I tried your solution, and it's still the same. This is the code:

```python
btModel = tmt.getBERTopicModel(165, 16)
btModel.fit_transform(df.text.values)
btModel.transform([new_doc])
```

It returns the predictions for the original docs for some reason.

drob-xx commented 1 year ago

Hmmm... It could be an issue with BERTopic. Why don't you run with just BERTopic (don't worry about the tuning) and see what happens. I can't think of anything that would cause this in TMT off the top of my head. But if it works in BERTopic without TMT then I can take a look.

shimonShouei commented 1 year ago

I did it. It works

drob-xx commented 1 year ago

Can I see your code and data?

shimonShouei commented 1 year ago

This is the colab: https://colab.research.google.com/drive/1vJMC_KCjL8Sv_t8s3X1iC6xt0-1mavrp?usp=sharing and this is the data: annotated_target_topic_data.csv. By the way, TopicTuner doesn't check the coherence of the model, right?

thank you very much for the responsiveness!!

drob-xx commented 1 year ago

So I couldn't access your colab, but I was able to reproduce the problem. It is because the UMAP facade class that gets created in TMT fixes its embedding output, so BERTopic never sees the new reduced embeddings; it just gets the old reduced embeddings. I'm finishing up a new version of TMT which does away with this (flawed) method.
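For anyone curious why this produced predictions for the training docs, here is a minimal, hypothetical illustration of the bug pattern described above (not TMT's actual facade code): a reducer whose `transform` ignores its input and returns a stored result, so downstream code always gets back the original reduction.

```python
class FixedOutputFacade:
    """Hypothetical UMAP facade that ignores its input and always returns
    the reduced embeddings it was constructed with (the bug pattern)."""

    def __init__(self, stored_reduced_embeddings):
        self.stored = stored_reduced_embeddings

    def transform(self, X):
        # Bug: X is ignored, so new docs still yield the OLD reduction.
        return self.stored

# The original fit produced 4 reduced embeddings.
facade = FixedOutputFacade(stored_reduced_embeddings=[[0.1, 0.2]] * 4)

# "Transforming" 1 new doc still gets 4 rows back, which is why
# predictions came back for all the training docs.
new_doc_embedding = [[0.9, 0.9]]
print(len(facade.transform(new_doc_embedding)))  # 4, not 1
```

Replacing such a facade with the real fitted reducer (the workaround below) restores the expected behavior: the output size tracks the input size.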

There is a workaround -

1) Tune your tmt model.
2) Get a BERTopic model: `newBTModel = tmt.getBERTopicModel(<YOUR MIN_CLUSTER_SIZE>, <YOUR MIN_SAMPLES>)`
3) Replace the UMAP facade with the tmt reducer model: `newBTModel.umap_model = tmt.reducer_model`
4) Run transform: `newpreds = newBTModel.transform(newtext)`

I tested this and newpreds[1] should have the right number of predictions.

Sorry for the trouble. If it sounds OK to you, I'm gonna leave this alone in this version, because there has been a refactoring and a substantial change in the way the BERTopic model is created which eliminates this problem. So rather than fix it in this version, I prefer to focus on the new release, where this shouldn't be a problem (I'll add it to the new test cases :) ).

shimonShouei commented 1 year ago

Thanks, it works!! I am so sorry that the colab wasn't authorized! By the way, I needed to fit the newBTModel before transforming the new doc.

drob-xx commented 1 year ago

Great. Glad that worked.

drob-xx commented 1 year ago

I'm closing this now as it is fixed in the new release.