MaartenGr / KeyBERT

Minimal keyword extraction with BERT
https://MaartenGr.github.io/KeyBERT/
MIT License

Can embeddings used to calculate the keyword to document distance be exposed as part of KeyBERT API? #113

Open Xin-Rong opened 2 years ago

MaartenGr commented 2 years ago

That is currently not possible, as it would require saving quite a number of embeddings in KeyBERT, which can quickly grow in size. Moreover, returning them as part of the API, for example through extract_keywords, would make the API less user-friendly due to the growing number of hyperparameters and return options.

turian commented 2 years ago

I would like this feature too. Consider adding a SECOND method instead of cluttering extract_keywords.

The reason is that if you are doing keyphrase extraction over many documents, there needs to be a way to manipulate MMR etc. across multiple docs.

turian commented 2 years ago

Or, can you propose a workaround? e.g. a posthoc embedding extraction approach?

MaartenGr commented 2 years ago

Just to be sure I understood you correctly. You would like a method for extracting both document and word embeddings such that you can easily try out different parameters for extracting keywords right? Could you provide me with an example of how you would like to use this? Also, is this the same request as https://github.com/MaartenGr/KeyBERT/issues/127?

turian commented 2 years ago

It seems similar, but I don't completely understand what https://github.com/MaartenGr/KeyBERT/issues/127 is asking for.

What I would propose is a low-level method extract_keywords_full_return which returns everything, while extract_keywords drops the returns it doesn't need.

"Could you provide me with an example of how you would like to use this?" I have 2000 documents (URLs I have bookmarked, converted from HTML to text, with boilerplate removed). I would like to autotag them, and I want to play around with the autotag hyperparameters. Also, if I add a new bookmark, I don't want my tag set to change entirely. Does this make sense?

MaartenGr commented 2 years ago

I am not entirely sure if I understood correctly. Do you mean something like this:

kw_model = KeyBERT()
metadata = kw_model.extract_metadata()
keywords = kw_model.extract_keywords(metadata=metadata)

turian commented 2 years ago

Rather, this:

kw_model = KeyBERT()
keywords, embeddings = kw_model.extract_keywords_full_return()

MaartenGr commented 2 years ago

Ah right, so assuming embeddings represent the embeddings of the documents, how would it help with your original question?

The reason for this is because if you are doing keyphrase extraction over many documents, there needs to be a way to manipulate MMRs etc across multiple docs.

The reason I am asking is that even if you return the embeddings, there is currently no feature in KeyBERT that directly takes those embeddings and applies different parameters for, for example, MMR.
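To illustrate the kind of posthoc manipulation being discussed, here is a minimal pure-Python sketch of applying MMR directly to embeddings, assuming a hypothetical full-return method had exposed them. The cosine and mmr functions below are illustrative helpers written for this example, not KeyBERT API:

```python
import math

def cosine(a, b):
    # Cosine similarity between two plain-list vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.5):
    # Relevance of each candidate keyword to the document.
    relevance = [cosine(c, doc_embedding) for c in candidate_embeddings]
    # Start with the most relevant candidate.
    selected = [max(range(len(candidates)), key=lambda i: relevance[i])]
    while len(selected) < min(top_n, len(candidates)):
        best, best_score = None, -float("inf")
        for i in range(len(candidates)):
            if i in selected:
                continue
            # Penalise similarity to already-selected keywords.
            redundancy = max(
                cosine(candidate_embeddings[i], candidate_embeddings[j])
                for j in selected
            )
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [candidates[i] for i in selected]
```

With embeddings in hand, sweeping the diversity parameter across many documents becomes a cheap loop with no re-embedding, which is the experimentation workflow described above.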

turian commented 2 years ago

Since I am an experienced ML practitioner, I would manipulate the embeddings directly for my downstream use case.

MaartenGr commented 2 years ago

Alright, I will do some experimentation to see what fits best with different use cases, like the issue I referred to before. It would be ideal to combine this with that suggested feature, splitting embedding generation from keyword extraction for easier experimentation.

MaartenGr commented 1 year ago

With the new v0.7 release, you can now prepare the embeddings beforehand in order to speed up fine-tuning the parameters of your model.
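A rough sketch of that v0.7 workflow, based on the extract_embeddings method added in that release. The document list here is a placeholder, and exact signatures should be checked against the KeyBERT documentation:

```python
from keybert import KeyBERT

docs = ["supervised machine learning is a subfield of AI",
        "keyword extraction pulls salient terms from text"]

kw_model = KeyBERT()

# Compute document and word embeddings once...
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# ...then reuse them while tuning parameters, e.g. MMR diversity,
# without re-embedding the documents on every run.
for diversity in (0.2, 0.5, 0.8):
    keywords = kw_model.extract_keywords(
        docs,
        doc_embeddings=doc_embeddings,
        word_embeddings=word_embeddings,
        use_mmr=True,
        diversity=diversity,
    )
```

Note that the embeddings passed back in must come from the same model and vectorizer settings used at extraction time, otherwise the shapes will not line up.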