MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

BERTopic: get_representative_docs(...) | Option to get all docs mapped to a topic beyond the default randomly selected #3 #811

Closed · semmyk-research closed this issue 1 year ago

semmyk-research commented 2 years ago

_save_representative_docs() | get_representative_docs()

As I understand it, the flow in _bertopic.py is as follows: fit_transform(...) --> _cluster_embeddings(...) --> if hdbscan_model --> _save_representative_docs(...) --> {representative_docs}

Within the 'private' function _save_representative_docs(...), 3 documents are randomly mapped to each topic. When get_representative_docs(...) is called, it retrieves the 3 docs persisted to the internal dict representative_docs_.

In essence, get_representative_docs() randomly retrieves 3 documents for a selected topic.
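(For reference, a minimal sketch of the current behaviour, assuming `docs` is your list of documents and topic 5 is just an illustrative topic id:)

```python
from bertopic import BERTopic

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Returns only the few (by default 3) documents stored as representative of topic 5
repr_docs = topic_model.get_representative_docs(topic=5)
```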

**Use case**: In academic research, when generating topics and clusters, it is helpful for researchers to be able to map and retrieve the documents belonging to each topic/cluster.

The benefit of this is that researchers can **retrieve articles/abstracts for topics**
they are interested in, from a large corpus of literature search results.

**Approach** [1] In _save_representative_docs(), we could perhaps keep all docs rather than randomly 'pick' 3.

Concern: for large datasets, we might have a memory burden.

We can, of course, keep the current default of 3 random docs as-is, and add an option for the number of docs (nr_docs) or all docs (all).
This might require adjusting fit_transform; my view is that this might not be 'intuitive'.

[2] In get_representative_docs(...), we could simply have: `get_representative_docs(self, topic: int = None, nr_docs: int = None) -> List[str]`

For this, we test argument conditionality with appropriate argument checking.
Perhaps check_is_fitted() from bertopic._utils might suffice.

Alternatively, we could simply define a get_docs():

`get_docs(self, topic: int = None, nr_docs: int = None) -> List[str]`

which would instantiate a _save_docs (_this might create multiple points of edit & failure_) and return nr_docs documents.
NB: if nr_docs is None -> 3 docs | elif nr_docs is 'All' -> all docs for the topic | else -> nr_docs docs. (A rough sketch of this idea follows below.)
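For illustration only, a minimal sketch of such an extension, not BERTopic's API: the subclass name, the extra `docs` parameter, and the slicing logic are all assumptions; BERTopic itself only stores per-document topic assignments in `topics_`, so the original documents have to be passed in:

```python
from typing import List, Union

from bertopic import BERTopic


class BERTopicWithDocs(BERTopic):
    """Hypothetical subclass: retrieve the documents mapped to a topic."""

    def get_docs(self, docs: List[str], topic: int = None,
                 nr_docs: Union[int, str] = None) -> List[str]:
        # `docs` must be the same list (same order) passed to .fit / .fit_transform,
        # since BERTopic only keeps per-document topic assignments in `topics_`.
        selected = [doc for doc, t in zip(docs, self.topics_)
                    if topic is None or t == topic]
        if nr_docs is None:
            return selected[:3]        # mirror the current default of 3
        if isinstance(nr_docs, str) and nr_docs.lower() == "all":
            return selected            # all docs mapped to the topic
        return selected[:nr_docs]      # a specific number of docs
```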

I'm testing locally by extending the BERTopic class | snapshot of thought.

drob-xx commented 2 years ago

bertmodel.topics is a list with all document assignments. They are in the same order as the input docs. Is there a reason you can't use that?

semmyk-research commented 2 years ago

@drob-xx Thanks so much. Deeply appreciated. That was a great pointer; I had overlooked its importance. Looking again, I can see how self.topics_ gets initialised and 'updated'.

Attributes:
    topics_ (List[int]) : The topics that are generated for each document after training or updating
                          the topic model. The most recent topics are tracked.

Given the use case and your pointer (to topics_), I'll revisit the alternative I suggested earlier. It seems it might be more appropriate for the use case. I need not continue tampering with _save_representative_docs (although it worked for me).

As I understand it, an underlying philosophy of BERTopic (which carries over to KeyBERT) is to use ... basic, but powerful methods.

[Back to the use case] Unlike a computing researcher (my background after my engineering training), a typical IS (information systems) researcher, which I am in part, will not be keen on messing around with 'code'. The same is somewhat applicable to social sciences/humanities researchers.
On that premise, and given what I assume BERTopic's philosophy to be, it might make logical sense to provide a method to retrieve the documents mapped to each topic.

[Output from extending the BERTopic class] I'll be glad to do a pull request. (screenshot attached)

drob-xx commented 2 years ago

@semmyk-research Glad that was useful. In terms of extending functionality that is way above my pay grade and something to take up with @MaartenGr. In terms of the functionality you are looking for here's some code that should work - I haven't run this but beyond some minor typos and syntax issues you should be able to run it without problems:

> indicate a topic or list of topics interested in

```python
interestingTopics = [3, 20, 5]
interestingDocumentsDF = myDF[myDF['topics'].isin(interestingTopics)]
interestingDocumentsDF.head()
```


> extract the documents for those topics

```python
extractedDocs = interestingDocumentsDF['doc'].tolist()
```
> display the Dataframe

at this point extractedDocs would be a list of [str]. If you wanted it to be a DataFrame, you could just use interestingDocumentsDF.

> allow saving the Dataframe to a file for further processing: such as csv or any other.

```python
interestingDocumentsDF.to_csv('a_path_and_file_name')
```

to retrieve:

```python
interestingDocumentsDF = pd.read_csv('a_path_and_file_name')
```
> continue to work within Python for further comparison and extraction: such as doi et al from their initial literature search results.

You would capture that information above in separate columns of myDF.
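For context, a minimal sketch of how the `myDF` used above might be built (the column names 'doc' and 'topics' are just the ones assumed in the snippets above; `docs` is the list passed to fit_transform):

```python
import pandas as pd

# One row per input document, paired with its assigned topic
myDF = pd.DataFrame({"doc": docs, "topics": topic_model.topics_})
```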
semmyk-research commented 2 years ago

@drob-xx Thanks. I'll take a look at your sample code later in the day. I just realised I did not share my proposed get_docs: I shared the snapshot only.

@MaartenGr: @drob-xx was spot on. I'm exploring the possibility of extending functionality. I've found BERTopic interesting and useful, and I've spotted some areas where I could chip in. This is one of them. I'll open another issue for 'extending' topics_over_time.

[get_docs]

```python
# TODO:
if isinstance(topic, int) and isinstance(nr_docs, int):
    return ...
```

MaartenGr commented 2 years ago

@semmyk-research Thank you for taking the time to go through all of this and exploring possible options.

A quick note: the .get_representative_docs function extracts the documents that are the most representative of a cluster according to the internal structure of HDBSCAN. Although we can increase that value of 3 documents, setting it to the size of a cluster will not give back all documents, since not all documents are equally representative.

With respect to the proposed get_docs, I am not entirely sure that it warrants an entire function when the topics are already returned with:

```python
topics, probs = topic_model.fit_transform(docs)
```

This also follows the scikit-learn convention of its transformers and as mentioned before, using topic_model.topics_ is also an option. Together, we can create a one-liner:

```python
import pandas as pd

# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})

# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
```

I am not sure about creating another function for what is essentially a one-liner. Although it is nice to have more features, it may result in choice overload and may inhibit the ease with which one finds the functions that suit one's needs.

semmyk-research commented 2 years ago

@MaartenGr Hmm, I see where you're coming from. Valid reason and solid point you've got there: "... inhibit the ease with which you find functions." I guess that also speaks to #agility, #fluidity. get_docs, for me, was more about leveraging BERTopic for rapid literature review. Of course, BERTopic has a vast application range.

@drob-xx @MaartenGr Thanks so much for engaging. I get to see BERTopic's versatility and robustness. The key takeaway for me ...

MaartenGr commented 1 year ago

With the release of v0.13, there is now the option to get metadata from documents using .get_document_info(docs). This should make it much easier to get the data users are looking for without the need to go through pandas manipulation.
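For example, something along these lines should work (treating the "Topic" and "Document" column names of the returned DataFrame as the v0.13 defaults, and topic 5 as an illustrative topic id):

```python
# One row of metadata per input document
doc_info = topic_model.get_document_info(docs)

# All documents assigned to topic 5
docs_in_topic = doc_info[doc_info["Topic"] == 5]["Document"].tolist()
```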