Closed semmyk-research closed 1 year ago
bertmodel.topics_ is a list with all document assignments. They are in the same order as the input docs. Is there a reason you can't use that?
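Because the assignments are in the same order as the input docs, pairing the two lists is enough to recover every document per topic. A minimal sketch, assuming stand-in lists (the `docs` and `topics` values below are made up; in practice `topics` would be the fitted model's `topics_`):

```python
from collections import defaultdict

# Stand-in values: `docs` is the list passed to fit/fit_transform and
# `topics` plays the role of the fitted model's `topics_` attribute.
docs = ["doc a", "doc b", "doc c", "doc d"]
topics = [0, -1, 0, 1]  # -1 is the outlier topic in BERTopic

# Position i in `topics` is the assignment for docs[i], so zip pairs them up.
docs_by_topic = defaultdict(list)
for doc, topic in zip(docs, topics):
    docs_by_topic[topic].append(doc)
```

Here `docs_by_topic[0]` holds every document assigned to topic 0, not just a few representative ones.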
@drob-xx Thanks so much. Deeply appreciated. That was a great pointer: I had overlooked its importance. Looking again, I can see how self.topics_ gets initialised and 'updated'.
Attributes:
topics_ (List[int]) : The topics that are generated for each document after training or updating
the topic model. The most recent topics are tracked.
Given the use case and your pointer (to topics_), I'll revisit the alternative I suggested earlier. It seems it might be more appropriate for the use case. I need not continue tampering with _save_representative_docs (although it worked for me).
As I understand it, an underlying philosophy of BERTopic (which carries over to KeyBERT) is to ... basic, but powerful methods.
[Back to the use case]
Unlike a computing researcher (my background after my engineering training), a typical IS (information systems) researcher, which I am in part, will not be keen on messing around with code. The same applies, to some extent, to social sciences and humanities researchers.
On that premise and what I assume BERTopic's philosophy to be, it might make logical sense to provide a method to
[Output from Extend BERTopic Class] I'll be glad to do a pull request.
@semmyk-research Glad that was useful. As for extending functionality, that is way above my pay grade and something to take up with @MaartenGr. As for the functionality you are looking for, here's some code that should work. I haven't run this, but beyond some minor typos and syntax issues you should be able to run it without problems:
import pandas as pd

myDF = pd.DataFrame()
myDF['doc'] = docs
myDF['topics'] = mymodel.topics_
> indicate a topic or list of topics interested in
interestingTopics = [3, 20, 5]
interestingDocumentsDF = myDF[myDF['topics'].isin(interestingTopics)]
interestingDocumentsDF.head()
> extract the documents for those topics
`xtractedDocs = interestingDocumentsDF['doc'].tolist()`
> display the Dataframe
At this point xtractedDocs is a list of str. If you want a DataFrame instead, you can just use interestingDocumentsDF.
> allow saving the Dataframe to a file for further processing: such as csv or any other.
`interestingDocumentsDF.to_csv('a_path_and_file_name')`
To retrieve:
`interestingDocumentsDF = pd.read_csv('a_path_and_file_name')`
> continue to work within Python for further comparison and extraction: such as doi et al from their initial literature search results.
You would capture that information above in separate columns of myDF.
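For anyone who would rather not depend on pandas, the same idea can be sketched with the standard library: keep one record per document and carry the DOI in its own field. Everything below is illustrative. The documents, DOIs, and topic numbers are placeholders, and `topics` stands in for `mymodel.topics_`.

```python
import csv
import io

# Stand-in data (all hypothetical): `docs` is the list given to the model,
# `topics` plays the role of `mymodel.topics_`, and the DOIs are placeholders.
docs = ["solar cell study", "wind turbine study", "topic model survey"]
dois = ["10.0000/example.1", "10.0000/example.2", "10.0000/example.3"]
topics = [3, 3, 20]

# One record per document; position i of each list describes the same document.
records = [
    {"doc": doc, "doi": doi, "topic": topic}
    for doc, doi, topic in zip(docs, dois, topics)
]

# Keep only the records whose topic is of interest.
interesting_topics = {3}
selected = [row for row in records if row["topic"] in interesting_topics]

# Write to an in-memory buffer; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["doc", "doi", "topic"])
writer.writeheader()
writer.writerows(selected)

# Read the rows back. The csv module returns every value as a string.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
selected_dois = [row["doi"] for row in restored]
```

Note that the topic comes back as the string "3" rather than the integer 3, so convert it with int() before comparing it with topic numbers again.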
@drob-xx Thanks. I'll take a look at your sample code later in the day. I just realised I did not share my proposed get_docs: I shared the snapshot only.
@MaartenGr : @drob-xx was spot on. I'm exploring the possibility of extending functionality. I've found BERTopic interesting and useful. I've spotted some areas I could chip in. This is one of them. I'll open another issue for 'extending' topic_over_time
[get_docs]
##// TODO:
isinstance(topic, int) and isinstance(nr_docs, int):
return ... ...
@semmyk-research Thank you for taking the time to go through all of this and exploring possible options.
A quick note: the .get_representative_docs function extracts the documents that are most representative of a cluster according to the internal structure of HDBSCAN. Although we can increase that value beyond 3 documents, setting it to the size of a cluster will not give back all documents, since not all documents are equally representative.
With respect to the proposed get_docs, I am not entirely sure that it warrants an entire function when the topics are already returned with:
topics, probs = topic_model.fit_transform(docs)
This also follows the scikit-learn convention for its transformers and, as mentioned before, using topic_model.topics_ is also an option. Together, we can create a one-liner:
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topics})
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
I am not sure about creating another function for what is essentially a one-liner. Although it is nice to have more features, it may result in choice overload and may make it harder to find the functions that suit one's needs.
@MaartenGr Hmm, I see where you're coming from. Valid reason and solid point you've got there: ... inhibit the ease with which you find functions. I guess that also speaks to #agility, #fluidity. For me, though, get_docs was more about leveraging BERTopic for rapid literature review. Of course, BERTopic has a vast application range.
@drob-xx @MaartenGr Thanks so much for engaging. I get to see BERTopic's versatility and robustness. The key takeaway for me
With the release of v0.13, there is now the option to get metadata from documents using .get_document_info(docs). This should make it much easier to get the data users are looking for without the need to go through some pandas manipulation.
_save_representative_docs() | get_representative_docs()
As I understand it, we have a flow as follows:
_bertopic.py: fit_transform(..) --> _cluster_embeddings(...) --> if(hdbscan_model) --> _save_representative_docs(...) --> {representative_docs}
Within the 'private' function _save_representative_docs(...), #3 documents are #randomly mapped to a topic. When get_representative_docs(...) is called, it retrieves the #3 docs persisted to the global dict representative_docs_.
Use case: In academic research, when generating topics and clustering, it helps researchers to map and retrieve the documents for each topic/cluster.
Approach: [1] in _save_representative_docs(), we could perhaps keep all docs rather than randomly 'pick' #3.
Concern: for large datasets, we might have a memory burden.
[2] in get_representative_docs(...), we can simply have: get_representative_docs(self, topic: int = None, nr_docs: int = None) -> List[str]:
Alternatively, we simply define a get_docs()
I'm testing locally by extending the BERTopic class | snapshot of thought.
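To make the get_docs() idea concrete, here is a minimal sketch written as a standalone function rather than a change to the BERTopic class. It is not the implementation tested locally and not part of BERTopic. The name, the parameters (mirroring the `topic` and `nr_docs` arguments suggested above), and the sample data are all assumptions for illustration. It relies only on `topics` being aligned with `docs`.

```python
from typing import List, Optional, Sequence


def get_docs(
    docs: Sequence[str],
    topics: Sequence[int],
    topic: Optional[int] = None,
    nr_docs: Optional[int] = None,
) -> List[str]:
    """Return the documents assigned to `topic` (all documents if None).

    Hypothetical helper, not a BERTopic method. `topics` must be aligned
    with `docs`, i.e. topics[i] is the topic assigned to docs[i].
    """
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    selected = [d for d, t in zip(docs, topics) if topic is None or t == topic]
    # Optionally cap the number of documents returned.
    return selected if nr_docs is None else selected[:nr_docs]


# Stand-in data; in practice `topics` would come from the fitted model.
docs = ["doc a", "doc b", "doc c", "doc d"]
topics = [0, 1, 0, 0]

all_topic_0 = get_docs(docs, topics, topic=0)
first_two = get_docs(docs, topics, topic=0, nr_docs=2)
```

On the memory concern: the returned list holds references to the existing strings rather than copies of the text, so nothing extra is stored per topic until a caller asks for it.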