Open · wuziyigit opened 5 months ago
There is a bunch of code in that issue, so I'm not sure which you are referring to. Could you share it?
I tried the following:
I first ran the basic BERTopic model:
```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_model = CountVectorizer(stop_words="english")
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
topic_model = BERTopic(
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model,
    calculate_probabilities=True,
    verbose=True,
)
topics, probs = topic_model.fit_transform(enr_df_docs)
```
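As a quick sanity check (my own addition, not part of the linked issue), `probs` should be an m x n matrix of document-topic probabilities, matching the docstring below:

```python
# probs: one row per document, one column per topic
print(probs.shape)
```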
I then ran the `estimate_effect` function from the comment:
```python
from typing import Any, Callable, List, Mapping, Union

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.base.wrapper as wrap
import statsmodels.formula.api as smf
from sklearn.metrics.pairwise import cosine_similarity


def estimate_effect(topic_model,
                    docs: List[str],
                    topics: Union[int, List[int]],
                    metadata: pd.DataFrame,
                    y: str = "prevalence",
                    probs: np.ndarray = None,
                    estimator: Union[str, Callable] = None,
                    estimator_kwargs: Mapping[str, Any] = None) -> List[wrap.ResultsWrapper]:
    """ Estimate the effect of metadata on topic prevalence and topic content

    Arguments:
        docs: The original list of documents on which the model was trained
        topics: The topic(s) for which you want to estimate the effect of metadata
        metadata: The metadata in a dataframe. Make sure that the columns have the exact
                  same names as the elements in the estimator
        y: The target, either "prevalence" (topic prevalence) or "content" (topic content)
        probs: An m x n probability matrix, where *m* is the number of documents and
               *n* the number of topics. It represents the probabilities of all topics
               across all documents.
        estimator: Either the formula used in the estimator or a custom estimator.
                   When it is used as a formula, it follows R-style formulas, for example:
                     * 'prevalence ~ rating'
                     * 'prevalence ~ rating + day + rating:day'
                   Make sure that the target is either 'prevalence' or 'content'.
                   The custom estimator should be a `statsmodels.formula.api`; currently,
                   `statsmodels.api` is not supported.
        estimator_kwargs: The arguments needed within the estimator; needs at
                          least a "formula" argument

    Returns:
        fitted_estimators: List of fitted estimators for either topic prevalence or topic content
    """
    data = metadata.loc[::]
    data["topics"] = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
    data["docs"] = docs
    fitted_estimators = []

    if isinstance(topics, int):
        topics = [topics]

    # As a proxy for the topic prevalence, we take the probability of a document
    # belonging to a specific topic. We assume that a higher probability of a document
    # belonging to that topic also results in that document talking more about that topic
    if y == "prevalence":
        for topic in topics:
            # Prepare topic prevalence;
            # exclude probs == 1 as no zero-one inflated beta regressions are currently available
            data["prevalence"] = list(probs[:, topic])
            data_filtered = data.loc[data.prevalence < 1, :]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=data_filtered, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=data_filtered,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()
            fitted_estimators.append(est)

    # Topic content is modeled on a document level by calculating the document's c-TF-IDF
    # representation. Based on that representation, we calculate its cosine similarity
    # with its topic's c-TF-IDF representation. The assumption here is that we expect
    # different similarity scores if a covariate changes the topic content.
    elif y == "content":
        for topic in topics:
            # Extract topic content and prevalence
            selected_data = data.loc[data.topics == topic, :]
            c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
            sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
            selected_data["content"] = sim_matrix[:, topic + 1]

            # Either use a custom estimator or a pre-set model
            if callable(estimator):
                est = estimator(data=selected_data, **estimator_kwargs).fit()
            else:
                est = smf.glm(estimator, data=selected_data,
                              family=sm.families.Gamma(link=sm.families.links.log())).fit()  # perhaps remove the gamma + link?
            fitted_estimators.append(est)

    return fitted_estimators
```
The code for prevalence works well:
```python
ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 1],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="prevalence ~ score",
                       y="prevalence")
print([est.summary() for est in ests])
```
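As an aside (my own reading of the statsmodels output, not something from the original comment): since the GLM uses a log link, exponentiating the coefficients gives multiplicative effects on the expected prevalence:

```python
import numpy as np

# exp(coefficient) = multiplicative change in expected prevalence
# per unit increase of the covariate (illustrative aside, not from #360)
for est in ests:
    print(np.exp(est.params))
```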
But the code for content returns an error:
```python
ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 0],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="content ~ score",
                       y="content")
print([est.summary() for est in ests])
```
I guess I messed something up here, but I didn't really change any of this code:
```python
elif y == "content":
    for topic in topics:
        # Extract topic content and prevalence
        selected_data = data.loc[data.topics == topic, :]
        c_tf_idf_per_doc, _ = topic_model._c_tf_idf(pd.DataFrame({"Document": selected_data.docs.tolist()}), fit=False)
        sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf)
        selected_data["content"] = sim_matrix[:, topic + 1]
```
Sorry for the trouble, and thanks in advance for your response.
I think you might need to change `.c_tf_idf` to `.c_tf_idf_` instead in order to get the correct variable. I believe it was updated a while ago, which explains your issue.
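In other words, the offending line would become:

```python
sim_matrix = cosine_similarity(c_tf_idf_per_doc, topic_model.c_tf_idf_)
```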
I'm following the steps in this issue to test how metadata influences the prevalence/content of topics: https://github.com/MaartenGr/BERTopic/issues/360

But I get `AttributeError: 'BERTopic' object has no attribute 'c_tf_idf'` when running:

```python
ests = estimate_effect(topic_model=topic_model,
                       topics=[-1, 0],
                       metadata=metadata,
                       docs=enr_df_docs,
                       probs=probs,
                       estimator="content ~ score",
                       y="content")
print([est.summary() for est in ests])
```