Open igor17400 opened 7 months ago
Here is an update on what I did. Inside __getitem__ in rec_dataset.py, I added the following condition:
    if history.size == 1 and history[0] == '':
        history = self._initialize_cold_start()
    else:
        history = self.news.loc[history]
where _initialize_cold_start is defined as follows:
    # Requires `import pandas as pd` at the top of the module.
    def _initialize_cold_start(self):
        """
        In cold-start cases the history can be empty, so we return a
        single-row DataFrame of placeholder values for the embedding.
        """
        # Build the placeholder row directly. DataFrame.append was
        # removed in pandas 2.0, so the row is passed to the constructor.
        history = pd.DataFrame({
            'title': [''],
            'abstract': [''],
            'sentiment_class': [0],
            'sentiment_score': [0.0],
        })
        # Explicitly set the data types for the entire DataFrame
        history = history.astype({
            'title': 'object',
            'abstract': 'object',
            'sentiment_class': 'int64',
            'sentiment_score': 'float64',
        })
        return history
This may be useful for other people who are trying to solve the same problem.
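To make the branch easier to try outside the library, here is a minimal standalone sketch. It assumes the history arrives as a NumPy array produced by splitting a space-separated string, which is why an empty history shows up as a single empty string. The toy `news` table and the `lookup_history` helper are hypothetical stand-ins, not part of the library.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset's news table, indexed by news ID.
news = pd.DataFrame(
    {
        "title": ["Storm hits coast", "Local team wins"],
        "abstract": ["Heavy rain expected.", "A late goal decided it."],
        "sentiment_class": [0, 2],
        "sentiment_score": [-0.4, 0.8],
    },
    index=["N1", "N2"],
)


def initialize_cold_start() -> pd.DataFrame:
    """Return one placeholder row with the same columns as `news`."""
    placeholder = pd.DataFrame(
        {
            "title": [""],
            "abstract": [""],
            "sentiment_class": [0],
            "sentiment_score": [0.0],
        }
    )
    return placeholder.astype(
        {
            "title": "object",
            "abstract": "object",
            "sentiment_class": "int64",
            "sentiment_score": "float64",
        }
    )


def lookup_history(history: np.ndarray) -> pd.DataFrame:
    """Mirror the condition from the post: placeholder row or real lookup."""
    # "".split(" ") yields [""], so an empty history is one empty string.
    if history.size == 1 and history[0] == "":
        return initialize_cold_start()
    return news.loc[history]


cold = lookup_history(np.array("".split(" ")))
warm = lookup_history(np.array("N2 N1".split(" ")))
```

Here `cold` is the single placeholder row, while `warm` contains the two real articles in the order they appear in the history.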
Hi @igor17400,
Thanks for raising this issue. Indeed, the original code did not work with empty user histories, but implementing this functionality should be useful for many users.
I think your solution is simple and elegant. I can have a look at it over the weekend, to test it with both pretrained word embeddings and PLMs, and streamline it across the data preprocessing functions for all datasets. Would you like to open a PR with your proposed solution?
Hi @andreeaiana,
I'm in the process of implementing PP-Rec, as outlined in PR https://github.com/andreeaiana/newsreclib/pull/12. I'm still working through it, making sure the blocks are accurate and checking the scores and behaviors for MIND large and Adressa, so it is not ready to be merged yet. However, I wanted to let you know that in this PR I'm also adding the _initialize_cold_start idea, along with the previously mentioned spinner for score calculation to avoid the terminal freezing.
Great, thanks for letting me know and for your contributions to the library.
@andreeaiana I noticed you filter out cold start users (those with empty histories). Why is that?
I'm wondering if it might be better to use a strategy like the one I previously mentioned (_initialize_cold_start) to pre-populate these cold start users with some placeholder news articles, rather than removing them. But maybe my thinking is wrong.
I think that's a good idea, we can try it. I know that some models originally do that, but not all of them.
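The two options discussed here, dropping cold-start users versus keeping them with a placeholder, can be compared with a small sketch. It uses plain Python records rather than the library's own preprocessing, and the field names and `PLACEHOLDER_ID` are hypothetical.

```python
# Hypothetical impression logs: a user ID plus a space-separated click history.
behaviors = [
    {"user": "U1", "history": "N1 N2"},
    {"user": "U2", "history": ""},  # cold-start user
    {"user": "U3", "history": "N3"},
]

# Hypothetical ID standing in for the placeholder article.
PLACEHOLDER_ID = "<cold_start>"


def drop_cold_start(rows):
    """Strategy A: filter out users whose history is empty."""
    return [row for row in rows if row["history"].strip()]


def fill_cold_start(rows):
    """Strategy B: keep every user; empty histories get one placeholder ID."""
    return [
        {**row, "history": row["history"].strip() or PLACEHOLDER_ID}
        for row in rows
    ]


filtered = drop_cold_start(behaviors)
filled = fill_cold_start(behaviors)
```

Strategy A shrinks the dataset to users with at least one click, while strategy B preserves all users so that cold-start cases can still be scored and evaluated.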
Hello! I'm currently handling a dataset where the histories column might initially be empty, especially for users who are accessing the system for the first time. Given this context, I'm seeking advice on how to approach a particular situation highlighted in the code found at this GitHub link. The process involves tokenizing the titles of previously clicked news articles, but I'm facing a potential cold start issue for new users without any history. In these instances, should I consider tokenizing empty titles, abstracts, etc.?
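One way to picture what tokenizing an empty title would produce is the sketch below. It uses a toy whitespace tokenizer with a hypothetical vocabulary and padding ID, not the tokenizer used in the library, and adds a mask so that downstream code can tell placeholder articles from real ones.

```python
from typing import List, Tuple

PAD_ID = 0  # hypothetical padding token ID
UNK_ID = 1  # hypothetical unknown-word token ID

# Toy vocabulary, for illustration only.
vocab = {"storm": 2, "hits": 3, "coast": 4}


def tokenize_title(title: str, max_len: int = 4) -> List[int]:
    """Whitespace-tokenize a title, then truncate or pad to max_len IDs."""
    ids = [vocab.get(word, UNK_ID) for word in title.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_history(titles: List[str], max_len: int = 4) -> Tuple[List[List[int]], List[int]]:
    """Return token IDs per title and a mask that is 1 for non-empty titles."""
    token_ids = [tokenize_title(title, max_len) for title in titles]
    mask = [1 if title.strip() else 0 for title in titles]
    return token_ids, mask


cold_ids, cold_mask = encode_history([""])
warm_ids, warm_mask = encode_history(["Storm hits coast"])
```

In this sketch an empty title becomes a sequence made entirely of padding IDs with a mask value of 0. Whether a given model copes with an all-padding input is something to check per model.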