Dealing with cold start users click history

igor17400 commented 7 months ago

Hello! I'm currently handling a dataset where the histories column might initially be empty, especially for users who are accessing the system for the first time.

Given this context, I'm seeking advice on how to approach a particular situation highlighted in the code found at this GitHub link. The process involves tokenizing the titles of previously clicked news articles, but I'm facing a potential cold start issue for new users without any history. In these instances, should I consider tokenizing empty titles, abstracts, etc.?

igor17400 commented 7 months ago

Here is an update on what I did.

Inside __getitem__ in rec_dataset.py, I added the following condition:

if history.size == 1 and history[0] == '':
    history = self._initialize_cold_start()
else:
    history = self.news.loc[history]

where _initialize_cold_start is defined as follows:

def _initialize_cold_start(self):
    """
    In cold start cases the history can be empty, so we return a
    one-row DataFrame with empty values for the embedding.
    """
    # Build the one-row placeholder directly
    # (DataFrame.append no longer exists in pandas 2.x)
    history = pd.DataFrame({
        'title': [''],
        'abstract': [''],
        'sentiment_class': [0],
        'sentiment_score': [0.0]
    })

    # Explicitly set the data types for the entire DataFrame
    history = history.astype({
        'title': 'object',
        'abstract': 'object',
        'sentiment_class': 'int64',
        'sentiment_score': 'float64'
    })

    return history

This may be useful for other people who are trying to solve the same problem.
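
For anyone who wants to try the idea outside the library, here is a minimal standalone sketch. The toy news table, the news IDs, and the helper names cold_start_placeholder and lookup_history are made up for illustration and are not part of newsreclib. It also assumes, as in the condition above, that an empty history arrives as an array holding a single empty string.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the news table (hypothetical data; the real table has more columns).
news = pd.DataFrame(
    {
        "title": ["Markets rally", "Local team wins"],
        "abstract": ["Stocks rose.", "A late goal decided it."],
        "sentiment_class": [1, 2],
        "sentiment_score": [0.3, 0.8],
    },
    index=["N1", "N2"],
)


def cold_start_placeholder() -> pd.DataFrame:
    """Return one row of empty values standing in for a missing click history."""
    return pd.DataFrame(
        {
            "title": [""],
            "abstract": [""],
            "sentiment_class": np.array([0], dtype="int64"),
            "sentiment_score": np.array([0.0], dtype="float64"),
        }
    )


def lookup_history(history_ids: np.ndarray) -> pd.DataFrame:
    """Look up clicked news, falling back to the placeholder for cold start users."""
    # Assumption: an empty history is encoded as a single empty string.
    if history_ids.size == 1 and history_ids[0] == "":
        return cold_start_placeholder()
    return news.loc[history_ids]


warm = lookup_history(np.array(["N1", "N2"]))  # two real rows
cold = lookup_history(np.array([""]))          # one placeholder row
```

Both calls return a DataFrame with the same four columns, so whatever tokenizes the titles and abstracts afterwards does not need a separate branch for new users.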

andreeaiana commented 7 months ago

Hi @igor17400,

Thanks for raising this issue. Indeed, the original code did not work with empty user histories, but implementing this functionality should be useful for many users.

I think your solution is simple and elegant. I can have a look at it over the weekend, to test it with both pretrained word embeddings and PLMs, and streamline it across the data preprocessing functions for all datasets. Would you like to open a PR with your proposed solution?

igor17400 commented 7 months ago

Hi @andreeaiana,

I'm in the process of implementing PP-Rec, as outlined in PR https://github.com/andreeaiana/newsreclib/pull/12. I'm still working through it, making sure the blocks are accurate and checking the scores and behaviors for MIND large and Adressa, so it is not ready to be merged yet. However, I wanted to let you know that in this PR I'm adding the _initialize_cold_start idea along with the previously mentioned spinner for score calculation to avoid the terminal freezing.

andreeaiana commented 7 months ago

Great, thanks for letting me know and for your contributions to the library.

igor17400 commented 7 months ago

@andreeaiana I noticed you filter out cold start users (those with empty histories). Why is that?

Link to the code

I'm wondering if it might be better to use a strategy like the one I previously mentioned (_initialize_cold_start) to pre-populate these cold start users with some placeholder news articles, rather than removing them. But maybe my thinking is wrong.

andreeaiana commented 7 months ago

I think that's a good idea, we can try it. I know that some models originally do that, but not all of them.
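
To make the two options concrete, here is a rough sketch on a toy behaviors table. The column names, the user and news IDs, and the "<PAD>" placeholder ID are assumptions for illustration only, not the library's actual schema or preprocessing code.

```python
import pandas as pd

# Hypothetical behaviors table: one row per user with a list of clicked news IDs.
behaviors = pd.DataFrame(
    {
        "user": ["U1", "U2", "U3"],
        "history": [["N1", "N2"], [], ["N3"]],
    }
)

# A user is treated as cold start when the history list is empty.
is_cold = behaviors["history"].apply(len) == 0

# Option A: filter out cold start users entirely.
filtered = behaviors[~is_cold].reset_index(drop=True)

# Option B: keep every user and give cold start users a single placeholder item.
PLACEHOLDER_ID = "<PAD>"  # made-up ID for an empty placeholder article
padded = behaviors.copy()
padded["history"] = padded["history"].apply(
    lambda h: h if h else [PLACEHOLDER_ID]
)
```

With option A the model never sees U2 during training or evaluation, while option B keeps U2 and relies on the placeholder mapping to an empty (all-padding) article downstream.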

andreeaiana / newsreclib

Dealing with cold start users click history #11