CSDL-UMD / Rockwell

Rockwell uses the Twitter authentication workflow to render a Twitter-like feed in order to collect information about users' interactions with their feeds. It also has an attention check feature to ensure that users are paying attention to their feeds and not simply scrolling through with the intent of finishing quickly.

Implement new Collaborative Filtering algorithm #212

Closed glciampaglia closed 11 months ago

glciampaglia commented 1 year ago

In this new CF algorithm we will expand the definition of the user-item matrix. Originally, inspired by the NHB article, the "items" included only the domains of NG sources, and the matrix was filled by counting the number of tweets or retweets that had a URL to any of those domains (each tweet or retweet representing an engagement by one of the participants).

Once the matrix was defined, we applied the TF-IDF formula from NHB:

$v_{u,d} = \frac{\pi_{u,d}}{\sum_{h} \pi_{u,h}} \log\left(\frac{\pi}{\sum_{u} \pi_{u,d}}\right)$

Similarity was then computed with either the Pearson or the Kendall correlation coefficient.
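As a rough illustration of the weighting and similarity steps described above, here is a minimal pure-Python sketch. It is not the project's code. The function names `tfidf_weights` and `pearson` are hypothetical, the matrix is assumed to be a small dense list of lists, and the bare $\pi$ inside the logarithm is assumed to be the grand total of all counts.

```python
import math


def tfidf_weights(counts):
    """Weight a dense user-by-item count matrix, where counts[u][d] is pi_{u,d}.

    Assumption: the bare `pi` in the logarithm is the sum of all counts.
    """
    total = sum(sum(row) for row in counts)  # pi (assumed grand total)
    n_items = len(counts[0])
    # Column totals: sum over users of pi_{u,d}
    col_sums = [sum(row[d] for row in counts) for d in range(n_items)]
    weights = []
    for row in counts:
        row_sum = sum(row)  # sum over items h of pi_{u,h}
        weights.append([
            # A zero count gets zero weight; this also avoids dividing by zero.
            (c / row_sum) * math.log(total / col_sums[d]) if c else 0.0
            for d, c in enumerate(row)
        ])
    return weights


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Toy matrix: 3 users (rows) by 3 items (columns).
toy_counts = [[2, 0, 1], [1, 1, 0], [0, 3, 1]]
toy_weights = tfidf_weights(toy_counts)
similarity_0_1 = pearson(toy_weights[0], toy_weights[1])
```

Kendall's coefficient could be substituted for `pearson`, for example with `scipy.stats.kendalltau`, if SciPy is available.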

Now, we would like to introduce two modifications to this scheme:

  1. Include additional columns in which the "items" are the authors of tweets / retweets.
  2. Also count, when available, the domains / authors found in likes.
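The two modifications can be sketched as follows. This is only an illustration under stated assumptions: the record layout (`user`, `kind`, `domains`, `author`) and the function name `build_counts` are hypothetical, and the restriction of domains to NG sources is left out for brevity.

```python
from collections import Counter, defaultdict


def build_counts(engagements, include_likes=True):
    """Build sparse user-item counts with both domain and author "items".

    Each engagement is a hypothetical dict with keys:
      user    - the participant who engaged
      kind    - "tweet", "retweet", or "like"
      domains - list of URL domains found in the post
      author  - account treated as the post's author, or None
    """
    counts = defaultdict(Counter)
    for e in engagements:
        if e["kind"] == "like" and not include_likes:
            continue
        # Original scheme: one column per domain.
        for domain in e.get("domains", []):
            counts[e["user"]][("domain", domain)] += 1
        # Modification 1: one extra column per author.
        author = e.get("author")
        if author is not None:
            counts[e["user"]][("author", author)] += 1
    return counts
```

Keying items as `("domain", ...)` and `("author", ...)` tuples keeps the two kinds of column distinct even if a domain and an account happen to share a name.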

To train the model, we will combine the Hoaxy data snapshot from 2022-23 with the engagements from "pilot1" and "pilot2", which we ran in Q1 of 2023 (pilot 1 was a one-time pull of home and user timelines, while pilot 2 was a repeated, week-long pull).

To test the model, we will use 5-fold cross-validation and compute the following out-of-sample evaluation metrics:

Each of these can be computed using either Pearson's or Kendall's coefficient.
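For reference, a 5-fold split can be sketched in a few lines. The issue does not say what unit is split (users, engagements, or matrix entries), so this sketch simply splits a range of record indices. The function name `k_fold_splits` and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_splits(n_records, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducible folds
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each record lands in exactly one test fold, so every out-of-sample metric is computed once per record across the five folds.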

glciampaglia commented 1 year ago

We implemented the algorithm and tested it on a holdout set. On engagements, it gives good results only for precision@k (see slide).
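For readers unfamiliar with the metric, precision@k is the fraction of the top-k recommended items that the user actually engaged with. A minimal sketch follows, assuming a ranked list of recommended items and a set of held-out engaged items. The function name `precision_at_k` is hypothetical and this is not necessarily how the evaluation in this project computes it.

```python
def precision_at_k(ranked_items, relevant_items, k):
    """Fraction of the first k ranked items that appear in relevant_items."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    # Dividing by k (not len(top_k)) penalizes rankings shorter than k.
    return hits / k
```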

Next steps:

glciampaglia commented 11 months ago

We implemented both CF and CF+D. For the evaluation, we created two separate tasks (#221, #222), so this issue can be closed.