Closed glciampaglia closed 11 months ago
The algorithm was implemented and tested on a holdout set; it gives good results on engagements only for precision@k (see slide).
Next steps:
We implemented both CF and CF+D. For the evaluation, we created two separate tasks (#221, #222), so this issue can be closed.
In this new CF algorithm we will expand the definition of the user--item matrix. Originally, inspired by the NHB article, the "items" included only the domains of NG sources, and the matrix was filled by counting the number of tweets or retweets that contained a URL to any of those domains. (Each tweet/retweet represents an engagement by one of the participants.)
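A minimal sketch of how such a user--item engagement matrix could be built; the `tweets` records and domain names are illustrative assumptions, not the project's actual data schema:

```python
# Hypothetical sketch: build a user-item matrix of engagement counts,
# where an engagement is a tweet/retweet linking to a tracked domain.
from collections import Counter
import numpy as np

def build_engagement_matrix(tweets, domains):
    """Count, per user, the tweets/retweets linking to each domain.

    tweets: iterable of (user_id, url_domain) pairs (assumed schema).
    domains: list of tracked (e.g. NG-rated) domains.
    Returns (pi, users): the count matrix and the row labels.
    """
    users = sorted({u for u, _ in tweets})
    u_idx = {u: i for i, u in enumerate(users)}
    d_idx = {d: j for j, d in enumerate(domains)}
    counts = Counter((u, d) for u, d in tweets if d in d_idx)
    pi = np.zeros((len(users), len(domains)))
    for (u, d), c in counts.items():
        pi[u_idx[u], d_idx[d]] = c
    return pi, users

# Toy usage with made-up users and domains:
tweets = [("alice", "example.org"), ("alice", "example.org"),
          ("bob", "news.example")]
pi, users = build_engagement_matrix(tweets, ["example.org", "news.example"])
```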
Once the matrix was defined, we applied the TF-IDF formula from the NHB article:
$v_{u,d}=\frac{\pi_{u,d}}{\sum_{h}\pi_{u,h}}\,\log\left(\frac{\pi}{\sum_{u}\pi_{u,d}}\right)$
where the similarity was computed with either the Pearson or the Kendall correlation coefficient.
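The weighting and similarity steps above could be sketched as follows, assuming `pi` is the raw user-item count matrix and interpreting the bare $\pi$ in the formula's numerator as the total engagement count (an assumption on our part):

```python
# Sketch of the TF-IDF weighting and user-user similarity.
# Assumes pi has no all-zero rows or columns (to avoid division by zero).
import numpy as np
from scipy.stats import pearsonr, kendalltau

def tfidf_weight(pi):
    """v[u,d] = (pi[u,d] / sum_h pi[u,h]) * log(pi_total / sum_u pi[u,d])."""
    tf = pi / pi.sum(axis=1, keepdims=True)
    idf = np.log(pi.sum() / pi.sum(axis=0))
    return tf * idf

def user_similarity(v, method="pearson"):
    """Pairwise user similarity over the TF-IDF-weighted rows."""
    corr = pearsonr if method == "pearson" else kendalltau
    n = v.shape[0]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = corr(v[i], v[j])[0]
    return sim
```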
Now, we would like to introduce two modifications to this scheme:
To train the model, we will combine the Hoaxy data snapshot from 2022-23 with the engagements from "pilot1" and "pilot2" that we ran in Q1 of 2023 (pilot 1 was a one-time pull of the home and user timelines, while pilot 2 was a repeated week-long pull).
To test the model, we will use 5-fold cross-validation and compute the following out-of-sample evaluation metrics:
where each metric can be computed using either Pearson's or Kendall's coefficient.