Closed glciampaglia closed 11 months ago
The algorithm was implemented and tested on a holdout set; it gives good results on engagements only for precision@k (see slide).
Next steps:
We implemented both CF and CF+D. For the evaluation, we created two separate tasks (#221, #222), so this issue can be closed.
In this new CF algorithm we will expand the definition of the user--item matrix. Originally, inspired by the NHB article, the "items" included only the domains of NG sources, and the matrix was filled by counting the number of tweets or retweets that contained a URL to any of those domains. (Each tweet/retweet represents an engagement by one of the participants.)
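A minimal sketch of how such a user--item engagement matrix could be built; the `tweets` records and domain names are illustrative assumptions, not the project's actual data schema:

```python
# Hypothetical sketch: build a user-item matrix of engagement counts,
# where an engagement is a tweet/retweet linking to a tracked domain.
from collections import Counter
import numpy as np

def build_engagement_matrix(tweets, domains):
    """Count, per user, the tweets/retweets linking to each domain.

    tweets: iterable of (user_id, url_domain) pairs (assumed schema).
    domains: list of tracked (e.g. NG-rated) domains.
    Returns (pi, users): the count matrix and the row labels.
    """
    users = sorted({u for u, _ in tweets})
    u_idx = {u: i for i, u in enumerate(users)}
    d_idx = {d: j for j, d in enumerate(domains)}
    counts = Counter((u, d) for u, d in tweets if d in d_idx)
    pi = np.zeros((len(users), len(domains)))
    for (u, d), c in counts.items():
        pi[u_idx[u], d_idx[d]] = c
    return pi, users

# Toy usage with made-up users and domains:
tweets = [("alice", "example.org"), ("alice", "example.org"),
          ("bob", "news.example")]
pi, users = build_engagement_matrix(tweets, ["example.org", "news.example"])
```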
Once the matrix was defined, we applied the TF-IDF formula from the NHB article:
$v_{u,d}=\frac{\pi_{u,d}}{\sum_{h}\pi_{u,h}}\,\log\left(\frac{\pi}{\sum_{u}\pi_{u,d}}\right)$
where the similarity was computed with either the Pearson or the Kendall correlation coefficient.
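The weighting and similarity steps above could be sketched as follows, assuming `pi` is the raw user-item count matrix and interpreting the bare $\pi$ in the formula's numerator as the total engagement count (an assumption on our part):

```python
# Sketch of the TF-IDF weighting and user-user similarity.
# Assumes pi has no all-zero rows or columns (to avoid division by zero).
import numpy as np
from scipy.stats import pearsonr, kendalltau

def tfidf_weight(pi):
    """v[u,d] = (pi[u,d] / sum_h pi[u,h]) * log(pi_total / sum_u pi[u,d])."""
    tf = pi / pi.sum(axis=1, keepdims=True)
    idf = np.log(pi.sum() / pi.sum(axis=0))
    return tf * idf

def user_similarity(v, method="pearson"):
    """Pairwise user similarity over the TF-IDF-weighted rows."""
    corr = pearsonr if method == "pearson" else kendalltau
    n = v.shape[0]
    sim = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = corr(v[i], v[j])[0]
    return sim
```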
Now, we would like to introduce two modifications to this scheme:
To train the model, we will combine the Hoaxy data snapshot from 2022-23 with the engagements from "pilot1" and "pilot2" that we ran in Q1 of 2023 (pilot 1 was a one-time pull of the home and user timelines, while pilot 2 was a repeated week-long pull).
To test the model, we will use 5-fold cross-validation and compute the following out-of-sample evaluation metrics:
where each metric can be computed using either Pearson's or Kendall's coefficient.