buda-base / git-to-dbs

Git to CouchDB / Fuseki / CouchDB for BDRC Lib App
Apache License 2.0

Counterbalance popularity #46

Open roopeux opened 4 months ago

roopeux commented 4 months ago

Create a new field in the mappings called freshness.
Take the first date of popularity data collection and give all docs published before that date the minimum value of pop_score, which is currently 0.4. Give docs published today the value 1. For docs published during the popularity data collection period, calculate freshness from the number of days since collection started, using a linear function between the minimum pop_score and 1, so that later publication dates get higher values.
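To make the proposal concrete, here is a minimal Python sketch of the linear mapping described above. The function name `freshness`, its parameters, and the example dates are hypothetical; only the 0.4 minimum and the maximum of 1 come from the proposal.

```python
from datetime import date


def freshness(published: date, collection_start: date, today: date,
              min_score: float = 0.4) -> float:
    """Linear freshness between min_score and 1.0 (illustrative sketch)."""
    # Docs published on or before the first day of popularity data
    # collection get the minimum score.
    if published <= collection_start:
        return min_score
    # Docs published today (or with a later date) get the maximum score.
    if published >= today:
        return 1.0
    # In between, interpolate linearly on the number of days elapsed
    # since collection started; later dates give higher values.
    span_days = (today - collection_start).days
    elapsed_days = (published - collection_start).days
    return min_score + (1.0 - min_score) * elapsed_days / span_days


# Hypothetical dates: collection started 2024-01-01, "today" is 2024-01-11.
midpoint = freshness(date(2024, 1, 6), date(2024, 1, 1), date(2024, 1, 11))
```

Because `today` is an input, the value of a given doc changes as time passes, which is the reindexing concern raised in the following comment.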

Explanation: this makes new publications as popular as the most popular old texts for a while, giving them a chance to gain actual popularity.

eroux commented 4 months ago

I understand the use case, but from an engineering perspective I'd rather avoid hard-coding this in the documents: the value of the field changes all the time, so we would need to reindex everything every day for it to work. Is there a way this could be calculated at query time? There's already a publicationDate field that indicates when the record was created.

Note that even if we implement this knowing it's going to be very imperfect, there are still different dates for a bibliographical record:

eroux commented 4 months ago

There's also the case of non-bibliographical records: I tend to think that the records we're adding today are not that relevant compared to old records.

eroux commented 4 months ago

What about having (2026-01-01 - date) / (2026-01-01 - 2007-01-01) as the score? That way we can just update it in 2026.
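A small Python sketch of this fixed-window score, reading the expression as (2026-01-01 - date) / (2026-01-01 - 2007-01-01). The function names and the clamping of out-of-range dates are assumptions added for illustration, not part of the suggestion.

```python
from datetime import date

WINDOW_START = date(2007, 1, 1)
WINDOW_END = date(2026, 1, 1)


def fixed_window_score(d: date, start: date = WINDOW_START,
                       end: date = WINDOW_END) -> float:
    """(end - d) / (end - start), with d clamped to [start, end] (assumption)."""
    clamped = min(max(d, start), end)
    return (end - clamped).days / (end - start).days


def rising_score(d: date) -> float:
    """Complement of the expression: grows with later dates (assumption)."""
    return 1.0 - fixed_window_score(d)
```

As written, the expression evaluates to 1 for 2007-01-01 and 0 for 2026-01-01, so older records score higher. If the intent is to boost newer records, the complement (`rising_score` here) would be the value to index. Either way the value depends only on the record's date and the fixed window, so nothing needs reindexing until the window end is reached.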

roopeux commented 4 months ago

I can calculate this at query time, as you suggested. My idea was that reindexing every day was not necessary because the value does not have to be accurate: as long as some texts are the newest, even if they were not published today, it would be OK to keep boosting them. But since latencies are under control, you don't have to do it.

eroux commented 4 months ago

Added a scans_freshness field, which is the date put in the interval [2016-05-01, 2026-05-01]. I think it should be doable to adjust it at query time so that it boosts consistently over time.
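One possible reading of this, sketched in Python: the indexed value is the record's date clamped into the interval, and the boost is derived from it relative to the query date. The function names, the clamping interpretation of "put in the interval", and the sliding-window boost are all assumptions, not the actual implementation.

```python
from datetime import date

LOWER = date(2016, 5, 1)
UPPER = date(2026, 5, 1)


def scans_freshness_value(d: date, lower: date = LOWER,
                          upper: date = UPPER) -> date:
    """Clamp a record's date into [lower, upper] (assumed index-time step)."""
    return min(max(d, lower), upper)


def query_time_boost(stored: date, query_date: date,
                     window_days: int = (UPPER - LOWER).days) -> float:
    """Boost in [0, 1] from the stored date's age at query time (assumption)."""
    age_days = (query_date - stored).days
    # Age 0 gives 1.0; an age of one full window or more gives 0.0.
    return min(1.0, max(0.0, 1.0 - age_days / window_days))
```

With this split, the indexed field never changes for a given record, while the boost shifts with the query date, so the ranking of new versus old scans stays consistent without daily reindexing.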