FreshRSS / FreshRSS

A free, self-hostable news aggregator…
https://freshrss.org
GNU Affero General Public License v3.0

[Bug] Category purge settings appear to be ignored #6863

Open posita opened 5 days ago

posita commented 5 days ago

Describe the bug

Purge policies associated with categories appear to have no effect. It's unclear how a feed in the category is supposed to adopt the category's policy, since the feed can only select either "by default" or its own policy. I assumed that a feed purge policy of "by default" would defer to the policy of the category it was in, but that does not appear to be the case.

To Reproduce

  1. Create a category and set its purge policy as follows: never delete unread articles, keep articles up to a maximum age of one hour, and set the minimum number of articles to keep to 0. Do not set any other policies.
  2. Add https://www.youtube.com/@NFL to that category and mark all of its articles as read.
  3. Wait 1.25 hours.
  4. Run ./app/actualize_script.php or ./cli/actualize-user.php --user <youruser>.

Expected behavior

I would expect any read article discovered more than one hour ago to be purged, but this does not appear to be the case.

FreshRSS version

1.24.3

Environment information

I can add these later if needed. I saw no errors, either in the UI or in the logs.

Additional context

#6601 is possibly related to what I'm experiencing with this issue.

Alkarex commented 5 days ago

By design, the date used to purge articles is the date when they were discovered, not their declared date.

Alkarex commented 5 days ago

However, you can automatically mark articles as read based on their publication date, using filter actions.

posita commented 5 days ago

Purge means delete, though, yes? Also, it appears not to work, even using the discovery date to determine the age. (Original report edited to reflect the described behavior.)

Alkarex commented 5 days ago

> I would expect any read article discovered more than one hour ago to be purged

No, that would mostly never be the case.

Purge (yes, meaning delete) is only triggered randomly, because it is resource-intensive. It can be triggered by hand, though, with the Purge now button in the archiving options. Note that several other factors might explain what you observe. In particular, FreshRSS will keep at least the articles that are still in the upstream feed (otherwise the logic to determine whether an article is new would be challenging), plus a few more to be safe (to avoid articles reappearing).

What you want would be better achieved by using the filters and policies to automatically mark articles as read, instead of deleting them.

posita commented 5 days ago

Okay, the behavior you describe (even if intended) is very hard to tease out of the UI. It's strange to have a purge policy that is randomly ignored due to other, unstated factors. This might be better hinted at by renaming the feature, perhaps to something like "random purge" or "opportunistic purge" or "possibly purge" or something better than what I'm able to arrive at.

Is there a way to force purge via the CLI/cron? The reason I thought this was a bug is that I have feeds that have similar settings to what I've originally posted (albeit with a max age of four weeks instead of one hour), and they have hundreds of articles, dating back months (even after doing a "purge now"). I was trying to isolate a more easily reproduced case.

UPDATE 1: If what I think you're saying is true, my months-old articles could still be kept around if they appear in the upstream feed, which could surface things going back years. Is that right? If so, is there any limit/cap to this?

UPDATE 2: I don't think the upstream bit is getting in my way. For example, the following shows no date older than 2024-08-14:

$ curl --location https://www.eff.org/rss/updates.xml | xmllint --xpath '//pubDate/text()' -
Wed, 02 Oct 2024 21:11:26 +0000
...
Wed, 14 Aug 2024 00:12:45 +0000

I have articles prior to that (e.g., this one), with a max age of four weeks. My entry row shows a date of 1722279000 which translates to Mon Jul 29 11:50:00 AM PDT 2024 and an id of 1722322846[.]349428, which translates to Tue Jul 30 12:00:46 AM PDT 2024. (I'm assuming it's the latter that determines the "discovery" date, since I don't see another column that could reasonably encode that?)
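For anyone who wants to re-check the date arithmetic above, here is a minimal Python sketch. It assumes the `id` column holds microseconds since the Unix epoch (consistent with the value quoted above) and uses a fixed UTC-7 offset for PDT; the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Fixed UTC-7 offset; sufficient for these summer 2024 timestamps.
PDT = timezone(timedelta(hours=-7), "PDT")

# Newest and oldest pubDate values quoted from the feed output above.
pub_dates = [
    "Wed, 02 Oct 2024 21:11:26 +0000",
    "Wed, 14 Aug 2024 00:12:45 +0000",
]
oldest_upstream = min(parsedate_to_datetime(s) for s in pub_dates)

entry_date = 1722279000       # `date` column, seconds since the epoch
entry_id = 1722322846349428   # `id` column, assumed to be microseconds since the epoch

published = datetime.fromtimestamp(entry_date, tz=PDT)
discovered = datetime.fromtimestamp(entry_id // 1_000_000, tz=PDT)

print(published.isoformat())         # 2024-07-29T11:50:00-07:00
print(discovered.isoformat())        # 2024-07-30T00:00:46-07:00
print(discovered < oldest_upstream)  # True: discovered before the oldest upstream item
```

So the stored article was discovered about two weeks before the oldest item currently in the upstream feed.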

Alkarex commented 5 days ago

> the behavior you describe (even if intended) is very hard to tease out of the UI.

Yes, we should add some hints in the UI and a section in the documentation.

> my months-old articles could still be kept around if they appear in the upstream feed, which could surface things going back years. Is that right?

Indeed, that is correct. The idea, though, is that those articles should be marked as read and therefore not appear in your normal (unread) views. In other words: use mark as read, which is precise, rather than delete. Think of deletion as similar to what many systems do when cleaning a spam folder or running automatic database maintenance: it is driven by heuristics guided by some hard and soft constraints.

> If so, is there any limit/cap to this?

No, because otherwise the articles would be re-added at the next refresh.
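To illustrate why a cap would not help, here is a toy Python model of a refresh. This is not FreshRSS's code; the `refresh` function and the item identifiers are hypothetical, and it assumes only that an item counts as new when its identifier is not already stored.

```python
from __future__ import annotations


def refresh(stored_ids: set[str], upstream_ids: list[str]) -> list[str]:
    """Store every upstream item that is not known yet; return those treated as new."""
    new_items = [guid for guid in upstream_ids if guid not in stored_ids]
    stored_ids.update(new_items)
    return new_items


# Hypothetical feed that still lists a years-old item.
upstream = ["post-2021", "post-2024"]
stored: set[str] = set()

first = refresh(stored, upstream)   # initial fetch: both items are new
stored.discard("post-2021")         # purge the old item although it is still upstream
second = refresh(stored, upstream)  # next refresh: the purged item looks new again

print(first)   # ['post-2021', 'post-2024']
print(second)  # ['post-2021']
```

In this model, deleting an item that the feed still serves just brings it back as a new (unread) article on the next refresh.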

Alkarex commented 5 days ago

> I'm assuming it's the latter that determines the "discovery" date, since I don't see another column that could reasonably encode that?

Correct

Alkarex commented 5 days ago

> Is there a way to force purge via the CLI/cron?

Not yet, but that would be easy to add, using https://github.com/FreshRSS/FreshRSS/blob/edge/cli/db-optimize.php as a model. PR welcome.

brpaz commented 4 days ago

Reading this issue, I just discovered that I have been using purge wrong for more than a year, lol. That explains why my feed keeps growing, even with "aggressive" purge options set.

I had the same interpretation as @posita.

Time to configure filter actions, then.

Alkarex commented 4 days ago

> I have articles prior to that

What lastSeen value has that article?

posita commented 1 day ago

> > I have articles prior to that
>
> What lastSeen value has that article?

Looks like Wed Sep 11 02:00:21 AM CDT 2024 (1726038021), which falls within the four-week boundary. If I understand correctly, the date used for determining purge eligibility is the later of last seen and id (discovery date)?

Alkarex commented 1 day ago

lastSeen is always newer than or equal to id. The logic is that when an article has not been seen for longer than the defined limit, it may become eligible for deletion (subject to some additional criteria).

https://github.com/FreshRSS/FreshRSS/blob/ca7221e885eae3ff075ea2c05798ceb4cec24daf/app/Models/EntryDAO.php#L676-L715
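As a rough illustration of the rule described in this comment, here is a simplified Python sketch. It is not the linked EntryDAO code: the function name and parameters are hypothetical, and it models only the lastSeen age check and the "never delete unread articles" option, leaving out the additional criteria mentioned above.

```python
WEEK = 7 * 24 * 3600   # seconds in one week
FOUR_WEEKS = 4 * WEEK


def may_be_purged(last_seen: int, now: int, max_age: int,
                  is_read: bool, keep_unread: bool = True) -> bool:
    """Return True if an article could become eligible for deletion (simplified)."""
    if keep_unread and not is_read:
        return False  # unread articles are protected by the policy
    # Eligible only once the article has not been seen for longer than max_age.
    return now - last_seen > max_age


last_seen = 1726038021  # lastSeen value reported in the thread

print(may_be_purged(last_seen, last_seen + 3 * WEEK, FOUR_WEEKS, is_read=True))   # False
print(may_be_purged(last_seen, last_seen + 5 * WEEK, FOUR_WEEKS, is_read=True))   # True
print(may_be_purged(last_seen, last_seen + 5 * WEEK, FOUR_WEEKS, is_read=False))  # False
```

Under this reading, the age is measured from lastSeen rather than from id, which is why an article discovered in July but last seen on 11 September is not yet past a four-week limit three weeks later.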