Deduplicate items across sources

fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application

https://selfoss.aditu.de

GNU General Public License v3.0


Deduplicate items across sources #1444

Open mrichtarsky opened 11 months ago

mrichtarsky commented 11 months ago

Hi,

selfoss only adds an item from a feed when it is not already present for that source. However, newspapers often have separate feeds for different topics. When you subscribe to multiple feeds, you can end up with the same article from multiple feeds/sources.

So it would be nice if selfoss could check whether the article is present regardless of source. This is usually ok since the ID is the URL to the article, which should be unique across sources.
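As a rough illustration of the difference between the two behaviors, here is a minimal Python sketch. It is not the code from selfoss or from the linked commit; the `Item` class, the `select_new_items` function, and the `dedupe_across_sources` flag are all hypothetical names chosen for this example, and the sketch assumes the UID is usable as a global key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uid: str        # usually the article URL
    source_id: int  # feed the item was fetched from
    title: str


def select_new_items(stored, fetched, dedupe_across_sources=False):
    """Return the fetched items that are not considered already present.

    With dedupe_across_sources=False an item is a duplicate only if the
    same uid already exists for the same source. With True, a matching
    uid from any source counts as a duplicate.
    """
    if dedupe_across_sources:
        key = lambda item: item.uid
    else:
        key = lambda item: (item.source_id, item.uid)

    seen = {key(item) for item in stored}
    new_items = []
    for item in fetched:
        k = key(item)
        if k not in seen:
            seen.add(k)  # also guards against repeats within one fetch
            new_items.append(item)
    return new_items
```

With the flag off, an article that was already stored for feed 1 is added again when feed 2 delivers it; with the flag on, it is skipped.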

I have implemented this change in behavior here, controlled by an ini parameter: https://github.com/mrichtarsky/selfoss/commit/f31bf4ff5091e8224c508200d1f42e915c921784

Would this be interesting for others as well?

Thanks and best regards, Martin

jtojnar commented 11 months ago

Thanks, that is an interesting idea. I wonder if we could make it always enabled and have the item appear in multiple sources.

We would probably need to replace the source column in the items table with an m:n association table. Will need to check the performance implications.
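To make the m:n idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`items`, `sources`, `items_sources`, and so on) are assumptions for illustration only and do not reflect the actual selfoss schema; it also says nothing about the performance question raised above.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical schema: items no longer carry a source column;
    # the link lives in the items_sources association table instead.
    conn.executescript(
        """
        CREATE TABLE sources (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE items (
            id INTEGER PRIMARY KEY,
            uid TEXT NOT NULL UNIQUE,
            title TEXT NOT NULL
        );
        CREATE TABLE items_sources (
            item_id INTEGER NOT NULL REFERENCES items(id),
            source_id INTEGER NOT NULL REFERENCES sources(id),
            PRIMARY KEY (item_id, source_id)
        );
        """
    )


def add_item(conn, uid, title, source_id):
    """Store the item once and link it to the given source."""
    conn.execute(
        "INSERT OR IGNORE INTO items (uid, title) VALUES (?, ?)", (uid, title)
    )
    item_id = conn.execute(
        "SELECT id FROM items WHERE uid = ?", (uid,)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO items_sources (item_id, source_id) VALUES (?, ?)",
        (item_id, source_id),
    )
    return item_id


def uids_for_source(conn, source_id):
    """List the uids of all items linked to one source."""
    rows = conn.execute(
        "SELECT i.uid FROM items i "
        "JOIN items_sources s ON s.item_id = i.id "
        "WHERE s.source_id = ? ORDER BY i.id",
        (source_id,),
    )
    return [row[0] for row in rows]
```

In this layout an article delivered by two feeds is stored as one row in `items` and two rows in `items_sources`, so it still shows up when either source is listed.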

davidoskky commented 11 months ago

This is a very nice idea. What are you using as the identifier to deduplicate? The URL? What if the two feeds return different content? That should not be an issue if you're using the full text recovery, though.

jtojnar commented 11 months ago

> what are you using as identifier to deduplicate? The url?

The UID. Most commonly this is the post URL, but that is not required. For example, blogger.com will use something like tag:blogger.com,1999:blog-6112936277054198647.post-403878284366003238.

> What if the two feeds return a different content? Should not be an issue if you're using the full text recovery though.

We could have findAll return the source ID in addition to the item ID, then, when the source ID does not match, check whether the content and URL match, and only deduplicate in that case.

That would also probably resolve the uid collisions.
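The check described above could look roughly like the following Python sketch. It does not use the real `findAll` signature, which is not shown in this thread; the `Item` class and the `is_duplicate` helper are hypothetical, and `existing` stands in for whatever stored records share the candidate's UID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    uid: str
    source_id: int
    url: str
    content: str


def is_duplicate(existing, candidate):
    """Decide whether candidate should be skipped.

    existing: stored items that share candidate.uid (any source).
    Same source: always a duplicate, as before.
    Other source: a duplicate only if URL and content both match,
    so unrelated items that merely collide on uid are kept.
    """
    for item in existing:
        if item.source_id == candidate.source_id:
            return True
        if item.url == candidate.url and item.content == candidate.content:
            return True
    return False
```

Comparing full content is the strictest reading of the proposal; a real implementation might prefer a hash of the content to keep the comparison cheap.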

The issue that items will be missing from some of the sources will still remain, though, which is why I would like to test the performance impact of having the sources table in an m:n relation to items.