FreshRSS / FreshRSS

A free, self-hostable news aggregator…
https://freshrss.org
GNU Affero General Public License v3.0

Clear duplicates #948

Open gllmhyt opened 9 years ago

gllmhyt commented 9 years ago

Here's the situation:

Once in a while, I get duplicates: the article from the planet subscription and the same article from the blog subscription.

I'm wondering if it would be possible to "clear" the displayed articles in those cases and to display only one of them (ideally the original one, not the copies on the planets).

Alkarex commented 9 years ago

Thanks for the suggestion. I have been thinking of introducing such an option for a while, since it is also something I need. In your case, it is the same URL for both copies of the article, isn't it?

Wanabo commented 8 years ago

Any progress on this? I experience a lot of duplicate articles myself as well.

Case 1: duplicates from the same RSS feed source.
Case 2: duplicates from multiple RSS feed sources.

For case 1, remove the duplicates from the source. For case 2, check whether the article already exists in one of the other sources; if it does, reject the article.

In case 2 it would also be nice to compare the articles and keep the most complete one (check the number of words, check for images).

Alkarex commented 8 years ago

@Wanabo In your case, what exactly do your duplicates look like? Do they have the exact same:

Wanabo commented 8 years ago

URL: may differ, because the same article comes from different sources (= different websites) but the articles are the same.
URL: is the same and comes from different sources, BUT from the same website, which has for example a feed for Sport and a feed for Lifestyle; the same article is in both feeds.
Title: see URL.
Content: see URL.

Image of problem

gerhard-tinned commented 8 years ago

Hi,

My situation is similar. Here is my example.

Some sites offer the option to subscribe to ALL articles or to individual categories. Most of the time I am not interested in all of them, but in more than one category. Let's say 2.

So I set up two RSS URLs, one for each category. But some articles are assigned to both categories... resulting in 100% duplicate entries across the two RSS URLs.

My guess is that this is one of the easier duplicates to filter.

Liandriz commented 8 years ago

Bump on this, any news on anti-dupe posts?

For rivers/planets, a dupe is where "guid", "title", "content_bin", "link" and finally "hash" are all the same: if "content_bin" is not the same, it is a different comment from the river.

dupes

Here, lines 2 & 3 are dupes; 1 & 4 have the same link, but with different comments, from different sources.

2 options would be great:

Possible?

Alkarex commented 8 years ago

@Liandriz Yes, this is a very desirable feature, which I would like to implement. I am thinking of implementing an option to automatically mark an article as read when it is a duplicate of one already in the database.

Alkarex commented 8 years ago

But there are several details to take into account, for instance which version(s) should be marked as read, in particular when feeds are refreshed in a random order. It would probably require specifying which feed is the reference.

KhairulA commented 7 years ago

One possible solution is to make the categories sortable so that there's a hierarchy. We can then use the hierarchy to determine which source is the reference. Works like a "first-come-first-served" concept.

Kaan88 commented 6 years ago

Any updates on this? Maybe something like inoreader could be implemented.

image

Jean-Mich-Much commented 6 years ago

Hello, same here, I would also like this very useful feature :)!

gerhard-tinned commented 6 years ago

+1

thomase1993 commented 6 years ago

Hi Alkarex, I wrote a query to find the duplicated titles. I plan to make a cron job that sets the duplicates to is_read = true. It seems that my MariaDB version does not have a RANK function yet, which is why I use the CASE WHEN.

SELECT *
FROM   (SELECT ( CASE title
                   WHEN @curtype THEN @currow := @currow + 1
                   ELSE @currow := 1
                        AND @curtype := title
                 end ) + 1 AS rank,
               freshrss_thomas_entry.*
        FROM   `freshrss_thomas_entry`,
               (SELECT @currow := 0,
                       @curtype := '') r
        ORDER  BY title DESC) AS o
WHERE  rank > 1 

If someone is interested in a more complete solution, let me know. I cannot write PHP, but I am able to use the database.
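
A minimal sketch of the cron step described above (not part of FreshRSS), assuming it is acceptable to keep the oldest copy of each title and mark the rest as read; it reuses the `freshrss_thomas_entry` table from the query above and needs only a plain GROUP BY, so it also works on MariaDB versions without window functions:

UPDATE `freshrss_thomas_entry` e
JOIN (
    -- one row per duplicated title, remembering the id of the copy to keep
    SELECT title, MIN(id) AS keep_id
    FROM `freshrss_thomas_entry`
    GROUP BY title
    HAVING COUNT(*) > 1
) d ON e.title = d.title AND e.id <> d.keep_id
SET e.is_read = 1;  -- every other copy is marked as read, nothing is deleted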

Frenzie commented 6 years ago

I'm not really bothered by it regardless, but titles overlap all the time; URLs tend not to. I'd do either just the URL, or possibly URL and title. Moreover, in the cases where there's the most overlap (like Planet Debian), only the URLs match. So I'd say matching on title is false-positive heaven, yet it almost never matches when you want it to.

@Kaan88 How does that inoreader screenshot you post work exactly? Is it global? Per feed?

Alkarex commented 6 years ago

Example of duplicate from https://www.clubic.com/articles.rss

    <item>
      <title>Réseaux LoRa &amp; Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
      <description> [...]</description>
      <pubDate>Sun, 03 Jun 2018 18:45:00 +0200</pubDate>
      <link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
      <guid isPermaLink="false">843861</guid>
    </item>
    <item>
      <title>Réseaux LoRa &amp; Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
      <description>On estime que d'ici 2020, ce sont plus de 50 000 Go de données qui transiteront entre nos machines chaque seconde. Ces échanges massifs se font en partie via des réseaux sans fil, les plus répandus ét [...]</description>
      <pubDate>Sun, 03 Jun 2018 18:30:00 +0200</pubDate>
      <link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
      <guid isPermaLink="false">843857</guid>
    </item>

Jean-Mich-Much commented 5 years ago

Hello, a small bump, because this is an important feature that FreshRSS is missing. Thank you again for your work :)

Alkarex commented 5 years ago

@Jean-Mich-Much Yes, that is also something I would like to add fairly soon :-)

ghost commented 3 years ago

Hello, any news on this feature?

albrox commented 3 years ago

Easy idea maybe... a cron that filters by guid with HAVING COUNT(guid) > 1 and sets those entries to read?

vonguyenkhang commented 3 years ago

a messy hack to ignore new entries with the same title (case-insensitive):

  1. \app\Models\EntryDAO.php

    // $sql = $this->sqlIgnoreConflict(
        // 'INSERT INTO `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` (id, guid, title, author, '
        // . ($this->isCompressed() ? 'content_bin' : 'content')
        // . ', link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags) '
        // . 'VALUES(:id, :guid, :title, :author, '
        // . ($this->isCompressed() ? 'COMPRESS(:content)' : ':content')
        // . ', :link, :date, :last_seen, '
        // . $this->sqlHexDecode(':hash')
        // . ', :is_read, :is_favorite, :id_feed, :tags)');
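    // Modified statement: the INSERT now uses a SELECT with a NOT EXISTS guard,
    // so the row is skipped when an entry with the same (upper-cased) title
    // already exists in the target table.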
    $sql = $this->sqlIgnoreConflict(
        'INSERT INTO `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` (id, guid, title, author, '
        . ($this->isCompressed() ? 'content_bin' : 'content')
        . ', link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags) '
        . 'SELECT :id, :guid, :title, :author, '
        . ($this->isCompressed() ? 'COMPRESS(:content)' : ':content')
        . ', :link, :date, :last_seen, '
        . $this->sqlHexDecode(':hash')
        . ', :is_read, :is_favorite, :id_feed, :tags '
        . 'WHERE NOT EXISTS (SELECT title '
        . 'FROM `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` '
        . 'WHERE  UPPER(title)=:upper_title )');
    $this->addEntryPrepared = $this->pdo->prepare($sql);
    }
    if ($this->addEntryPrepared) {
    $this->addEntryPrepared->bindParam(':id', $valuesTmp['id']);
    $valuesTmp['guid'] = substr($valuesTmp['guid'], 0, 760);
    $valuesTmp['guid'] = safe_ascii($valuesTmp['guid']);
    $this->addEntryPrepared->bindParam(':guid', $valuesTmp['guid']);
    $valuesTmp['title'] = mb_strcut($valuesTmp['title'], 0, 255, 'UTF-8');
    $valuesTmp['title'] = safe_utf8($valuesTmp['title']);
    $this->addEntryPrepared->bindParam(':title', $valuesTmp['title']);
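    // Extra binding for the :upper_title placeholder used by the NOT EXISTS guard above.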
    $upper_title = strtoupper($valuesTmp['title']);
    $this->addEntryPrepared->bindParam(':upper_title', $upper_title);
    $valuesTmp['author'] = mb_strcut($valuesTmp['author'], 0, 255, 'UTF-8');
    $valuesTmp['author'] = safe_utf8($valuesTmp['author']);
    $this->addEntryPrepared->bindParam(':author', $valuesTmp['author']);
    $valuesTmp['content'] = safe_utf8($valuesTmp['content']);
    $this->addEntryPrepared->bindParam(':content', $valuesTmp['content']);
    $valuesTmp['link'] = substr($valuesTmp['link'], 0, 1023);
    $valuesTmp['link'] = safe_ascii($valuesTmp['link']);
    $this->addEntryPrepared->bindParam(':link', $valuesTmp['link']);
    $valuesTmp['date'] = min($valuesTmp['date'], 2147483647);
    $this->addEntryPrepared->bindParam(':date', $valuesTmp['date'], PDO::PARAM_INT);
    if (empty($valuesTmp['lastSeen'])) {
        $valuesTmp['lastSeen'] = time();
    }
  2. \app\Models\EntryDAOSQLite.php

    // $sql = '
    // DROP TABLE IF EXISTS `tmp`;
    // CREATE TEMP TABLE `tmp` AS
    // SELECT id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
    // FROM `_entrytmp`
    // ORDER BY date;
    // INSERT OR IGNORE INTO `_entry`
    // (id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags)
    // SELECT rowid + (SELECT MAX(id) - COUNT(*) FROM `tmp`) AS id,
    // guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
    // FROM `tmp`
    // ORDER BY date;
    // DELETE FROM `_entrytmp` WHERE id <= (SELECT MAX(id) FROM `tmp`);
    // DROP TABLE IF EXISTS `tmp`;
    // ';       
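    // Modified statement: same as the original above, plus a NOT EXISTS guard that
    // skips rows whose title already exists (case-insensitively) in `_entry`.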
    $sql = '
    DROP TABLE IF EXISTS `tmp`;
    CREATE TEMP TABLE `tmp` AS
    SELECT id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
    FROM `_entrytmp`
    ORDER BY date;
    INSERT OR IGNORE INTO `_entry`
    (id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags)
    SELECT rowid + (SELECT MAX(id) - COUNT(*) FROM `tmp`) AS id,
    guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
    FROM `tmp`
    WHERE NOT EXISTS ( SELECT title FROM `_entry` WHERE UPPER(`tmp`.`title`)=UPPER(`_entry`.`title`) )
    ORDER BY date;
    DELETE FROM `_entrytmp` WHERE id <= (SELECT MAX(id) FROM `tmp`);
    DROP TABLE IF EXISTS `tmp`;
    ';

WARNINGS:

Alkarex commented 3 years ago

Filtering some duplicates based on the title (in the same feed) will land with https://github.com/FreshRSS/FreshRSS/pull/3303

kosukesando commented 1 year ago

Hi, I have the 'if identical title exists' setting turned on for the top 100 articles, but it does not seem to work for my environment/feeds. Here is one example of two items from the RSS feed, and the corresponding RSS feed file. Do you have any idea why this might be? Running in a Docker container, version 1.21.0.

<item>
    <title>河野デジタル相 マイナカード窓口視察 “総点検通じ信頼回復”</title>
    <link>http://www3.nhk.or.jp/news/html/20230709/k10014123251000.html</link>
                <description><![CDATA[河野デジタル大臣は、マイナンバーカードの手続きの窓口を担う兵庫県の自治体を視察し、一連のトラブルによるマイナンバー制度への不信感について、現在、自治体などで進められている総点検を通じて信頼の回復を目指す考えを示しました。]]></description>
    <pubDate>Sun, 09 Jul 2023 05:31:00 +0000</pubDate>
    <guid isPermaLink="false">1688883303384302</guid>
</item>
<item>
    <title>河野デジタル相 マイナカード窓口視察 “総点検通じ信頼回復”</title>
    <link>http://www3.nhk.or.jp/news/html/20230709/k10014123251000.html</link>
                <description><![CDATA[河野デジタル大臣は、マイナンバーカードの手続きの窓口を担う兵庫県の自治体を視察し、一連のトラブルによるマイナンバー制度への不信感について、現在、自治体などで進められている総点検を通じて信頼の回復を目指す考えを示しました。]]></description>
    <pubDate>Sun, 09 Jul 2023 05:31:00 +0000</pubDate>
    <guid isPermaLink="false">1688883303384301</guid>
</item>

9e8vIEHL.txt

pthoelken commented 9 months ago

Is there any update on this? It's really annoying, because the news portal puts a news post in category A and also in category B. When I subscribe to both categories (for reasons) I get a lot of duplicates. @Alkarex

vonguyenkhang commented 9 months ago

Is there any update on this? It's really annoying, because the news portal puts a news post in category A and also in category B. When I subscribe to both categories (for reasons) I get a lot of duplicates. @Alkarex

I think the logic to filter duplicate entries is tricky. As you say, there are some news sources that put the same entry with the same title and the same link in multiple feeds, but there are also sources that use the same title for entries whose contents are totally different, for example https://entware.net/. And there is no rule that each entry title needs to be unique, so we can't fault anyone.

In FreshRSS, there's an option "Mark an article as read… if an identical title already exists in the top n newest articles", but it's not helpful in your case: that option only checks the entries in the same feed, not globally (please correct me if I'm wrong), whereas your news source puts the duplicate entries across multiple different feeds.

Probably the condition to label an entry as a 'duplicate' should not be based simply on the entry's title or link, but also on its content. Or the duplicate title/link filter could be applied only to a selected group of feeds, for example those from the same news source you are having issues with.

My personal solution for now is to manually tweak FreshRSS code to ignore any entry with the same title globally, and for sources that have legit entries with duplicate titles, I need to generate my own feed so that the titles are unique.

What's your suggestion? Let's take https://entware.net/ and your news source as example.

pthoelken commented 9 months ago

My personal solution for now is to manually tweak FreshRSS code to ignore any entry with the same title globally, and for sources that have legit entries with duplicate titles, I need to generate my own feed so that the titles are unique.

For a first release of this feature, this is a good start I think. What is important is that the user can choose whether the option is active or not. Then it's a fine option, I think. :-)

kosukesando commented 9 months ago

What's your suggestion? Let's take https://entware.net/ and your news source as example.

In my case I'd use the timestamp to check whether they're "identical" or not, which is most likely enough for 90% of use cases (I hope)

vonguyenkhang commented 9 months ago

What's your suggestion? Let's take https://entware.net/ and your news source as example.

In my case I'd use the timestamp to check whether they're "identical" or not, which is most likely enough for 90% of use cases (I hope)

Do you mean to check the <published> or <updated> field in the feed entry? It won't work for my case, because:

Sorry, I don't want to share my problematic news source, because it's not in any international language, which would be very inconvenient for the devs to work on. @pthoelken can you share your news source that has duplicate entries across multiple feeds?

kosukesando commented 9 months ago

there are cases where the same entry appears in different feeds at different times (for example, the news article first appears in the 'Latest' feed, then after a while it appears in 'World News', then the same article appears again in 'Technology' after a day).

I understand some feeds do that, but in my case it would work, and I assume the same would be true for most feeds. The timestamp method also solves the potential problem in the example of https://entware.net/. The detection could be based on some window of time, so in your example the articles in 'Latest' and 'World News' would be flagged as duplicates, while the one in the 'Technology' feed would not (or maybe it would, if you set the window to 1 day).

The alternative would be to somehow match the contents, which is technically possible but probably too much of a hassle for the devs. Better working than perfect, I guess.
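
To make the time-window idea concrete, here is a rough SQL sketch (illustrative only, not an existing FreshRSS feature): it marks an entry as read when a different feed already holds an entry with the same title published within a 24-hour window. The column names (title, date, id_feed, is_read) follow the entry table seen in the snippets above; the user-prefixed table name is just an example.

UPDATE `freshrss_user_entry` e
JOIN `freshrss_user_entry` d
  ON  d.title = e.title                -- same title…
  AND d.id_feed <> e.id_feed           -- …but from another feed
  AND d.id < e.id                      -- keep the copy that was stored first
  AND ABS(e.date - d.date) <= 86400    -- published within 24 hours (date is a Unix timestamp)
SET e.is_read = 1
WHERE e.is_read = 0;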

vonguyenkhang commented 9 months ago

I think matching the content globally is the safest bet, but I don't know how resource-intensive it is; that probably depends on how big the database is.

To improve efficiency, one option is to limit the search to a small time window behind the entry's timestamp, as you suggested (but the search scope would still need to be extended beyond the current feed). Another option is to apply the search only to a group of selected feeds instead of globally. FreshRSS's existing option "in the top n newest articles" would also work if the condition were extended to "in the top n newest articles of each feed" (limited global search) or "in the top n newest articles of each feed in the same category".

Currently FreshRSS doesn't delete the suspected duplicate entries but marks them as read, so it's not a serious issue if an entry is flagged wrongly. One way to side-step the issue is to allow users to decide which feed(s) should bypass all checks and show all entries as-is, in case users think those feeds are important.

Just an off-topic comment: I feel that the current UI design for subscription management is a bit lacking in terms of quick overview and batch actions. Below is a screenshot of InoReader's feed management, which I find (almost too) comprehensive:

image

[...] but probably too much of a hassle for the devs. Better working than perfect, I guess.

I think an active community is much appreciated everywhere, so just keep the feedback flowing and let the devs decide whether to adopt the ideas and plan the milestones accordingly.

Alkarex commented 9 months ago

An option, which could easily be added at category level and/or global level, is to automatically mark as read an entry if there is already another entry with the same URL. Would that help?
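
A rough sketch of the global variant of that option (illustrative only, not the actual implementation): an entry is marked as read when an older entry with the same link already exists in any feed. A category-level variant would additionally restrict the join to feeds belonging to the same category. The user-prefixed table name is again just an example.

UPDATE `freshrss_user_entry` e
JOIN `freshrss_user_entry` d
  ON  d.link = e.link    -- same URL…
  AND d.id < e.id        -- …and an older copy is already stored
SET e.is_read = 1
WHERE e.is_read = 0;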

kosukesando commented 9 months ago

I think that's a pretty solid solution.

pthoelken commented 9 months ago

Would you put this in the upcoming roadmap? :-)

Meatball13 commented 1 month ago

Heya, just wondering if there's any ETA/updates on this? I have some feeds (AP News in particular) that will publish the same article across 2-3 feeds at the same time. I think the existing functionality to mark an article as read if an identical title already exists in the top n newest articles could simply be expanded to look across all feeds, and maybe even to delete. Just let the user choose it.

Personally I'd rather even miss an article here and there if the filter/delete gets a bit aggressive than see 2-3 duplicates continuously. :) Thanks!

albrox commented 1 month ago

Seems abandoned?

mtalexan commented 1 month ago

It looks like someone added something to the UI at one point, so global options relating to automatically marking articles as read can be set via the Settings -> Reading page. That location has the same "mark as read if the title is identical to one of the last n articles" option as is present in the per-feed settings. It doesn't seem to actually do anything though: I have a handful of feeds for different specific Reuters categories, for example, and they often include the identical article in multiple feeds simultaneously, but none of them are being marked as read.

Frenzie commented 1 month ago

That works for a single feed only.

Alkarex commented 1 month ago

It is not a very difficult feature to add, but there are also plenty of other tasks to work on. So PR welcome ;-)

Meatball13 commented 1 month ago

Here's a pair of good examples. If you do need to test, AP is a good test case; you just have to generate your own feeds through a website like rss.app, because AP seems to have pulled their accessible feeds. These all got published at the same time, just across multiple feeds. I'd love to help if I had the slightest idea how, but unfortunately it's a bit outside my realm. :)

image

image

What Aaron Rodgers starting has to do with most of those categories, I don't know, but you know :)

Peronia commented 1 month ago

You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.

vonguyenkhang commented 1 month ago

You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.

FeedMergeBridge is a smart workaround to utilize the existing de-dup feature of FreshRSS without messing with the source code, thanks for sharing.

vonguyenkhang commented 1 month ago

You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.

FeedMergeBridge is a smart workaround to utilize the existing de-dup feature of FreshRSS without messing with the source code, thanks for sharing.

Sorry, FeedMergeBridge already has its own dedup using the URL as unique key (https://github.com/RSS-Bridge/rss-bridge/blob/80c43f10d83dcf4c0b9ff2707c6fe08fff8869ed/bridges/FeedMergeBridge.php#L96-L106), so there's no need to additionally enable FreshRSS's 'mark dup as read'.