Open gllmhyt opened 9 years ago
Thanks for the suggestion. I have been thinking of introducing such an option for a while, since it is also something I need. In your case, it is the same URL for both copies of the article, isn't it?
Any progress on this? I experience a lot of duplicate articles myself as well. Case 1: duplicates within the same RSS feed source. Case 2: duplicates across multiple RSS feed sources. For case 1, remove duplicates within the source. For case 2, check whether the article already exists in one of the other sources; if it does, reject the article.
In case 2 it would be nice to compare the articles and keep the most complete one (check the number of words, check for images).
@Wanabo In your case, what do your duplicates look like, more precisely? Do they have the exact same:
- URL: may differ, because the same article comes from different sources (= different websites) but is the same article.
- URL: is the same and comes from different sources BUT from the same website, which has for example a feed for Sport and a feed for Lifestyle, with the same article in both feeds.
- Title: see URL.
- Content: see URL.
Hi,
My Situation is similar. My example is the following.
Some sites offer the option to subscribe to ALL articles or by category. Most of the time I am not interested in all of them, but in more than one category. Let's say two.
So I set up two RSS URLs, one for each category. But some articles are assigned to both categories, resulting in 100% duplicate entries across the two RSS URLs.
My guess is that this is one of the easier duplicates to filter.
Bump on this, any news on an anti-dupe option for posts?
For rivers/planets, a dupe is where "guid", "title", "content_bin", "link" and finally "hash" are all the same: if "content_bin" is not the same, it is a different comment from a river.
Here, lines 2 & 3 are dupes; 1 & 4 have the same link but different comments, from different sources.
Two options would be great:
Possible?
@Liandriz Yes, this is a very desirable feature, which I would like to implement. I am thinking to implement an option to automatically mark an article as read when it is a duplicate of one already in the database.
But there are several details to take into account, for instance which version(s) should be marked as read, in particular when feeds are refreshed in a random order. It would probably require specifying which feed is the reference.
One possible solution is to make the categories sortable so that there's a hierarchy. We can then use the hierarchy to determine which source is the reference. Works like a "first-come-first-served" concept.
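A minimal sketch of this first-come-first-served idea, assuming a user-defined priority map over feeds (the feed names, the priority values, and the `dedupe_by_priority` helper are all invented for illustration):

```python
# Hypothetical sketch: keep one "reference" copy per article URL based on a
# user-defined feed priority, and report the other copies (to be marked read).
def dedupe_by_priority(entries, feed_priority):
    """entries: list of dicts with 'link' and 'feed'.
    feed_priority: lower number = higher priority (the reference feed).
    Returns the indexes of entries to mark as read."""
    best = {}  # link -> (priority, index of best copy so far)
    for i, e in enumerate(entries):
        prio = feed_priority.get(e["feed"], float("inf"))
        if e["link"] not in best or prio < best[e["link"]][0]:
            best[e["link"]] = (prio, i)
    keep = {i for _, i in best.values()}
    return [i for i in range(len(entries)) if i not in keep]

entries = [
    {"feed": "Sport", "link": "https://example.com/a"},
    {"feed": "Lifestyle", "link": "https://example.com/a"},
    {"feed": "Sport", "link": "https://example.com/b"},
]
priority = {"Sport": 0, "Lifestyle": 1}
print(dedupe_by_priority(entries, priority))  # → [1]: the Lifestyle copy of /a
```

When two copies live in feeds of equal priority, the first one seen wins, which matches the "first-come-first-served" behaviour described above.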
Any updates on this? Maybe something like inoreader could be implemented.
Hello, same, I would like this very useful function :) !
+1
Hi @Alkarex, I wrote a query to find the duplicated titles. I plan to set up a cron job that sets the duplicates to is_read = true. It seems that my MariaDB version does not have a RANK() function yet, which is why I use the CASE WHEN trick.
SELECT *
FROM (SELECT (CASE title
                WHEN @curtype THEN @currow := @currow + 1
                ELSE @currow := 1 AND @curtype := title
              END) + 1 AS rank,
             freshrss_thomas_entry.*
      FROM `freshrss_thomas_entry`,
           (SELECT @currow := 0,
                   @curtype := '') r
      ORDER BY title DESC) AS o
WHERE rank > 1
If someone is interested in a more complete solution, let me know. I cannot write PHP, but I am able to use the database.
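The same duplicate-title lookup can also be sketched with a window function, which SQLite and recent MariaDB versions both support, avoiding the session-variable CASE WHEN trick entirely. The table and columns below are simplified stand-ins, not the actual FreshRSS schema:

```python
# Sketch: mark every copy after the first of each title as read, using
# ROW_NUMBER() instead of session variables. SQLite (3.25+) shown here.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, title TEXT, is_read INTEGER DEFAULT 0)")
con.executemany("INSERT INTO entry (title) VALUES (?)",
                [("Article A",), ("Article A",), ("Article B",)])

con.execute("""
    UPDATE entry SET is_read = 1 WHERE id IN (
        SELECT id FROM (
            SELECT id, ROW_NUMBER() OVER (PARTITION BY title ORDER BY id) AS rnk
            FROM entry
        ) WHERE rnk > 1
    )
""")
print([r[0] for r in con.execute("SELECT id FROM entry WHERE is_read = 1")])  # → [2]
```

Ordering the window by id keeps the oldest copy of each title unread, mirroring the intent of the query above.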
I'm not really bothered by it regardless but titles overlap all the time. URLs tend not to. I'd do either just URL or possibly URL and title. Moreover, in the cases where there's the most overlap (like Planet Debian) only the URLs match. So I'd say title is simultaneously false positive heaven yet almost never matching when you want it to.
@Kaan88 How does that inoreader screenshot you posted work exactly? Is it global? Per feed?
Example of duplicate from https://www.clubic.com/articles.rss
<item>
<title>Réseaux LoRa & Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
<description> [...]</description>
<pubDate>Sun, 03 Jun 2018 18:45:00 +0200</pubDate>
<link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
<guid isPermaLink="false">843861</guid>
</item>
<item>
<title>Réseaux LoRa & Sigfox : il y a une vie en-dehors de la 3G, du Bluetooth et du Wi-Fi !</title>
<description>On estime que d'ici 2020, ce sont plus de 50 000 Go de données qui transiteront entre nos machines chaque seconde. Ces échanges massifs se font en partie via des réseaux sans fil, les plus répandus ét [...]</description>
<pubDate>Sun, 03 Jun 2018 18:30:00 +0200</pubDate>
<link>http://www.clubic.com/reseau-informatique/article-843857-1-reseaux-lora-sigfox-vie-dehors-3g-bluetooth-wi-fi.html</link>
<guid isPermaLink="false">843857</guid>
</item>
Hello, a little bump because this is an important feature that is missing in FreshRSS, thank you again for your work :)
@Jean-Mich-Much Yes, this is also something that I would like to add quite soon as well :-)
Hello, any news on this feature?
Maybe an easy idea... a cron job that filters by guid with HAVING COUNT(guid) > 1 and sets those entries to read?
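A sketch of that cron idea, using SQLite and a simplified stand-in schema: guids occurring more than once are found with `HAVING COUNT(guid) > 1`, and every copy except the oldest is marked as read.

```python
# Sketch: the HAVING COUNT(guid) > 1 cron idea against a stand-in table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, guid TEXT, is_read INTEGER DEFAULT 0)")
con.executemany("INSERT INTO entry (guid) VALUES (?)", [("g1",), ("g1",), ("g2",)])

# Mark duplicated guids as read, keeping the oldest copy (lowest id) unread.
con.execute("""
    UPDATE entry SET is_read = 1
    WHERE guid IN (SELECT guid FROM entry GROUP BY guid HAVING COUNT(guid) > 1)
      AND id NOT IN (SELECT MIN(id) FROM entry GROUP BY guid)
""")
print([r[0] for r in con.execute("SELECT id FROM entry WHERE is_read = 1")])  # → [2]
```

The second condition is what keeps one copy per guid visible; without it the cron would mark every copy of a duplicated guid as read.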
A messy hack to ignore new entries with the same title (case-insensitive):
\app\Models\EntryDAO.php
// $sql = $this->sqlIgnoreConflict(
// 'INSERT INTO `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` (id, guid, title, author, '
// . ($this->isCompressed() ? 'content_bin' : 'content')
// . ', link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags) '
// . 'VALUES(:id, :guid, :title, :author, '
// . ($this->isCompressed() ? 'COMPRESS(:content)' : ':content')
// . ', :link, :date, :last_seen, '
// . $this->sqlHexDecode(':hash')
// . ', :is_read, :is_favorite, :id_feed, :tags)');
$sql = $this->sqlIgnoreConflict(
'INSERT INTO `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` (id, guid, title, author, '
. ($this->isCompressed() ? 'content_bin' : 'content')
. ', link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags) '
. 'SELECT :id, :guid, :title, :author, '
. ($this->isCompressed() ? 'COMPRESS(:content)' : ':content')
. ', :link, :date, :last_seen, '
. $this->sqlHexDecode(':hash')
. ', :is_read, :is_favorite, :id_feed, :tags '
. 'WHERE NOT EXISTS (SELECT title '
. 'FROM `_' . ($useTmpTable ? 'entrytmp' : 'entry') . '` '
. 'WHERE UPPER(title)=:upper_title )');
$this->addEntryPrepared = $this->pdo->prepare($sql);
}
if ($this->addEntryPrepared) {
$this->addEntryPrepared->bindParam(':id', $valuesTmp['id']);
$valuesTmp['guid'] = substr($valuesTmp['guid'], 0, 760);
$valuesTmp['guid'] = safe_ascii($valuesTmp['guid']);
$this->addEntryPrepared->bindParam(':guid', $valuesTmp['guid']);
$valuesTmp['title'] = mb_strcut($valuesTmp['title'], 0, 255, 'UTF-8');
$valuesTmp['title'] = safe_utf8($valuesTmp['title']);
$this->addEntryPrepared->bindParam(':title', $valuesTmp['title']);
$upper_title = strtoupper($valuesTmp['title']); // NB: strtoupper() only folds ASCII; consider mb_strtoupper() for non-ASCII titles
$this->addEntryPrepared->bindParam(':upper_title', $upper_title);
$valuesTmp['author'] = mb_strcut($valuesTmp['author'], 0, 255, 'UTF-8');
$valuesTmp['author'] = safe_utf8($valuesTmp['author']);
$this->addEntryPrepared->bindParam(':author', $valuesTmp['author']);
$valuesTmp['content'] = safe_utf8($valuesTmp['content']);
$this->addEntryPrepared->bindParam(':content', $valuesTmp['content']);
$valuesTmp['link'] = substr($valuesTmp['link'], 0, 1023);
$valuesTmp['link'] = safe_ascii($valuesTmp['link']);
$this->addEntryPrepared->bindParam(':link', $valuesTmp['link']);
$valuesTmp['date'] = min($valuesTmp['date'], 2147483647);
$this->addEntryPrepared->bindParam(':date', $valuesTmp['date'], PDO::PARAM_INT);
if (empty($valuesTmp['lastSeen'])) {
$valuesTmp['lastSeen'] = time();
}
\app\Models\EntryDAOSQLite.php
// $sql = '
// DROP TABLE IF EXISTS `tmp`;
// CREATE TEMP TABLE `tmp` AS
// SELECT id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
// FROM `_entrytmp`
// ORDER BY date;
// INSERT OR IGNORE INTO `_entry`
// (id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags)
// SELECT rowid + (SELECT MAX(id) - COUNT(*) FROM `tmp`) AS id,
// guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
// FROM `tmp`
// ORDER BY date;
// DELETE FROM `_entrytmp` WHERE id <= (SELECT MAX(id) FROM `tmp`);
// DROP TABLE IF EXISTS `tmp`;
// ';
$sql = '
DROP TABLE IF EXISTS `tmp`;
CREATE TEMP TABLE `tmp` AS
SELECT id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
FROM `_entrytmp`
ORDER BY date;
INSERT OR IGNORE INTO `_entry`
(id, guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags)
SELECT rowid + (SELECT MAX(id) - COUNT(*) FROM `tmp`) AS id,
guid, title, author, content, link, date, `lastSeen`, hash, is_read, is_favorite, id_feed, tags
FROM `tmp`
WHERE NOT EXISTS ( SELECT title FROM `_entry` WHERE UPPER(`tmp`.`title`)=UPPER(`_entry`.`title`) )
ORDER BY date;
DELETE FROM `_entrytmp` WHERE id <= (SELECT MAX(id) FROM `tmp`);
DROP TABLE IF EXISTS `tmp`;
';
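The `INSERT … WHERE NOT EXISTS` pattern used above can be exercised in isolation; below is a minimal sketch against a stand-in table. Note that SQLite's built-in `UPPER()` only folds ASCII letters, so this case-insensitive match will miss accented or non-Latin duplicate titles.

```python
# Sketch of the conditional insert: skip the row when a case-insensitive
# title match already exists. Stand-in table, not the FreshRSS schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entry (title TEXT)")
con.execute("INSERT INTO entry VALUES ('Hello World')")

def insert_unless_title_exists(con, title):
    con.execute("""
        INSERT INTO entry (title)
        SELECT :t WHERE NOT EXISTS
            (SELECT 1 FROM entry WHERE UPPER(title) = UPPER(:t))
    """, {"t": title})

insert_unless_title_exists(con, "HELLO WORLD")    # duplicate, skipped
insert_unless_title_exists(con, "Another title")  # new, inserted
print(con.execute("SELECT COUNT(*) FROM entry").fetchone()[0])  # → 2
```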
WARNINGS:
Filtering some duplicates based on the title (in the same feed) will land with https://github.com/FreshRSS/FreshRSS/pull/3303
Hi, I have the 'if identical title exists' setting turned on for the top 100 articles, but it does not seem to work for my environment/feeds. Here is one example of two items from the RSS feed, and the corresponding RSS feed file. Do you have any idea why this may be? Running in a Docker container, version 1.21.0.
<item>
<title>河野デジタル相 マイナカード窓口視察 “総点検通じ信頼回復”</title>
<link>http://www3.nhk.or.jp/news/html/20230709/k10014123251000.html</link>
<description><![CDATA[河野デジタル大臣は、マイナンバーカードの手続きの窓口を担う兵庫県の自治体を視察し、一連のトラブルによるマイナンバー制度への不信感について、現在、自治体などで進められている総点検を通じて信頼の回復を目指す考えを示しました。]]></description>
<pubDate>Sun, 09 Jul 2023 05:31:00 +0000</pubDate>
<guid isPermaLink="false">1688883303384302</guid>
</item>
<item>
<title>河野デジタル相 マイナカード窓口視察 “総点検通じ信頼回復”</title>
<link>http://www3.nhk.or.jp/news/html/20230709/k10014123251000.html</link>
<description><![CDATA[河野デジタル大臣は、マイナンバーカードの手続きの窓口を担う兵庫県の自治体を視察し、一連のトラブルによるマイナンバー制度への不信感について、現在、自治体などで進められている総点検を通じて信頼の回復を目指す考えを示しました。]]></description>
<pubDate>Sun, 09 Jul 2023 05:31:00 +0000</pubDate>
<guid isPermaLink="false">1688883303384301</guid>
</item>
Is there any update on this? It's really annoying because the news portal puts a post in category A and also in category B. When I subscribe to both categories (for reasons) I get a lot of duplicates. @Alkarex
I think the logic to filter duplicate entries is tricky. As you say, some news sources put the same entry, with the same title and link, in multiple feeds; but there are also sources that use the same titles for entries whose contents are totally different, for example https://entware.net/. And there's no rule that each entry title needs to be unique, so we can't fault anyone.
In FreshRSS, there's an option "Mark an article as read… if an identical title already exists in the top n newest articles", but it's not helpful in your case, because that option only checks entries in the same feed, not globally (please correct me if I'm wrong), while your news source puts the duplicate entries across multiple different feeds.
Probably the condition to label an entry as a 'duplicate' should be based not only on the entry's title or link but also on its content. Or the duplicate title/link filter could be applied only to a selected group of feeds, for example those from the news source you are having issues with.
My personal solution for now is to manually tweak FreshRSS code to ignore any entry with the same title globally, and for sources that have legit entries with duplicate titles, I need to generate my own feed so that the titles are unique.
What's your suggestion? Let's take https://entware.net/ and your news source as example.
> My personal solution for now is to manually tweak FreshRSS code to ignore any entry with the same title globally, and for sources that have legit entries with duplicate titles, I need to generate my own feed so that the titles are unique.
For a first release of this feature, this is a good start I think. What's important is that the user can choose whether the option is active or not. Then it's a fine option, I think. :-)
> What's your suggestion? Let's take https://entware.net/ and your news source as example.
In my case I'd use the timestamp to check whether they're "identical" or not, which is most likely enough for 90% of use cases (I hope)
> What's your suggestion? Let's take https://entware.net/ and your news source as example.
> In my case I'd use the timestamp to check whether they're "identical" or not, which is most likely enough for 90% of use cases (I hope)
Do you mean to check the <published> or <updated> field in the feed entry? It won't work for my case, because:
Sorry, I don't want to share my problematic news source, because it's not in any international language, which would be very inconvenient for the devs to work with. @pthoelken can you share your news source that has duplicate entries across multiple feeds?
There are cases where the same entry appears in different feeds at different times (for example, a news article first appears in the 'Latest' feed, then after a while in 'World News', then the same article appears again in 'Technology' after a day).
I understand some feeds do that, but in my case it would work, and I assume the same would be true for most feeds. The timestamp method also solves the potential problem in the https://entware.net/ example. The detection could be based on some window of time, so in your example the articles in 'Latest' and 'World News' would be flagged as duplicates, while the one in the 'Technology' feed would not (or maybe it would, if you set the window to one day).
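The time-window idea can be sketched as a simple predicate; the field names and the 1-day default window are assumptions for illustration, not anything FreshRSS currently does:

```python
# Sketch: two entries count as duplicates only when they share a link AND
# their timestamps fall within a configurable window.
from datetime import datetime, timedelta

def is_duplicate(a, b, window=timedelta(days=1)):
    """a, b: dicts with 'link' and 'published' (datetime). Assumed shape."""
    return a["link"] == b["link"] and abs(a["published"] - b["published"]) <= window

latest = {"link": "https://example.com/x", "published": datetime(2023, 7, 9, 5, 31)}
world  = {"link": "https://example.com/x", "published": datetime(2023, 7, 9, 6, 0)}
tech   = {"link": "https://example.com/x", "published": datetime(2023, 7, 11, 5, 31)}

print(is_duplicate(latest, world))  # True: same link, 29 minutes apart
print(is_duplicate(latest, tech))   # False: outside the 1-day window
```

Widening the window catches the delayed 'Technology' repost in the example above, at the cost of more false positives on sources that legitimately reuse links.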
The alternative would be to somehow match the contents, which is technically possible but probably too much of a hassle for the devs. Better working than perfect, I guess.
I think matching the content globally is the safest bet, but I don't know how resource-intensive it would be; that probably depends on how big the database is.
To improve efficiency, one option is to limit the search to a small time window behind the entry's timestamp, as you suggested (while still extending the search scope beyond the current feed). Another option is to apply the search only to a group of selected feeds instead of globally. FreshRSS's existing option "in the top n newest articles" would also work if the condition were extended to "in the top n newest articles of each feed" (a limited global search) or "in the top n newest articles of each feed in the same category".
Currently FreshRSS doesn't delete suspected duplicate entries but marks them as read, so it's not a serious issue if an entry is flagged wrongly. One way to side-step the issue is to let users decide which feed(s) bypass all checks and show all entries as-is, in case they consider those feeds important.
Just an off-topic comment: I feel that the current UI design for subscription management is a bit lacking in terms of quick overview and batch actions. Below is a screenshot of InoReader's feed management, which I find (perhaps too) comprehensive:
> [...] but probably too much of a hassle for the devs. Better working than perfect, I guess.
I think an active community is much appreciated everywhere, so just keep the feedback flowing and let the devs decide whether to adopt the ideas and plan the milestones accordingly.
An option, which could easily be added at category level and/or global level, is to automatically mark as read an entry if there is already another entry with the same URL. Would that help?
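A rough sketch of how such an option could behave, assuming a "scope" is a set of feed ids (e.g. a category); the schema and the `add_entry` helper are invented for illustration:

```python
# Sketch: when a new entry arrives, mark it as read if another entry with
# the same URL already exists within the chosen scope of feeds.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entry (id INTEGER PRIMARY KEY, link TEXT, id_feed INTEGER, is_read INTEGER)")

def add_entry(con, link, id_feed, scope_feeds):
    """Insert an entry; mark it read if the same link exists in scope_feeds."""
    placeholders = ",".join("?" * len(scope_feeds))
    dup = con.execute(
        f"SELECT 1 FROM entry WHERE link = ? AND id_feed IN ({placeholders})",
        (link, *scope_feeds)).fetchone() is not None
    con.execute("INSERT INTO entry (link, id_feed, is_read) VALUES (?, ?, ?)",
                (link, id_feed, int(dup)))

category = {1, 2}  # two feeds belonging to the same category
add_entry(con, "https://example.com/a", 1, category)  # first copy: unread
add_entry(con, "https://example.com/a", 2, category)  # duplicate: read
print([r[0] for r in con.execute("SELECT is_read FROM entry ORDER BY id")])  # → [0, 1]
```

Passing all feed ids as the scope would give the global variant of the same option.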
I think that's a pretty solid solution.
Would you put this in the upcoming roadmap? :-)
Heya, just wondering if there's any ETA/updates on this? Have some feeds (AP News in particular) that will publish the same article across 2-3 feeds at the same time. I think if whatever functionality is being used to Mark an article as read if an identical title already exists in top n newest articles could just be expanded to look across all feeds and maybe even delete. Just let the user choose it.
Personally I'd rather even miss an article here and there if the filter/delete gets a bit aggressive than see 2-3 duplicates continuously. :) Thanks!
Seems abandoned?
It looks like someone added something to the UI at one point, so global options for automatically marking articles as read can be set via the Settings -> Reading page. That page has the same "mark as read if the title is identical to one of the last n articles" option as the per-feed settings. It doesn't seem to actually do anything, though: I have a handful of feeds for different specific Reuters categories, for example, and they often include the identical article in multiple feeds simultaneously, but none of them are being marked as read.
That works for a single feed only.
It is not a very difficult feature to add, but there are also plenty of other tasks to work on. So PR welcome ;-)
Here's a pair of good examples. If you do need to test, AP is a good test case; you just have to generate your own feeds through a website like rss.app, because AP seems to have pulled their publicly accessible feeds. These all got published at the same time, just across multiple feeds. I'd love to help if I had the slightest idea how, but unfortunately it's a bit outside my realm. :)
What Aaron Rodgers starting has anything to do with most of those categories, I don't know, but you know :)
You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.
> You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.
FeedMergeBridge is a smart workaround to utilize the existing de-dup feature of FreshRSS without messing with the source code, thanks for sharing.
> You could use RSS-Bridge for that. In my case it works fine, no more dupes. Have a look at the FeedMergeBridge.
> FeedMergeBridge is a smart workaround to utilize the existing de-dup feature of FreshRSS without messing with the source code, thanks for sharing.
Sorry, FeedMergeBridge already has its own dedup using the URL as unique key (https://github.com/RSS-Bridge/rss-bridge/blob/80c43f10d83dcf4c0b9ff2707c6fe08fff8869ed/bridges/FeedMergeBridge.php#L96-L106), so there's no need to further enable FreshRSS's 'mark dup as read'.
Here's the situation:
Once in a while, I get duplicates: the article from the planet subscription and the one from the blog subscription.
I'm wondering if it's possible to think about a way to "clear" the displayed articles in those cases and display only one of them (ideally the original one, not the copies on the planets).