Thanks, seeing the duplicate entry issue on some feeds. Looking into it...
Fixed.
There will be many duplicates from high volume feeds today, but no more going forward.
Seeing quite a few doubled-up posts pointing at the same URL:
Yeah, this is a major problem on Hacker News. An entry on HN looks like
<item>
  <title>Portal Released For Steam On Linux</title>
  <link>http://www.phoronix.com/scan.php?page=news_item&px=MTM2Mzk</link>
  <comments>https://news.ycombinator.com/item?id=5647914</comments>
  <description><![CDATA[<a href="https://news.ycombinator.com/item?id=5647914">Comments</a>]]></description>
</item>
So there isn't much to uniquely identify items by. In cases where a publisher does not have an <id>, <guid>, or <published>, Feedbin uses a combination of the link and title to attempt to uniquely identify items.
In your example, a period was added to the headline later on, which makes it look like a duplicate to a human, but it looks unique to the id generator.
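For illustration (not Feedbin's actual code, which is posted further down), here's roughly how a one-character title edit changes an id that is derived from link + title:

require "digest"

# Hypothetical illustration: the same HN item before and after a period was
# added to the headline. With no <guid>, an id built from link + title changes
# whenever the title changes, so the edited entry looks like a new one.
link = "http://www.phoronix.com/scan.php?page=news_item&px=MTM2Mzk"
id_before = Digest::SHA1.hexdigest(link + "Portal Released For Steam On Linux")
id_after  = Digest::SHA1.hexdigest(link + "Portal Released For Steam On Linux.")
id_before == id_after # => false, so the item is imported a second time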
Aha, I see what you mean... I didn't notice that the titles are slightly different. Curious, since it's the official HN feed and they have the same HN post ID in the link, so it's the same canonical 'article' so to speak. I guess these are cases of the title being edited by HN mods after Feedbin has already picked them up?
In the case of hacker news, item?id=5647914 uniquely identifies it, though I realise it's a slippery slope once you start customizing things on a feed-by-feed basis.
Thanks for the explanation!
Hehe, exactly. As I was posting the example I noticed that the <description> on Hacker News would make for a great ID, but that would be totally unique to them.
If I do start customizing the strategy for certain feeds this is first on my list.
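For what it's worth, a per-feed override could be as small as pulling the numeric id out of the link. This is hypothetical, not current Feedbin behavior:

# Hypothetical HN-specific id strategy: the item?id=... parameter uniquely
# identifies the submission regardless of later title edits.
def hn_entry_id(url)
  url[/item\?id=(\d+)/, 1] # => "5647914", or nil if it's not an HN item URL
end

hn_entry_id("https://news.ycombinator.com/item?id=5647914") # => "5647914"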
Use a hash of the item:
@nissimk
Hash – most common method, simply hashing the entire item results in a somewhat unique id. This is however vulnerable to repeated feed items.
The problem is that article titles change over time, which makes hashing pretty difficult in the absence of a canonical id (?)
Here's the id strategy for Feedbin, definitely open to suggestions, although any changes would have to maintain backward compatibility so duplicates of old entries are not created:
def build_public_id(entry, feedzirra, saved_feed_url = nil)
  if saved_feed_url
    id_string = saved_feed_url.dup
  else
    id_string = feedzirra.feed_url.dup
  end
  if entry.entry_id
    id_string << entry.entry_id.dup
  else
    if entry.url
      id_string << entry.url.dup
    end
    if entry.published
      id_string << entry.published.iso8601
    end
    if entry.title
      id_string << entry.title.dup
    end
  end
  Digest::SHA1.hexdigest(id_string)
end
What about just excluding the title field from the item before hashing?
Definitely something I considered.
There are feeds that link to the same story multiple times, so the link is not necessarily unique.
At the time I was thinking it would be better to create a duplicate than to not import the item at all and potentially miss unique items. I'm not sure which is the more common case.
Certainly with HN this is a bigger problem than other feeds.
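To make that trade-off concrete, here's a hypothetical case (made-up URL) where dropping the title would merge two genuinely different entries:

require "digest"

# Hypothetical feed that posts several distinct entries pointing at one URL,
# e.g. a liveblog. With the title excluded, both entries hash to the same id
# and the second one would be skipped as a duplicate.
link = "http://example.com/liveblog"
id_first  = Digest::SHA1.hexdigest(link) # title "Keynote begins" excluded
id_second = Digest::SHA1.hexdigest(link) # title "Keynote ends" excluded
id_first == id_second # => true, two different posts collapse into one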
Note: This has also been happening to me pretty often on Engadget's feed, if you need another test case...
@roomanitarian that’s weird, are you subscribed to http://www.engadget.com/rss.xml? That one includes guid elements for Feedbin to use, so there should not be any duplicates. If there are any duplicates it’s either because Feedbin is broken or because Engadget is changing their own unique IDs (which would be completely silly and probably means a broken CMS on their part).
@Zegnat
Actually, I just removed that feed last week, due mainly to lack of interest on my part...
However, I just added it back in to see if I could find any duplicates... I didn't have to look very hard to find these 2 sets:
It might not be ideal, but would it be possible to make 'ignore duplicate URLs' or similar a global option in feedbin settings?
That’s really odd @recurser.
@benubois are you sure you are using guid tags? Looking at the Aspire example I believe Engadget did not change the guid value, so it should not have been duplicated.
I see what the problem is.
Feedbin uses Feedzirra for XML parsing. In almost all cases, <guid> and <id> are normalized into entry_id.
The exception here, and the source of this particular problem, is the Feedzirra::Parser::ITunesRSSItem strategy. In this case the <guid> is NOT being normalized to entry_id, so Feedbin falls back to not including the entry_id at all and instead uses link + title.
A fix for this is tricky. If the problem were fixed upstream or in the Feedbin fork of Feedzirra, duplicates would be created for every entry of every feed that uses Feedzirra::Parser::ITunesRSSItem, so that's no good.
One workaround would be to do something like:
if entry.published > DATE_OF_ITUNES_BUG_FIX
  # Entries published after the fix date can safely use the guid
  if entry.entry_id
    entry.entry_id = entry.entry_id.strip
  elsif entry.guid
    entry.entry_id = entry.guid
  else
    entry.entry_id = nil
  end
else
  # Older entries keep the pre-fix behavior so their ids don't change
  entry.entry_id = entry.entry_id ? entry.entry_id.strip : nil
end
The other alternative would be to generate two ids, one for the entry before the fix and one after. I think this is more work long term because then every item needs to be checked for dupes twice forever.
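A rough sketch of what that would look like (the Entry.where lookup and public_id column are assumptions, shown only to illustrate the double check every new item would need forever):

# Sketch of the two-id alternative: compute the pre-fix id (link + title fallback)
# and the post-fix id (from the guid), then treat the item as a duplicate if
# either one already exists. The Entry lookup below is hypothetical.
legacy_id  = Digest::SHA1.hexdigest(feed_url + entry.url + entry.published.iso8601 + entry.title)
correct_id = Digest::SHA1.hexdigest(feed_url + entry.entry_id)
duplicate  = Entry.where(public_id: [legacy_id, correct_id]).exists?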
Does anyone see any potential issues with fix 1?
Looks good to me :+1:
+1 to say I care about this issue :)
+1 the problem is still present, in my case in a feed from dokuwiki
@nickel715, do you have an exact URL?
Hmm, that’s unrelated to this then. That feed does not seem to be an iTunes feed so the duplicates are for a separate reason. It might be related to the Pinboard feed issue as they both seem to use RSS 1 (RDF) syntax.
I'm still seeing this with several of my feeds, most notably MacWorld (http://rss.macworld.com/macworld/feeds/main) and LA Times (http://feeds2.feedburner.com/lanowblog)
@joshhinman Both the Macworld and LA Times feeds don't include an <id> or <guid>, so Feedbin makes one up based on the title and link. The issue with this is that if the title or link changes at all, the id changes too and the entry looks like a duplicate.
@benubois HN response to this issue: https://github.com/HackerNews/HN/issues/42
Hi everyone, I don't know if it's the same error, but the PCInpact private feed (they give one to their premium subscribers, like me) seems to suffer from this duplicate bug. The feed (see https://gist.github.com/Zegnat/e0524aa33fb2b286f778)
Here is a dump of the feed. Feel free to remove the link to your private feed. I don’t see a problem there though, the items have <guid> elements et al.
Looks like they may have just switched domains, nextinpact.com -> pcinpact.com.
This can cause duplicates when the domain is part of the guid.
Here's an example of a duplicated item with two distinct guids: "Les Google Glass se sont bien vendues aux États-Unis et passent à Android 4.4"
http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm
http://www.pcinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm
The way around this is to not use the domain in the guid, but a lot of blogging software does this automatically.
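On the consumer side, one hedge against this (not something Feedbin does today, just a sketch) would be to strip the host from URL-shaped guids before hashing, so a domain move doesn't change the id:

require "uri"

# Hypothetical guid normalization: if the guid parses as a URL, keep only the
# path and query so pcinpact.com/... and nextinpact.com/... yield the same id.
def normalized_guid(guid)
  uri = URI.parse(guid)
  uri.host ? [uri.path, uri.query].compact.join("?") : guid
rescue URI::InvalidURIError
  guid
end

normalized_guid("http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm")
# => "/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm"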
Yep, they made the change pcinpact.com -> nextinpact.com a couple weeks ago, but nextinpact.com still redirects to pcinpact.com for now.
Anyway, it looks like the bug doesn't happen anymore, so I'll consider it fixed for now.
Thanks for the help! :-D
Recently, duplicate items started popping up in some wordpress.com feeds like http://formerf1doc.wordpress.com/feed/ and http://britishisms.wordpress.com/feed/.
This issue started happening earlier today...
I have cleared out my unread list several times this afternoon, and the duplicates come back every time there are new, unread entries. I'm not sure it's always the same duplicates - but I do know that the ones pictured below do recur often...
Also, the last time I read through all unread entries, I clicked "Mark all as read" to see if that would help. It apparently did not.