feedbin / support


Issue: Duplicated Feed Entries #23

Closed ghost closed 11 years ago

ghost commented 11 years ago

This issue started happening earlier today...

I have cleared out my unread list several times this afternoon, and the duplicates come back every time there are new, unread entries. I'm not sure it's always the same duplicates - but I do know that the ones pictured below do recur often...

Also, the last time I read through all unread entries, I clicked "Mark all as read" to see if that would help. It apparently did not.

feedbin-duplicates

benubois commented 11 years ago

Thanks, I'm seeing the duplicate entry issue on some feeds. Looking into it...

benubois commented 11 years ago

Fixed.

There will be many duplicates from high volume feeds today, but no more going forward.

recurser commented 11 years ago

Seeing quite a few doubled-up posts pointing at the same URL:

example1

example2

benubois commented 11 years ago

Yeah, this is a major problem on Hacker News. An entry on HN looks like

<item>
  <title>Portal Released For Steam On Linux</title>
  <link>http://www.phoronix.com/scan.php?page=news_item&amp;px=MTM2Mzk</link>
  <comments>https://news.ycombinator.com/item?id=5647914</comments>
  <description><![CDATA[<a href="https://news.ycombinator.com/item?id=5647914">Comments</a>]]></description>
</item>

So there isn't much to uniquely identify items by. In cases where a publisher does not have an <id> or <guid> or <published>, Feedbin uses a combination of the link and title to attempt to uniquely identify items.

In your example, a period was added to the headline later on, which makes the entry look like a duplicate to a human but unique to the id generator.
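
To illustrate why a one-character edit produces a new entry, here is a minimal sketch of a link-plus-title fallback. `fallback_id` is a hypothetical helper and the feed URL is an assumption; this is not Feedbin's actual method.

```ruby
require "digest/sha1"

# Hypothetical helper: derive an id from the feed URL, link, and title
# when the feed supplies no <guid>, <id>, or <published>.
def fallback_id(feed_url, link, title)
  Digest::SHA1.hexdigest([feed_url, link, title].join)
end

feed_url = "https://news.ycombinator.com/rss" # assumed feed URL
link     = "http://www.phoronix.com/scan.php?page=news_item&px=MTM2Mzk"

original = fallback_id(feed_url, link, "Portal Released For Steam On Linux")
edited   = fallback_id(feed_url, link, "Portal Released For Steam On Linux.")

puts original == edited # false: the trailing period yields a different id
```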

recurser commented 11 years ago

Aha I see what you mean... I didn't notice that the titles are slightly different. Curious since it's the official HN feed, and they have the same HN post ID in the link, so it's the same canonical 'article' so to speak. I guess these are cases of the title being edited by HN mods after feedbin has already picked them up?

> So there isn't much to uniquely identify items by. In cases where a publisher does not have an <id> or <guid> or <published>, Feedbin uses a combination of the link and title to attempt to uniquely identify items.

In the case of hacker news, item?id=5647914 uniquely identifies it, though I realise it's a slippery slope once you start customizing things on a feed-by-feed basis.

Thanks for the explanation!

benubois commented 11 years ago

> In the case of hacker news, item?id=5647914 uniquely identifies it, though I realise it's a slippery slope once you start customizing things on a feed-by-feed basis.

Hehe exactly. As I was posting the example I noticed that the <description> on Hacker News would make for a great ID, but that would be totally unique to them.

If I do start customizing the strategy for certain feeds this is first on my list.
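
A feed-specific strategy for Hacker News might look something like the sketch below, which pulls the item id out of the `<comments>` URL. `hacker_news_item_id` is a hypothetical name; nothing in this thread confirms that Feedbin implements this.

```ruby
require "uri"

# Hypothetical per-feed strategy: extract the item id from a Hacker News
# comments URL. Returns nil when the URL carries no id parameter.
def hacker_news_item_id(comments_url)
  query = URI.parse(comments_url).query
  return nil unless query

  URI.decode_www_form(query).to_h["id"]
end

puts hacker_news_item_id("https://news.ycombinator.com/item?id=5647914")
```

Because the id stays the same when moderators edit a title, an identifier built this way would not change with the headline.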

nissimk commented 11 years ago

Use a hash of the item:

http://swik.net/RSS/RSS+Item+Uniqueness

recurser commented 11 years ago

@nissimk

> Hash – most common method, simply hashing the entire item results in a somewhat unique id. This is however vulnerable to repeated feed items.

The problem is that the title of articles is changing over time, which makes hashing pretty difficult in the absence of a canonical id (?)

benubois commented 11 years ago

Here's the id strategy for Feedbin, definitely open to suggestions, although any changes would have to maintain backward compatibility so duplicates of old entries are not created:

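require "digest/sha1"

def build_public_id(entry, feedzirra, saved_feed_url = nil)
  # Start from the saved feed URL when given, otherwise the parsed feed's URL.
  if saved_feed_url
    id_string = saved_feed_url.dup
  else
    id_string = feedzirra.feed_url.dup
  end

  if entry.entry_id
    # The publisher supplied an id (<guid> or <id>): use it alone.
    id_string << entry.entry_id.dup
  else
    # No id: fall back to whichever of link, date, and title exist.
    if entry.url
      id_string << entry.url.dup
    end
    if entry.published
      id_string << entry.published.iso8601
    end
    if entry.title
      id_string << entry.title.dup
    end
  end
  Digest::SHA1.hexdigest(id_string)
end

nissimk commented 11 years ago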

What about just excluding the title field from the item before hashing?

benubois commented 11 years ago

> What about just excluding the title field from the item before hashing?

Definitely something I considered.

There are feeds that link to the same story multiple times, so the link is not necessarily unique.

At the time I was thinking it would be better to create a duplicate than not import the item at all and potentially have missing unique items. I'm not sure which is the more common case.

Certainly with HN this is a bigger problem than other feeds.
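
The trade-off can be sketched as follows. `LinkEntry`, `id_with_title`, and `id_without_title` are hypothetical names and the two entries are made-up data; the real id also includes the feed URL and publish date.

```ruby
require "digest/sha1"

LinkEntry = Struct.new(:url, :title)

# Hypothetical id builders for the two options under discussion.
def id_with_title(entry)
  Digest::SHA1.hexdigest("#{entry.url}#{entry.title}")
end

def id_without_title(entry)
  Digest::SHA1.hexdigest(entry.url.to_s)
end

# Two distinct posts that link to the same story (made-up data).
first  = LinkEntry.new("http://example.com/story", "Initial report")
second = LinkEntry.new("http://example.com/story", "Follow-up analysis")

puts id_with_title(first) == id_with_title(second)       # false: both get imported
puts id_without_title(first) == id_without_title(second) # true: the second looks like a duplicate
```

Including the title risks duplicates when a headline is edited; excluding it risks silently dropping a genuinely new item that reuses a link.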

ghost commented 11 years ago

Note: This has also been happening to me pretty often on Engadget's feed, if you need another test case...

Zegnat commented 11 years ago

@roomanitarian that’s weird, are you subscribed to http://www.engadget.com/rss.xml? That one includes guid elements for Feedbin to use so there should not be any duplicates. If there are any duplicates it’s either because Feedbin is broken or because Engadget is changing their own unique IDs (which would be completely silly and probably means a broken CMS on their part).

ghost commented 11 years ago

@Zegnat

Actually, I just removed that feed last week, due mainly to lack of interest on my part...

However, I just added it back in to see if I could find any duplicates... I didn't have to look very hard to find these 2 sets:

engadget-dupes

recurser commented 11 years ago

It might not be ideal, but would it be possible to make 'ignore duplicate URLs' or similar a global option in feedbin settings?

Zegnat commented 11 years ago

That’s really odd @recurser.

@benubois are you sure you are using guid tags? Looking at the Aspire example I believe Engadget did not change the guid value so it should not have copied.

benubois commented 11 years ago

I see what the problem is.

Feedbin uses Feedzirra for XML parsing.

In almost all cases, <guid> and <id> are normalized into entry_id.

The exception here, and the source of this particular problem is the Feedzirra::Parser::ITunesRSSItem strategy.

In this case the <guid> is NOT being normalized to entry_id so Feedbin falls back to not including the entry_id at all and instead uses link + title.

A fix for this is tricky. If the problem were fixed upstream or in the Feedbin fork of Feedzirra, duplicates would be created for every entry of every feed that uses Feedzirra::Parser::ITunesRSSItem, so that's no good.

One workaround would be to do something like:

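# DATE_OF_ITUNES_BUG_FIX is the cutoff time at which the fix ships.
# The nil check is an added guard: undated entries keep the old behaviour.
if entry.published && entry.published > DATE_OF_ITUNES_BUG_FIX
  # Newer entries: fall back to <guid> when entry_id is missing.
  if entry.entry_id
    entry.entry_id = entry.entry_id.strip
  elsif entry.guid
    entry.entry_id = entry.guid
  else
    entry.entry_id = nil
  end
else
  # Older entries keep the original behaviour so their ids do not change.
  entry.entry_id = entry.entry_id ? entry.entry_id.strip : nil
end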

The other alternative would be to generate two ids, one for the entry before the fix and one after. I think this is more work long term because then every item needs to be checked for dupes twice forever.

Does anyone see any potential issues with fix 1?
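
For comparison, the two-id alternative could be sketched roughly like this. `ItunesEntry`, `legacy_id`, `fixed_id`, and `already_imported?` are all hypothetical, and the id builders are simplified (they omit the publish date), so treat this as an illustration of the double lookup rather than Feedbin code.

```ruby
require "digest/sha1"
require "set"

ItunesEntry = Struct.new(:feed_url, :guid, :url, :title)

# Id as generated before the fix: <guid> ignored, link + title used.
def legacy_id(entry)
  Digest::SHA1.hexdigest("#{entry.feed_url}#{entry.url}#{entry.title}")
end

# Id as generated after the fix: <guid> used when present.
def fixed_id(entry)
  return legacy_id(entry) unless entry.guid

  Digest::SHA1.hexdigest("#{entry.feed_url}#{entry.guid.strip}")
end

# An entry counts as imported if either id is already stored.
def already_imported?(entry, known_ids)
  known_ids.include?(fixed_id(entry)) || known_ids.include?(legacy_id(entry))
end

entry = ItunesEntry.new("http://example.com/podcast.xml", "episode-1",
                        "http://example.com/episodes/1", "Episode 1")
known_ids = Set[legacy_id(entry)] # stored before the fix

puts already_imported?(entry, known_ids) # true: matched through the legacy id
```

Every incoming item needs both lookups, which is the ongoing cost mentioned above.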

recurser commented 11 years ago

Looks good to me :+1:

andypearson commented 11 years ago

+1 to say I care about this issue :)

nicolashohm commented 11 years ago

+1 the problem is still present, in my case in a feed from dokuwiki

Zegnat commented 11 years ago

@nickel715, do you have an exact URL?

nicolashohm commented 11 years ago

@Zegnat https://uberspace.de/dokuwiki/feed.php

Zegnat commented 11 years ago

Hmm, that’s unrelated to this then. That feed does not seem to be an iTunes feed so the duplicates are for a separate reason. It might be related to the Pinboard feed issue as they both seem to use RSS 1 (RDF) syntax.

joshhinman commented 11 years ago

I'm still seeing this with several of my feeds, most notably MacWorld (http://rss.macworld.com/macworld/feeds/main) and LA Times (http://feeds2.feedburner.com/lanowblog)

benubois commented 11 years ago

@joshhinman Neither the Macworld nor the LA Times feed includes an <id> or <guid>, so Feedbin makes one up based on the title and link. The issue with this is that if the title or link changes at all, the id changes too, so the entry looks like a duplicate.

recurser commented 10 years ago

@benubois HN response to this issue: https://github.com/HackerNews/HN/issues/42

benubois commented 10 years ago

@recurser,

Thanks, I added a comment.

fma16 commented 10 years ago

Hi everyone, I don't know if it's the same error, but the PCInpact private feeds (they give one to their premium subscribers, like me) seem to suffer from this duplicate bug. The feed: see https://gist.github.com/Zegnat/e0524aa33fb2b286f778

feedbin 244 2014-04-16 12-59-48

Zegnat commented 10 years ago

Here is a dump of the feed. Feel free to remove the link to your private feed. I don’t see a problem there though, the items have <guid> elements et al.

benubois commented 10 years ago

Looks like they may have just switched domains nextinpact.com -> pcinpact.com.

This can cause duplicates when the domain is part of the guid.

Here's an example of a duplicated item with two distinct guids: "Les Google Glass se sont bien vendues aux États-Unis et passent à Android 4.4"

http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm
http://www.pcinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm

The way around this is to not use the domain in the guid, but a lot of blogging software does this automatically.
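
As a quick illustration using the two guids above, the strings differ only in the host, so an exact comparison treats them as separate items even though the path is identical. This only demonstrates the mismatch; it is not a suggestion that Feedbin strips hosts from guids.

```ruby
require "uri"

guid_a = "http://www.nextinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm"
guid_b = "http://www.pcinpact.com/news/87085-les-google-glass-se-sont-bien-vendues-aux-etats-unis-et-passent-a-android-4-4.htm"

# Exact string comparison: the differing domain makes these distinct ids.
puts guid_a == guid_b # false

# Comparing only the path shows they refer to the same article.
puts URI.parse(guid_a).path == URI.parse(guid_b).path # true
```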

fma16 commented 10 years ago

Yep, they made the change pcinpact.com -> nextinpact.com a couple of weeks ago, but nextinpact.com still redirects to pcinpact.com for now. Anyway, it looks like the bug doesn't happen anymore, so I'll consider it fixed for now.

Thanks for the help! :-D

svraka commented 10 years ago

Recently, duplicate items started popping up in some wordpress.com feeds like http://formerf1doc.wordpress.com/feed/ and http://britishisms.wordpress.com/feed/.