fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application
https://selfoss.aditu.de
GNU General Public License v3.0

Duplicate items #208

Closed Etenil closed 11 years ago

Etenil commented 11 years ago

The feed reader sometimes fails to detect the presence of an item within the database and thus creates duplicate items.

This only happens with Slashdot's RSS feed on my server. Slashdot uses a very long hyperlink as the uid, which might be the cause of the problem.

I have been able to work around this on my server by modifying the dao and update code so that an item's presence in the database is determined by both its uid and its link. So far it has worked OK, but I don't think it's a clean fix, so I won't commit it.
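
Roughly, the presence check I ended up with looks like this (only a sketch; the real change is in the PHP dao code, and the parameter names are illustrative):

SELECT id
FROM items
WHERE source = :source
  AND uid = :uid
  AND link = :link;

If that returns no row, the item is treated as new and gets inserted.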

GLLM commented 11 years ago

Hi Etenil,

I also have many duplicate articles and I'd like to clean the DB. Do you have a working solution for this? It could be of interest to many of us.

Thanks

Etenil commented 11 years ago

Hi GLLM,

I don't have a solution for cleaning up the DB; what I've done so far only prevents fetching duplicate items (so it'll help for future updates).

My patch is essentially a dirty fix that doesn't address the main issue here. The problem is that selfoss fails to check for the presence of an existing item in the DB. My thinking is that it's somehow related to the uid being too long. But instead of fixing that, I'm just using the title field together with the uid to discriminate the items. I hope the selfoss dudes will have addressed this properly in the meantime.

With that in mind, and since you ask so nicely, I've forked the repo and will put my patch in there today.

GLLM commented 11 years ago

That is kind of you.

If, as a first step, it could enable us to wipe the dupes, that'd be great!

Thank you very much :) GLLM

SSilence commented 11 years ago

Can you post an example feed? Some feeds have wrong ids, and then selfoss cannot differentiate the articles properly. Changing this id generation is a very critical part. I use the SimplePie mechanism, which uses the id given by the feed and falls back to a content-based md5 hash. A possible solution would be a dedicated spout for RSS feeds with problematic ids.
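
In SQL terms the id selection is roughly this (only an illustration; SimplePie does it in PHP, and the placeholder names are made up):

-- Use the id given by the feed when present, otherwise derive a
-- stable id from a hash of the item content.
SELECT COALESCE(NULLIF(:feed_guid, ''), MD5(CONCAT(:title, :content, :link))) AS uid;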

Etenil commented 11 years ago

@SSilence just grab Slashdot's RSS feed; it consistently produces duplicates, so you shouldn't have too much trouble diagnosing it.

Etenil commented 11 years ago

@SSilence You can just use Slashdot's feed; it consistently creates duplicates. You'll notice that the articles' uid is, well, peculiar ;-). Not sure what needs fixing; I didn't dig much into selfoss's code. Like I said, my quick and dirty fix is to use both the uid and the link to ensure the record is unique, but that's not a good thing.

@GLLM I've published my changes.

SSilence commented 11 years ago

I have tested the Slashdot feed and no duplicate entries occurred. But I will wait until Slashdot updates its feed. I think it's possible that the guid changes on every feed update; in that case all items would be fetched again.

GLLM commented 11 years ago

I'll wait for your feedback to decide whether I should consider another solution, such as manual SQL, to remove duplicates. I have too many of them. And I believe I do not run concurrent updates, since my cron is launched once every 60 minutes...
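
Something like this sketch is what I have in mind for the manual cleanup, keeping the oldest row of each duplicate group (whether source, title and link are the right duplicate criteria is a guess, and I'd back up the db first):

DELETE FROM items
WHERE id NOT IN (
    -- keep the first-inserted row of every duplicate group
    SELECT MIN(id)
    FROM items
    GROUP BY source, title, link
);

I understand this runs as-is on SQLite, while MySQL would need the subquery wrapped in a derived table.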

Thanks, GLLM

SSilence commented 11 years ago

@GLLM: Do you have a few example feeds with duplicate entries?

@Etenil: That's really strange; I have updated the Slashdot feed today and don't get duplicates. I have tested this with SQLite. I have seen that you use MySQL. I will test again with MySQL.

SSilence commented 11 years ago

Okay, I have found the problem. Slashdot's uids have more than 255 characters. SQLite doesn't care about this and compares the first 255 characters. MySQL will never find the existing items and returns an empty result.

Now I generate an md5 hash for uids with more than 255 characters.
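
For existing rows the equivalent one-off cleanup would be something like this (MySQL syntax; only a sketch, since the actual fix happens in PHP when items are inserted):

-- Collapse over-long uids to a fixed-length digest so they fit the
-- 255-character column and compare consistently in both databases.
UPDATE items
SET uid = MD5(uid)
WHERE CHAR_LENGTH(uid) > 255;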

Please reopen this issue if duplicate items occur again.

GLLM commented 11 years ago

@SSilence sorry for answering so late... I've had duplicates on (among many others):

Thanks GLLM

binghuiyin commented 11 years ago

I still have duplicates, such as the following:

http://avaxhome.ws/ebooks/programming_development/rss.xml

(screenshot attached)

seanrand commented 11 years ago

@binghuiyin: Which database backend do you use and how do you call update.php to update your feeds?

The only way I'm still able to create a few duplicates in the db is when I run two instances of update.php in parallel:

$ sqlite3 data/sqlite/selfoss.db
sqlite> SELECT id, source, datetime, title, count(*) FROM items GROUP BY title, datetime HAVING count(*) > 1;
id          source      datetime             title                                                  count(*)
----------  ----------  -------------------  -----------------------------------------------------  ----------
1017        16          2013-04-11 22:23:35  Google Relaxes DMCA Takedown Restrictions, Eyes Abuse  2
1022        3           2013-04-11 21:48:47  Kurdish rebels prepare for peace                       2
binghuiyin commented 11 years ago

I added a cron job with an hourly update. It is on DreamHost.


GLLM commented 11 years ago

Hourly cron job on an SQLite db... I'm getting duplicates every day; not too many, but still getting them.

binghuiyin commented 11 years ago

@SSilence please re-open this issue; see the attached screenshot.

GLLM commented 11 years ago

Dupes & dupes again !

It's a pain ... FYI : hourly cron with SQLite. No other manual updates of course screenshot_4

SSilence commented 11 years ago

I have subscribed to the three feeds and will test. I don't know how duplicate entries can occur, because the feeds and their uids seem to be okay.

Are you all using the newest version of selfoss?

I will try to find this bug.

Etenil commented 11 years ago

Something strikes me as wrong with the way feeds are handled. Indeed, according to the RSS 2.0 specification, the guid element is not required at all:

All elements of an item are optional, however at least one of title or description must be present.

See http://www.rssboard.org/rss-specification#hrelementsOfLtitemgt

GLLM commented 11 years ago

@SSilence I am updating to the latest available code every day! Still I have the dupes :(

SSilence commented 11 years ago

I have subscribed to the RSS feeds

And I have no duplicate items. Hmm, that's very strange. Could this be a particular SQLite driver version that is not compatible with Fat-Free?

binghuiyin commented 11 years ago

@SSilence I am using v2.6 now, with MySQL. Here is another RSS feed for your test; I have duplicated items. It is a feed in Chinese: http://feed.36kr.com/c/33346/f/566026/index.rss (screenshot attached)

binghuiyin commented 11 years ago

@SSilence I did a search in the database by link. It returned three duplicated items; everything looks the same except the uid. (screenshot attached)

seanrand commented 11 years ago

@SSilence: Why don't you just deal with this at the database level and make the uid a primary key? [Edit: That obviously still wouldn't help with feeds like the one binghuiyin linked, but this would -->] Or make title, content and link a compound key. Using keys would also be a better fix for #89.
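
A sketch of that compound key (illustrative DDL; I left content out because MySQL cannot index a full TEXT column without a prefix length, and title may need a prefix too depending on its type):

-- Make a second insert of the same article fail instead of
-- silently creating a duplicate row.
CREATE UNIQUE INDEX items_title_link ON items (title, link);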

Edit: I just looked at the feed binghuiyin linked and that looks like an issue with the feed... they are generating different GUIDs for the same item. My guess is that the GUID uses the current date and thus every 24 hours every item has a new GUID.

RSS implementations really are a mess.

SSilence commented 11 years ago

Yes, it seems the feed provider generates a new uid every day. It's really hard to handle these problems :(

ghost commented 10 years ago

Was this ever fixed? I am getting the same issue.

andreimarcu commented 10 years ago

I am getting the same issue here too.

A lot of feed generators don't follow the specification, and I had much more luck using the entry's link rather than the uid as a key.

Just my 2 cents.

cgelici commented 10 years ago

I'm getting it too.

(screenshot attached)

ghost commented 10 years ago

Can't you filter them out in the application?

bitvijays commented 7 years ago

@SSilence Thank you for authoring selfoss; it helps a lot. Also, the issue of duplicates still exists: I am getting duplicates for Packetstorm News / Packetstorm Files. Perhaps you might want to check the "Link" and "Source" in the database to see if an item is already present?

jtojnar commented 7 years ago

@bitvijays Can you locate the duplicate articles in the database?

niol commented 7 years ago

The results of a query such as this one would help us find the right solution to this:

SELECT items.uid, items2.uid, items.title, items2.title
FROM items, items AS items2
WHERE items.source = items2.source
  AND items.id > items2.id
  AND items.link = items2.link;

I have a feed that provides new items with the same link, so the link alone is not good. Also, I'm against deduplicating across multiple feeds, because that would allow one feed to prevent items from other feeds from getting into the db.

bitvijays commented 7 years ago

selfoss.zip

@niol @jtojnar Here's the zip file containing the SQLite database (only 1 MB). Hopefully this provides more insight.

jtojnar commented 7 years ago

I do not understand how this could happen in proper operation. On the bright side, I fixed favicons for two of your feeds 😉

niol commented 7 years ago

Duplicated uids may only happen if multiple parallel updates are running, for instance if a cron job is running while a manual update is triggered.

I proposed something to fix this some time ago (see #597), which used a file lock to prevent concurrent updates of the same source. Another option would be to add a constraint on the items table ensuring that (uid, source) is unique and to handle the INSERT error properly.
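
The constraint option would look roughly like this (the index name is made up):

-- A second concurrent insert of the same item then fails with a
-- constraint violation that the updater can catch and skip.
CREATE UNIQUE INDEX items_uid_source ON items (uid, source);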

jtojnar commented 7 years ago

@niol I think using a UNIQUE constraint is preferable due to the lower number of moving parts. Additionally, instead of handling an error, an UPSERT could possibly be used, though it is hairy on PostgreSQL < 9.5.
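
Roughly like this, on top of a (uid, source) unique index (PostgreSQL 9.5+ syntax, with the column list assumed; SQLite gained the same clause in 3.24, and MySQL would use INSERT IGNORE or ON DUPLICATE KEY UPDATE instead):

-- Insert the item, silently skipping it when the same (uid, source)
-- pair already exists.
INSERT INTO items (source, uid, title, link, content, datetime)
VALUES (:source, :uid, :title, :link, :content, :datetime)
ON CONFLICT (uid, source) DO NOTHING;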