barisusakli / nodebb-plugin-rss

A NodeBB Plugin to post topics using RSS feeds

RSS feed entries published 4 times #8

Closed frenchja closed 7 years ago

frenchja commented 9 years ago

New RSS subscriptions, when published to a category, duplicate each entry multiple times. This happened after a fresh install and after the nodebb-plugin-rss hashes and keys were deleted from Redis. Could this be a problem with cron? Interestingly, when an entry is deleted and purged, the duplicates are blank:

[screenshot, 2015-01-14 at 2:49:07 pm]

This might indicate that the content isn't duplicated, just the topic.

barisusakli commented 9 years ago

I'm not sure why this happens. The plugin saves the lastEntry date into the database and, on the next cron tick, posts only the entries that have a newer entry date. https://github.com/barisusakli/nodebb-plugin-rss/blob/master/index.js#L137-L156

So unless the feed entry.publishedDate is changing to a newer date there shouldn't be duplicates.
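
For readers following along, here is a minimal sketch of the "post only newer entries" check described above. It is not the plugin's actual code: `selectNewEntries` and the sample entries are hypothetical, and only the field names `publishedDate` and `lastEntryDate` come from this thread.

```javascript
// Hypothetical helper: keep only entries published after the stored marker,
// oldest first, so the marker can be advanced as each one is posted.
function selectNewEntries(entries, lastEntryDate) {
  return entries
    .map((entry) => ({ entry, time: new Date(entry.publishedDate).getTime() }))
    .filter(({ time }) => !Number.isNaN(time) && time > lastEntryDate)
    .sort((a, b) => a.time - b.time)
    .map(({ entry }) => entry);
}

// lastEntryDate is a millisecond timestamp saved after the previous run.
const lastEntryDate = Date.parse('Sat, 07 Feb 2015 05:30:49 -0800');

const entries = [
  { title: 'Older entry', publishedDate: 'Fri, 06 Feb 2015 19:40:23 -0800' },
  { title: 'Newer entry', publishedDate: 'Sat, 07 Feb 2015 05:57:55 -0800' },
];

const fresh = selectNewEntries(entries, lastEntryDate);
console.log(fresh.map((e) => e.title)); // only 'Newer entry' passes the filter
```

With a check like this, an entry is posted again only if the feed later reports a newer publishedDate for it.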

frenchja commented 9 years ago

Is lastEntryDate a UNIX timestamp? Looking at the field value for HMGET "nodebb-plugin-rss:feed:https://www.reddit.com/r/theworstofnetflix/.rss" "lastEntryDate" gets me 1420160824000, which is Sun, 07 Feb 46973 13:46:40 GMT?

The same goes for a different entry, HMGET "nodebb-plugin-rss:feed:http://www.mst3kinfo.com/?feed=rss2" "lastEntryDate". There are three extra 0s appended at the end, creating an incorrect timestamp:

1421211715 vs. 1421211715000

barisusakli commented 9 years ago

It is in milliseconds, I believe, so it would be Fri, 02 Jan 2015 01:07:04 GMT.
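
A quick way to check the unit, using the value quoted above. JavaScript's Date takes milliseconds since the Unix epoch, so the stored number can be passed in directly; tools that expect seconds need it divided by 1000.

```javascript
const stored = 1420160824000; // value read from the lastEntryDate field

// Interpreted as milliseconds, which is what JavaScript's Date expects.
const asMilliseconds = new Date(stored).toUTCString();

// Converted to seconds for tools that expect a classic UNIX timestamp.
const asSeconds = Math.floor(stored / 1000);

console.log(asMilliseconds); // Fri, 02 Jan 2015 01:07:04 GMT
console.log(asSeconds);      // 1420160824
```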

frenchja commented 9 years ago

Ah, thanks for the clarification. My fault.

barisusakli commented 9 years ago

I'll make a commit. The timestamps are probably stored as strings and need to go through parseInt.
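
As a generic illustration of why parsing matters (this is not a description of what the commit changes): values read back from a Redis hash arrive as strings, and comparing two strings in JavaScript is lexicographic rather than numeric. The variable names and the second timestamp below are made up for the example.

```javascript
// Both values are strings, as they would be when read from a Redis hash.
const storedLastEntryDate = '1421211715000'; // 13 digits
const otherTimestamp = '999999999999';       // 12 digits, numerically smaller

// String comparison goes character by character: '9' sorts after '1',
// so the numerically smaller value is treated as the larger one.
const comparedAsStrings = otherTimestamp > storedLastEntryDate;

// Parsing first gives the intended numeric comparison.
const comparedAsNumbers =
  parseInt(otherTimestamp, 10) > parseInt(storedLastEntryDate, 10);

console.log(comparedAsStrings); // true
console.log(comparedAsNumbers); // false
```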

barisusakli commented 9 years ago

https://github.com/barisusakli/nodebb-plugin-rss/commit/e5c3947675d51d5a0c591823ff3c1e0685648001

Try with version 0.2.6

frenchja commented 9 years ago

Thanks. Testing now with two feeds in a throwaway category. While running ./nodebb dev, I noticed that whenever I press 'Save' for my feeds I get this error and NodeBB restarts:

error: TypeError: undefined is not a function
    at /home/user/node5/src/database/redis/sets.js:24:4
    at try_callback (/home/user/node5/node_modules/redis/index.js:573:9)
    at RedisClient.return_reply (/home/user/node5/node_modules/redis/index.js:661:13)
    at ReplyParser.<anonymous> (/home/user/node5/node_modules/redis/index.js:309:14)
    at ReplyParser.emit (events.js:95:17)
    at ReplyParser.send_reply (/home/user/node5/node_modules/redis/lib/parser/javascript.js:300:10)
    at ReplyParser.execute (/home/user/node5/node_modules/redis/lib/parser/javascript.js:189:22)
    at RedisClient.on_data (/home/user/node5/node_modules/redis/index.js:534:27)
    at Socket.<anonymous> (/home/user/node5/node_modules/redis/index.js:91:14)
    at Socket.emit (events.js:95:17)

barisusakli commented 9 years ago

What version of nodebb are you on?

frenchja commented 9 years ago

v0.6.0. Should be up to date with branch v0.6.x.

barisusakli commented 9 years ago

Should be fixed in version 0.2.7 of the plugin. Let me know.

frenchja commented 9 years ago

Seemed to be working fine when I had it on a 1 minute interval. Now I'm getting the 4 duplicated posts again.

./nodebb log

14/1 15:54 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]
14/1 15:56 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]

This discrepancy in the timestamp is interesting though: [screenshot, 2015-01-14 at 3:58:40 pm]

Not always the case though. Here is a feed just added from blog.rifftrax.com: [screenshot, 2015-01-14 at 4:02:43 pm]

barisusakli commented 9 years ago

Could be what I described earlier:

"So unless the feed entry.publishedDate is changing to a newer date there shouldn't be duplicates."

Maybe the entry was updated and its published date moved to a newer date, in which case it is reposted as a duplicate.

As for this log line:

14/1 15:54 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]

0.2.8 should fix this.

barisusakli commented 9 years ago

Can you also post the output of smembers nodebb-plugin-rss:feeds?

frenchja commented 9 years ago

SMEMBERS nodebb-plugin-rss:feeds
1) "https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new"
2) "https://www.reddit.com/r/theworstofnetflix/.rss?sort=new"
3) "http://interociter.movieholics.tv/feed/"
4) "http://feeds.feedburner.com/RiffTrax?format=xml"
5) "http://www.cinematicftp.com/feed/"
6) "http://www.mst3kinfo.com/?feed=rss2"

barisusakli commented 9 years ago

Is this problem happening on all feeds or just some of them?

pichalite commented 9 years ago

I have the same problem with all of the feeds that I have set up in the plugin. A lot of the time the duplicates are just topics with 0 posts.

barisusakli commented 9 years ago

Do you have the same error message in your logs? Try with the latest version of this plugin.

pichalite commented 9 years ago

I upgraded to the latest version of the plugin, added a new feed, set it to run at a 1-minute interval and set #Entries / Interval to 25. I see a lot of these errors.

Feed URL: http://www.mlbtraderumors.com/feed

I see duplicates and some topics with no content.

14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]

barisusakli commented 9 years ago

Try 0.2.9. That error prevents post spamming, but in the case of this plugin I reset the poster's lastposttime so they can post more than one topic. https://github.com/barisusakli/nodebb-plugin-rss/blob/master/index.js#L201

pichalite commented 9 years ago

Duplicates and empty topics increased since the recent updates.

frenchja commented 9 years ago

@barisusakli, the problematic feed seems to be http://www.mst3kinfo.com/?feed=rss2, with other feeds posted correctly this week, but I'll spend a few more days verifying this. I wondered if there's something different about certain feeds that causes problems with the JSON API, but that doesn't seem to be the case, looking at the JSON response.

library(jsonlite)
data <- fromJSON("https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&q=http://www.mst3kinfo.com/?feed=rss2")
data$responseData$feed$entries$title
[1] "Weekend Discussion Thread: Questions about the MSTed Movies"
[2] "New Short from RiffTrax…"                                   
[3] "This Date in MSTory"                                        
[4] "More Scholarly Study of MST3K"   

EDIT: Upon logging into the test category, the Reddit RSS feeds are all double posted too.

/unread appears to have only 1 unread notice per entry, which is weird. I'll spin up a Fedora instance and try to test the plugin on a fresh forum without other plugins. If it is a plugin problem, it'll be a bit difficult to test the main effect and interaction of so many plugins, so your insight might be helpful.

pichalite commented 9 years ago

@barisusakli, found an interesting bug. If I enable clustering and set NodeBB to run on 3 ports, then add an RSS feed set to pull 4 entries, it actually posts around 28 - 60 entries with a lot of duplicates. This doesn't happen if I just run NodeBB on 1 port.

barisusakli commented 9 years ago

That should be easy to fix. I'll look into it.

barisusakli commented 9 years ago

Above commit should fix the issue with clustering.
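
For context, a common way to avoid this kind of multiplication in a clustered setup is to let only one process run the scheduled job. The simulation below is a sketch under that assumption. The `isPrimary` flag and `runTick` function are hypothetical, and the thread does not say what the commit actually changes.

```javascript
// Hypothetical simulation: every process runs the same feed job on each tick.
function runTick(processes, entries, onlyPrimary) {
  const posted = [];
  for (const proc of processes) {
    // With the guard enabled, secondary processes skip the job entirely.
    if (onlyPrimary && !proc.isPrimary) continue;
    for (const entry of entries) {
      posted.push(`${proc.port}:${entry}`);
    }
  }
  return posted;
}

const processes = [
  { port: 4567, isPrimary: true },
  { port: 4568, isPrimary: false },
  { port: 4569, isPrimary: false },
];
const entries = ['entry-1', 'entry-2', 'entry-3', 'entry-4'];

console.log(runTick(processes, entries, false).length); // 12: each process posts all 4
console.log(runTick(processes, entries, true).length);  // 4: only the primary posts
```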

frenchja commented 9 years ago

Testing now. Hopefully this will fix the issue. Thanks!

barisusakli commented 9 years ago

@frenchja not sure if this fixes your problem. Were you running more than one NodeBB instance?

frenchja commented 9 years ago

@barisusakli Doesn't nodebb start multiple processes depending on the # of CPUs? I'm probably wrong and will dig into the architecture more.

barisusakli commented 9 years ago

Nah, it reads the "port" property from the config.json file. If it is an array, it spawns one NodeBB process on each of those ports. If you don't specify a port property, it reads the port from the url and spawns a single NodeBB. If the url doesn't have a port, it falls back to 4567.

https://github.com/NodeBB/NodeBB/blob/master/loader.js#L185 https://github.com/NodeBB/NodeBB/blob/master/loader.js#L141 https://docs.nodebb.org/en/latest/configuring/config.html
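
The lookup order described above can be sketched roughly as follows. This is an illustration only, not the code in loader.js: `resolvePorts` is a hypothetical name, and `config` stands in for the parsed config.json.

```javascript
const DEFAULT_PORT = 4567;

// Sketch of the described order: port array, single port, port in url, default.
function resolvePorts(config) {
  if (Array.isArray(config.port)) {
    return config.port.map(Number); // one process per listed port
  }
  if (config.port !== undefined) {
    return [Number(config.port)];
  }
  // URL#port is an empty string when the url carries no explicit port.
  const urlPort = config.url ? new URL(config.url).port : '';
  return [urlPort ? Number(urlPort) : DEFAULT_PORT];
}

console.log(resolvePorts({ url: 'http://localhost:4567', port: [4567, 4568, 4569] })); // three processes
console.log(resolvePorts({ url: 'http://localhost:8080' }));                           // single process on 8080
console.log(resolvePorts({ url: 'http://forum.example.com' }));                        // falls back to 4567
```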

pichalite commented 9 years ago

@barisusakli I removed the URL property in config.json yesterday to run a single NodeBB instance. The first run for RSS feeds worked fine, but after that it started posting duplicates again, 3 for each feed entry. I will try with the latest commit and see if anything changed.

frenchja commented 9 years ago

@barisusakli Just checking in. Is there anything odd about the settings below that might cause the behavior? [screenshot, 2015-02-04 at 12:37:55 pm]

Would it be possible to add a 'Debug' radio button that outputs the response of each function to a logger? Also, I've listed my npm ls here in case I've messed up a dependency.

pichalite commented 9 years ago

@barisusakli Works way better now with the latest fix for clusters. Doesn't post duplicates or topics without posts anymore.

barisusakli commented 9 years ago

@frenchja try with 0.2.12. It will print out a line whenever an entry is posted.

Should look like

[plugin-rss] posting, http://feedurl.rss - title: my topic title, published date: <date of publish here>

When you get duplicate posts, post your logs.

frenchja commented 9 years ago

6/2 21:03 [27679] - error: /
 Error: invalid csrf token
    at module.exports (/home/frenchja/node5/node_modules/csurf/node_modules/http-errors/index.js:32:16)
    at verifytoken (/home/frenchja/node5/node_modules/csurf/index.js:237:11)
    at Object.csrf [as applyCSRF] (/home/frenchja/node5/node_modules/csurf/index.js:100:7)
    at Object.middleware.buildHeader (/home/frenchja/node5/src/middleware/middleware.js:187:13)
    at /home/frenchja/node5/src/routes/index.js:197:15
    at Layer.handle [as handle_request] (/home/frenchja/node5/node_modules/express/lib/router/layer.js:82:5)
    at trim_prefix (/home/frenchja/node5/node_modules/express/lib/router/index.js:271:13)
    at /home/frenchja/node5/node_modules/express/lib/router/index.js:238:9
    at Function.proto.process_params (/home/frenchja/node5/node_modules/express/lib/router/index.js:313:12)
    at /home/frenchja/node5/node_modules/express/lib/router/index.js:229:12
6/2 21:12 [27679] - error: /
 Error: invalid csrf token
    at module.exports (/home/frenchja/node5/node_modules/csurf/node_modules/http-errors/index.js:32:16)
    at verifytoken (/home/frenchja/node5/node_modules/csurf/index.js:237:11)
    at Object.csrf [as applyCSRF] (/home/frenchja/node5/node_modules/csurf/index.js:100:7)
    at Object.middleware.buildHeader (/home/frenchja/node5/src/middleware/middleware.js:187:13)
    at /home/frenchja/node5/src/routes/index.js:197:15
    at Layer.handle [as handle_request] (/home/frenchja/node5/node_modules/express/lib/router/layer.js:82:5)
    at trim_prefix (/home/frenchja/node5/node_modules/express/lib/router/index.js:271:13)
    at /home/frenchja/node5/node_modules/express/lib/router/index.js:238:9
    at Function.proto.process_params (/home/frenchja/node5/node_modules/express/lib/router/index.js:313:12)
    at /home/frenchja/node5/node_modules/express/lib/router/index.js:229:12
6/2 22:34 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: ROWSDOWER! [OC], published date: Fri, 06 Feb 2015 19:40:23 -0800
7/2 07:32 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: Weekend Discussion Thread: MST3K-Themed Band Names/Songs, published date: Sat, 07 Feb 2015 05:30:49 -0800
7/2 08:08 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: Now Available from RiffTrax…, published date: Sat, 07 Feb 2015 05:57:55 -0800
7/2 13:44 [27679] - info: [plugin-rss] posting, http://interociter.movieholics.tv/feed/ - title: Interociter TV: Streaming Live for the Weekend, published date: Sat, 07 Feb 2015 11:28:07 -0800
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 17:00 [27679] - info: [user/jobs] Digest (day) scheduling completed.
7/2 17:00 [27679] - info: [emailer.mandrill] Sent `digest` email to uid 11
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:28 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://feeds.feedburner.com/RiffTrax?format=xml getaddrinfo ESRCH
7/2 18:29 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://www.cinematicftp.com/feed/ getaddrinfo ESRCH
7/2 18:30 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://www.mst3kinfo.com/?feed=rss2 getaddrinfo ESRCH
7/2 19:16 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: [Reference Explained] The origin of the Rommel joke., published date: Sat, 07 Feb 2015 14:27:36 -0800
7/2 19:16 [27679] - error: [plugins/solr] Could not index post 2163, error: HTTP status 503.Reason: {"responseHeader":{"status":503,"QTime":1749},"error":{"msg":"Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.","code":503}}
7/2 19:52 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: Did know that the director of Space Mutiny was also in West Side Story?, published date: Sat, 07 Feb 2015 17:06:09 -0800
7/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sat, 07 Feb 2015 21:01:36 -0800
8/2 07:51 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: MST3k Satellite News - Fall 1991, published date: Sun, 08 Feb 2015 01:30:04 -0800
8/2 10:15 [27679] - warn: Flooding detected! Calls : 21, Duration : 918
8/2 10:15 [27679] - warn: [socket.io] Too many emits! Disconnecting uid : 1. Message : topics.loadMore
8/2 17:00 [27679] - info: [user/jobs] Digest (day) scheduling completed.
8/2 17:00 [27679] - info: [emailer.mandrill] Sent `digest` email to uid 11
8/2 20:52 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: MST3k currently on TV -Retro TV, published date: Sun, 08 Feb 2015 18:11:37 -0800
8/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sun, 08 Feb 2015 21:01:44 -0800
9/2 09:29 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: TIL you can order autographed stills from Final Justice and other Greydon Clark movies. So there's that. (NSFW), published date: Mon, 09 Feb 2015 06:27:06 -0800

My logs are actually littered with the csrf token errors, but somehow I've never run into one myself.

barisusakli commented 9 years ago

7/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sat, 07 Feb 2015 21:01:36 -0800

8/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sun, 08 Feb 2015 21:01:44 -0800

So clearly the published date of the same article changes, and that is why it gets reposted. The only fix I can think of is to keep track of the titles posted per feed and skip any entry whose title has already been posted.
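
A minimal sketch of the per-feed title tracking idea, assuming an in-memory store. `postedTitles` and `shouldPost` are hypothetical names, and a real plugin would need to persist the titles in the database.

```javascript
// Hypothetical store: feed URL -> Set of titles already posted from that feed.
const postedTitles = new Map();

function shouldPost(feedUrl, title) {
  if (!postedTitles.has(feedUrl)) {
    postedTitles.set(feedUrl, new Set());
  }
  const titles = postedTitles.get(feedUrl);
  if (titles.has(title)) {
    return false; // same title already posted for this feed: skip it
  }
  titles.add(title);
  return true;
}

const feed = 'http://www.mst3kinfo.com/?feed=rss2';
console.log(shouldPost(feed, 'This Date in MSTory')); // true, first time seen
console.log(shouldPost(feed, 'This Date in MSTory')); // false, title already posted
```

The trade-off is that a recurring title is suppressed even when the entry behind it is genuinely new.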

frenchja commented 9 years ago

This must be a problem with the feed generating service for a few websites, then? I wonder if you can hash the content of the post as well? Some posts are weekly updates, thus have the same title but a different body.

pichalite commented 9 years ago

@barisusakli the fix for clusters works fine, but I am seeing the duplicate posts issue when I reload NodeBB.

barisusakli commented 9 years ago

Yeah maybe

myhash = hash(title) + hash(content)

and then only post an entry if myhash was not posted before. This would still cause double posts though if someone just updates something in the content.

The other alternative is to do the same as above but also check hash(title), and if that title was posted before, update that topic's content with the new content. Again, it would cause problems if you really want to post a new entry with the same title but new content.
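
The hashing idea above could look roughly like this. It is a sketch only: SHA-1 via Node's built-in crypto module is an arbitrary choice, and `entryHash`, `isNewEntry` and the in-memory `seenHashes` set are hypothetical.

```javascript
const { createHash } = require('crypto');

function sha1(text) {
  return createHash('sha1').update(text, 'utf8').digest('hex');
}

// myhash = hash(title) + hash(content), as in the comment above.
function entryHash(entry) {
  return sha1(entry.title) + sha1(entry.content);
}

// Hypothetical store of hashes already posted; a real plugin would persist it.
const seenHashes = new Set();

function isNewEntry(entry) {
  const hash = entryHash(entry);
  if (seenHashes.has(hash)) {
    return false; // identical title and content were posted before
  }
  seenHashes.add(hash);
  return true;
}

const monday = { title: 'This Date in MSTory', content: 'Events for Feb 7' };
const tuesday = { title: 'This Date in MSTory', content: 'Events for Feb 8' };

console.log(isNewEntry(monday));  // true
console.log(isNewEntry(monday));  // false, exact repeat
console.log(isNewEntry(tuesday)); // true, same title but different content
```

As noted above, any edit to the content changes the hash, so an updated entry would still be posted a second time.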

@pichalite latest commit might fix that problem.

barisusakli commented 9 years ago

@pichalite nvm I found the cause of that in nodebb https://github.com/NodeBB/NodeBB/issues/2714

pichalite commented 9 years ago

@barisusakli this issue can be closed, I think. I have been running the plugin for some time now and I don't see the duplicate posting anymore.

frenchja commented 9 years ago

@barisusakli did we decide on a method for determining unique post content?

barisusakli commented 9 years ago

I don't think there is a perfect solution, since the entries that come from the feed don't have a unique id and there is no way to tell whether an entry is new or an updated one (i.e. its content or title changed).

I might change the plugin so it doesn't post new topics if there is already a topic with the same title and instead update the content of that topic.
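
A rough sketch of that "update instead of repost" behaviour, using an in-memory map as a stand-in for the forum's topics. `topicsByTitle` and `postOrUpdate` are hypothetical and do not use NodeBB's API.

```javascript
// Hypothetical topic store keyed by title.
const topicsByTitle = new Map();

function postOrUpdate(entry) {
  const existing = topicsByTitle.get(entry.title);
  if (existing) {
    existing.content = entry.content; // same title: overwrite the content
    return 'updated';
  }
  topicsByTitle.set(entry.title, { title: entry.title, content: entry.content });
  return 'created';
}

console.log(postOrUpdate({ title: 'Weekly thread', content: 'first version' }));  // created
console.log(postOrUpdate({ title: 'Weekly thread', content: 'edited version' })); // updated
console.log(topicsByTitle.size); // 1
```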