Closed frenchja closed 7 years ago
I'm not sure why this happens. The plugin saves the lastEntry date into the database and, on the next cron tick, posts only entries that have a newer entry date. https://github.com/barisusakli/nodebb-plugin-rss/blob/master/index.js#L137-L156
So unless the feed entry.publishedDate is changing to a newer date there shouldn't be duplicates.
Is lastEntryDate a UNIX timestamp? Looking at the field value for HMGET "nodebb-plugin-rss:feed:https://www.reddit.com/r/theworstofnetflix/.rss" "lastEntryDate" gets me 1420160824000, which is Sun, 07 Feb 46973 13:46:40 GMT?
The same goes for a different entry, HMGET "nodebb-plugin-rss:feed:http://www.mst3kinfo.com/?feed=rss2" "lastEntryDate"
There are three extra 0s appended at the end, creating an incorrect timestamp.
1421211715 vs. 1421211715000
It is in milliseconds, I believe, so it would be Fri, 02 Jan 2015 01:07:04 GMT
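For anyone hitting the same confusion, a quick check in Node shows the difference between reading the stored value as milliseconds and as seconds. This is only a sketch; the `stored` variable name is made up for illustration, and the number is the one from the HMGET above.

```javascript
// Value read from Redis earlier in this thread.
const stored = 1420160824000;

// JavaScript's Date constructor takes milliseconds since the Unix epoch.
console.log(new Date(stored).toUTCString());

// Treating the same number as seconds (multiplying by 1000) lands
// tens of thousands of years in the future.
console.log(new Date(stored * 1000).getUTCFullYear());
```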
Ah, thanks for the clarification. My fault.
I'll make a commit. The timestamps are probably strings, so I need to parseInt them.
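A minimal sketch of what that could look like, assuming the stored value comes back from Redis as a string. `filterNewEntries` and its parameters are hypothetical names for illustration, not the plugin's actual code.

```javascript
// Hypothetical helper: keep only entries newer than the stored date.
// Redis hash fields come back as strings, so parse before comparing.
// (Compared as strings, '9' > '10' is true because the order is lexicographic.)
function filterNewEntries(entries, storedLastEntryDate) {
  // Fall back to 0 when nothing has been stored for this feed yet.
  const lastEntryDate = parseInt(storedLastEntryDate, 10) || 0;
  return entries.filter(function (entry) {
    return new Date(entry.publishedDate).getTime() > lastEntryDate;
  });
}
```

With a numeric comparison, an entry whose publishedDate has not moved forward is filtered out on the next cron tick.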
Thanks. Testing now with two feeds in a throwaway category. I did notice, using ./nodebb dev, that whenever I try to press 'Save' for my feeds I get this and NodeBB restarts:
error: TypeError: undefined is not a function
at /home/user/node5/src/database/redis/sets.js:24:4
at try_callback (/home/user/node5/node_modules/redis/index.js:573:9)
at RedisClient.return_reply (/home/user/node5/node_modules/redis/index.js:661:13)
at ReplyParser.<anonymous> (/home/user/node5/node_modules/redis/index.js:309:14)
at ReplyParser.emit (events.js:95:17)
at ReplyParser.send_reply (/home/user/node5/node_modules/redis/lib/parser/javascript.js:300:10)
at ReplyParser.execute (/home/user/node5/node_modules/redis/lib/parser/javascript.js:189:22)
at RedisClient.on_data (/home/user/node5/node_modules/redis/index.js:534:27)
at Socket.<anonymous> (/home/user/node5/node_modules/redis/index.js:91:14)
at Socket.emit (events.js:95:17)
What version of nodebb are you on?
v0.6.0. Should be up to date with branch v0.6.x.
Should be fixed in 0.2.7 version of the plugin. Let me know.
Seemed to be working fine when I had it on a 1 minute interval. Now I'm getting the 4 duplicated posts again.
./nodebb log
14/1 15:54 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]
14/1 15:56 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]
This discrepancy in the timestamp is interesting though:
Not always the case though. Here is a feed just added from blog.rifftrax.com:
Could be this
So unless the feed entry.publishedDate is changing to a newer date there shouldn't be duplicates.
Maybe the entry is updated and its published date is updating to a newer date, then it will be reposted as a duplicate.
14/1 15:54 [32444] - error: [[error:too-many-posts-newbie, 120, 3]]
0.2.8 should fix this.
Can you also post the output of smembers nodebb-plugin-rss:feeds
SMEMBERS nodebb-plugin-rss:feeds
1) "https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new"
2) "https://www.reddit.com/r/theworstofnetflix/.rss?sort=new"
3) "http://interociter.movieholics.tv/feed/"
4) "http://feeds.feedburner.com/RiffTrax?format=xml"
5) "http://www.cinematicftp.com/feed/"
6) "http://www.mst3kinfo.com/?feed=rss2"
Is this problem happening on all feeds or just some of them?
I have the same problem with all of the feeds that I have set up in the plugin. A lot of the time the duplicates are just topics with 0 posts.
Do you have the same error message in your logs? Try with the latest version of this plugin.
I upgraded to the latest version of the plugin. Added a new feed, set it to run at 1 minute interval and #Entries / Interval to 25. I see a lot of these errors.
Feed URL: http://www.mlbtraderumors.com/feed
I see duplicates and some topics with no content.
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22934] - error: [[error:too-many-posts, 10]]
14/1 19:00 [22935] - error: [[error:too-many-posts, 10]]
Try 0.2.9. That error prevents post spamming, but in the case of this plugin I reset the poster's lastposttime so they can post more than one topic. https://github.com/barisusakli/nodebb-plugin-rss/blob/master/index.js#L201
Duplicates and empty topics increased since the recent updates.
@barisusakli, the problematic feed seems to be http://www.mst3kinfo.com/?feed=rss2, with other feeds posted correctly this week, but I'll spend a few more days verifying this. I wondered if there's something different about certain feeds that causes problems with the JSON API, but that doesn't seem to be the case, looking at the JSON response.
library(jsonlite)
data <- fromJSON("https://ajax.googleapis.com/ajax/services/feed/load?v=1.0&q=http://www.mst3kinfo.com/?feed=rss2")
data$responseData$feed$entries$title
[1] "Weekend Discussion Thread: Questions about the MSTed Movies"
[2] "New Short from RiffTrax…"
[3] "This Date in MSTory"
[4] "More Scholarly Study of MST3K"
EDIT: Upon logging into the test category, the Reddit RSS feeds are all double posted too. /unread appears to only have 1 unread notice per entry, which is weird. I'll spin up a Fedora instance and try to test the plugin on a fresh forum without other plugins. If it is a plugin problem, it'll be a bit difficult to test the main effect and interaction of so many plugins, so your insight might be helpful.
@barisusakli, found an interesting bug. If I enable clustering and set NodeBB to run on 3 ports, when I add an RSS feed to pull 4 entries, it actually posts around 28 - 60 entries with a lot of duplicates. This doesn't happen if I just run NodeBB on 1 port.
That should be easy to fix. I'll look into it.
Above commit should fix the issue with clustering.
Testing now. Hopefully this will fix the issue. Thanks!
@frenchja not sure if this fixes your problem. Were you running more than one NodeBB instance?
@barisusakli Doesn't nodebb start multiple processes depending on the # of CPUs? I'm probably wrong and will dig into the architecture more.
Nah, it reads the "port" property from the config.json file. If it is an array, it creates one NodeBB on each of those ports. If you don't specify a port property, it reads the port from the url and spawns a single NodeBB. If the url doesn't have a port, it falls back to 4567.
https://github.com/NodeBB/NodeBB/blob/master/loader.js#L185 https://github.com/NodeBB/NodeBB/blob/master/loader.js#L141 https://docs.nodebb.org/en/latest/configuring/config.html
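To make the port resolution described above concrete, here is a rough sketch of the same decision order. `resolvePorts` is a hypothetical helper written only from the description in this comment; it is not the code in loader.js.

```javascript
// Hypothetical helper mirroring the description above, not NodeBB's loader.js.
// Returns the list of ports; one NodeBB process would be spawned per port.
function resolvePorts(config) {
  // An array of ports means one instance per port.
  if (Array.isArray(config.port)) {
    return config.port.map(Number);
  }
  // A single port value means a single instance.
  if (config.port !== undefined) {
    return [Number(config.port)];
  }
  // No port property: try the port in the url, else fall back to 4567.
  const urlPort = config.url ? new URL(config.url).port : '';
  return [urlPort ? Number(urlPort) : 4567];
}
```

So a config.json with "port": [4567, 4568, 4569] would give three instances, each of which may run the feed cron unless the plugin guards against it.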
@barisusakli I removed the URL property in config.json yesterday to run a single NodeBB instance. The first run for RSS feeds worked fine, but after that it started posting duplicates again, 3 for each feed entry. I will try with the latest commit and see if anything changed.
@barisusakli Just checking in. Is there anything odd about the settings below that might cause the behavior?
Would it be possible to add a 'Debug' radio button that outputs the response of each function to a logger? Also, I've listed my npm ls here in case I've messed up a dependency.
@barisusakli Works way better now with the latest fix for clusters. Doesn't post duplicates or topics without posts anymore.
@frenchja try with 0.2.12; it will print out a line whenever an entry is posted.
Should look like
[plugin-rss] posting, http://feedurl.rss - title: my topic title, published date: <date of publish here>
When you get duplicate posts post your logs.
6/2 21:03 [27679] - error: /
Error: invalid csrf token
at module.exports (/home/frenchja/node5/node_modules/csurf/node_modules/http-errors/index.js:32:16)
at verifytoken (/home/frenchja/node5/node_modules/csurf/index.js:237:11)
at Object.csrf [as applyCSRF] (/home/frenchja/node5/node_modules/csurf/index.js:100:7)
at Object.middleware.buildHeader (/home/frenchja/node5/src/middleware/middleware.js:187:13)
at /home/frenchja/node5/src/routes/index.js:197:15
at Layer.handle [as handle_request] (/home/frenchja/node5/node_modules/express/lib/router/layer.js:82:5)
at trim_prefix (/home/frenchja/node5/node_modules/express/lib/router/index.js:271:13)
at /home/frenchja/node5/node_modules/express/lib/router/index.js:238:9
at Function.proto.process_params (/home/frenchja/node5/node_modules/express/lib/router/index.js:313:12)
at /home/frenchja/node5/node_modules/express/lib/router/index.js:229:12
6/2 21:12 [27679] - error: /
Error: invalid csrf token
at module.exports (/home/frenchja/node5/node_modules/csurf/node_modules/http-errors/index.js:32:16)
at verifytoken (/home/frenchja/node5/node_modules/csurf/index.js:237:11)
at Object.csrf [as applyCSRF] (/home/frenchja/node5/node_modules/csurf/index.js:100:7)
at Object.middleware.buildHeader (/home/frenchja/node5/src/middleware/middleware.js:187:13)
at /home/frenchja/node5/src/routes/index.js:197:15
at Layer.handle [as handle_request] (/home/frenchja/node5/node_modules/express/lib/router/layer.js:82:5)
at trim_prefix (/home/frenchja/node5/node_modules/express/lib/router/index.js:271:13)
at /home/frenchja/node5/node_modules/express/lib/router/index.js:238:9
at Function.proto.process_params (/home/frenchja/node5/node_modules/express/lib/router/index.js:313:12)
at /home/frenchja/node5/node_modules/express/lib/router/index.js:229:12
6/2 22:34 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: ROWSDOWER! [OC], published date: Fri, 06 Feb 2015 19:40:23 -0800
7/2 07:32 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: Weekend Discussion Thread: MST3K-Themed Band Names/Songs, published date: Sat, 07 Feb 2015 05:30:49 -0800
7/2 08:08 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: Now Available from RiffTrax…, published date: Sat, 07 Feb 2015 05:57:55 -0800
7/2 13:44 [27679] - info: [plugin-rss] posting, http://interociter.movieholics.tv/feed/ - title: Interociter TV: Streaming Live for the Weekend, published date: Sat, 07 Feb 2015 11:28:07 -0800
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 13:44 [27679] - error: [[error:no-privileges]]
7/2 17:00 [27679] - info: [user/jobs] Digest (day) scheduling completed.
7/2 17:00 [27679] - info: [emailer.mandrill] Sent `digest` email to uid 11
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:25 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/theworstofnetflix/.rss?sort=new getaddrinfo ESRCH
7/2 18:26 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new getaddrinfo ESRCH
7/2 18:28 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://feeds.feedburner.com/RiffTrax?format=xml getaddrinfo ESRCH
7/2 18:29 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://www.cinematicftp.com/feed/ getaddrinfo ESRCH
7/2 18:30 [27679] - error: [[nodebb-plugin-rss:error]] Error pulling feed http://www.mst3kinfo.com/?feed=rss2 getaddrinfo ESRCH
7/2 19:16 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: [Reference Explained] The origin of the Rommel joke., published date: Sat, 07 Feb 2015 14:27:36 -0800
7/2 19:16 [27679] - error: [plugins/solr] Could not index post 2163, error: HTTP status 503.Reason: {"responseHeader":{"status":503,"QTime":1749},"error":{"msg":"Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.","code":503}}
7/2 19:52 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: Did know that the director of Space Mutiny was also in West Side Story?, published date: Sat, 07 Feb 2015 17:06:09 -0800
7/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sat, 07 Feb 2015 21:01:36 -0800
8/2 07:51 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: MST3k Satellite News - Fall 1991, published date: Sun, 08 Feb 2015 01:30:04 -0800
8/2 10:15 [27679] - warn: Flooding detected! Calls : 21, Duration : 918
8/2 10:15 [27679] - warn: [socket.io] Too many emits! Disconnecting uid : 1. Message : topics.loadMore
8/2 17:00 [27679] - info: [user/jobs] Digest (day) scheduling completed.
8/2 17:00 [27679] - info: [emailer.mandrill] Sent `digest` email to uid 11
8/2 20:52 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: MST3k currently on TV -Retro TV, published date: Sun, 08 Feb 2015 18:11:37 -0800
8/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sun, 08 Feb 2015 21:01:44 -0800
9/2 09:29 [27679] - info: [plugin-rss] posting, https://www.reddit.com/r/mst3k+bmovies/.rss?sort=new - title: TIL you can order autographed stills from Final Justice and other Greydon Clark movies. So there's that. (NSFW), published date: Mon, 09 Feb 2015 06:27:06 -0800
My logs are actually littered with the csrf token errors but I've never experienced it myself somehow.
7/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sat, 07 Feb 2015 21:01:36 -0800
8/2 23:04 [27679] - info: [plugin-rss] posting, http://www.mst3kinfo.com/?feed=rss2 - title: This Date in MSTory, published date: Sun, 08 Feb 2015 21:01:44 -0800
So clearly the published date of the same article changes, so it gets reposted. The only fix I can think of is to keep track of the titles posted per feed and skip any entry whose title was already posted.
This must be a problem with the feed generating service for a few websites, then? I wonder if you can hash the content of the post as well? Some posts are weekly updates, thus have the same title but a different body.
@barisusakli the fix for clusters works fine, but I am seeing the duplicate posts issue when I reload NodeBB.
Yeah, maybe myhash = hash(title) + hash(content) and then only post an entry if myhash was not posted before. This would still cause double posts, though, if someone just updates something in the content.
The other alternative is to do the same as above but also check hash(title), and if it was posted before, update that topic's content with the new content. Again, it would cause problems if you really want to post a new entry with the same title but new content.
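The two options above could be sketched roughly as follows. This is an illustration under assumptions, not what the plugin does: `classifyEntry` and the in-memory `seen` sets are hypothetical, SHA-1 is just one possible hash, and a real version would persist the hashes per feed.

```javascript
const crypto = require('crypto');

// Hash a string with SHA-1; any stable hash would do for this sketch.
function hash(text) {
  return crypto.createHash('sha1').update(String(text)).digest('hex');
}

// Decide what to do with a feed entry.
// `seen` is a hypothetical in-memory store: { full: Set, titles: Set }.
function classifyEntry(entry, seen) {
  const titleHash = hash(entry.title);
  const myhash = titleHash + hash(entry.content);

  // Same title and same content: already posted, so skip it.
  if (seen.full.has(myhash)) {
    return 'skip';
  }
  // Same title but different content: update the existing topic.
  // A title never seen before is posted as a new topic.
  const action = seen.titles.has(titleHash) ? 'update' : 'post';
  seen.full.add(myhash);
  seen.titles.add(titleHash);
  return action;
}
```

The weakness mentioned above still applies: a recurring column that reuses its title with a new body would be classified as 'update' rather than posted as a new topic.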
@pichalite latest commit might fix that problem.
@pichalite nvm I found the cause of that in nodebb https://github.com/NodeBB/NodeBB/issues/2714
@barisusakli this issue can be closed, I think. I have been running the plugin for some time now and I don't see the duplicate posting anymore.
@barisusakli did we decide on a method for determining unique post content?
I don't think there is a perfect solution, since the entries that come from the feed don't have a unique id and there is no way to tell whether an entry is new or is an updated one, i.e. its content or title changed.
I might change the plugin so it doesn't post new topics if there is already a topic with the same title and instead update the content of that topic.
New RSS subscriptions, when published to a category, duplicate each entry multiple times. This was after a fresh install and after the nodebb-plugin-rss hashes and keys were deleted from the RedisDB. Could this be a problem with cron? Interestingly, when an entry is deleted and purged, the duplicates are blank:
This might indicate that the content isn't duplicated, just the Topic.