Open pro-sumer opened 2 years ago
my guess is that the `<lastBuildDate>` is not being updated. if you look at the feed above, the `lastBuildDate` is older than the newest item's `pubDate`. maybe I made a mistake in how I'm using `lastBuildDate`, but I thought it was supposed to reflect whenever there's a change, ie a new item, in the feed.
https://cyber.harvard.edu/rss/rss.html#optionalChannelElements
feedsub checks if the feed's last date is the same in order to save on some CPU and bandwidth if the feed is very big. the fix is simple if i'm using `lastBuildDate` incorrectly.
https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L255-L258
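To illustrate why a stale `lastBuildDate` can hide new items, here is a minimal sketch of this kind of shortcut. It is not feedsub's actual code: `FeedSnapshot`, `shouldSkip`, and the sample dates are hypothetical and only model the idea of skipping a poll when the channel date is unchanged.

```typescript
// Hypothetical shape of the channel-level data a reader keeps between polls.
interface FeedSnapshot {
  lastBuildDate: string; // channel <lastBuildDate>
  newestPubDate: string; // <pubDate> of the newest item
}

// Skip parsing the items when the channel date has not changed since the last poll.
function shouldSkip(previous: FeedSnapshot | null, current: FeedSnapshot): boolean {
  return previous !== null && previous.lastBuildDate === current.lastBuildDate;
}

const firstFetch: FeedSnapshot = {
  lastBuildDate: 'Mon, 21 Feb 2022 08:00:00 GMT',
  newestPubDate: 'Mon, 21 Feb 2022 08:00:00 GMT',
};

// A new item was published, but the publisher never updated lastBuildDate.
const secondFetch: FeedSnapshot = {
  lastBuildDate: 'Mon, 21 Feb 2022 08:00:00 GMT',
  newestPubDate: 'Tue, 22 Feb 2022 09:30:00 GMT',
};

console.log(shouldSkip(null, firstFetch));        // false: the first poll is always processed
console.log(shouldSkip(firstFetch, secondFetch)); // true: the poll is skipped and the new item is missed
```

With a check like this, a feed that never updates its channel date looks unchanged forever, which would match the "all items at start, none afterwards" behaviour described in this issue.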
Yes, I noticed that exact issue in this particular feed and I'm currently experimenting with the `lastBuildDate` check disabled.
I have more feeds that don't update at all after the initial batch; I hope to find time to investigate those as well, now that I have inspected your code (and learned a few things from doing that!).
Another feed seems to fail because the `pubDate` (and `lastBuildDate`) field contains a date string (`Tue, 22 Feb 2022 15:00:24 CET`) that cannot be converted to a JavaScript `Date` object (it would be OK without the trailing `CET` I think?) causing `getItemDate` to return `Invalid Date`:
https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L274
The `sortOrder` then becomes `NaN` instead of a negative/positive number or zero:
https://github.com/fent/node-feedsub/blob/v0.7.8/src/feedsub.ts#L278
What can be done about this?
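The `NaN` sort order described above can be reproduced in isolation. This is only a sketch, not the code from feedsub: `byDate` is a hypothetical comparator, and the string `'not a date'` stands in for any date string the JavaScript engine cannot parse.

```typescript
// An unparseable date string yields an Invalid Date whose timestamp is NaN.
const invalid = new Date('not a date');
const valid = new Date('2022-02-22T14:00:24Z');

// A comparator that subtracts timestamps, as date sorts commonly do.
const byDate = (a: Date, b: Date): number => a.getTime() - b.getTime();

const order = byDate(invalid, valid);

console.log(Number.isNaN(invalid.getTime())); // true
console.log(Number.isNaN(order));             // true: neither "before" nor "after"
```

Because any arithmetic involving `NaN` produces `NaN`, a single invalid date is enough to make the comparison meaningless for every pair it takes part in.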
All these problematic feeds have relatively few entries. Would it be possible to make all these "optimisations" in node-feedsub optional and let node-newsemitter take care of only publishing new entries?
`CET`? that's not part of the spec https://www.ietf.org/rfc/rfc822.txt but maybe feedsub could have a fallback if parsing the date results in `NaN`
> All these problematic feeds have relatively few entries. Would it be possible to make all these "optimisations" in node-feedsub optional and let node-newsemitter take care of only publishing new entries?
for the feeds with `NaN` dates, try increasing the `maxHistory`. by default it's 10. so without being able to tell what is older than what, it'll compare some random (because sorting by `NaN` will be random I think?) set of 10 items, and see if any of them are not in the current history.
I'm already using `maxHistory` 999 instead of 10.
I'll try to investigate a bit further later this week (either by cleaning up feeds before feeding them to feedsub or by forking/modifying feedsub, but I'm not sure yet what would be the best approach).
> I'm already using `maxHistory` 999 instead of 10.
has that fixed the issues with the feeds with invalid dates?
> I'll try to investigate a bit further later this week (either by cleaning up feeds before feeding them to feedsub or by forking/modifying feedsub, but I'm not sure yet what would be the best approach).
i'm willing to remove the check for `lastBuildDate` if it's being used incorrectly. it's a very small optimization anyway, it wouldn't change behavior.
for the invalid (`NaN`) dates, using either `null` or `0` would be better, or the original date string. then at least the sorting comparison between items would be consistent
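A minimal sketch of the `0` fallback suggested here, to show how it keeps the comparison consistent. The `Item` shape, the `timeOrZero` helper, and the sample data are hypothetical and not part of feedsub.

```typescript
interface Item {
  title: string;
  pubDate: string;
}

// Hypothetical helper: timestamp in ms, or 0 when the date string cannot be parsed.
function timeOrZero(dateString: string): number {
  const t = new Date(dateString).getTime();
  return Number.isNaN(t) ? 0 : t;
}

const items: Item[] = [
  { title: 'b', pubDate: '2022-02-22T10:00:00Z' },
  { title: 'broken', pubDate: 'not a date' },
  { title: 'a', pubDate: '2022-02-23T10:00:00Z' },
];

// Newest first; items with unparseable dates sort last.
const sorted = [...items].sort((x, y) => timeOrZero(y.pubDate) - timeOrZero(x.pubDate));

console.log(sorted.map((i) => i.title)); // [ 'a', 'b', 'broken' ]
```

The comparator now always returns a real number, so the order no longer depends on how the engine treats `NaN`. The trade-off is that every item with a broken date is treated as the oldest item in the feed.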
No, 999 did unfortunately not help for the invalid dates.
I'm currently experimenting with `htmlparser2` (and `feed`), where I only check the `guid` of an item to see whether it is new or not. Skipping all the nice optimisations from `feedsub` seems to work better for these problematic feeds (so far).
Not blaming `feedsub` though, as these feeds are indeed invalid.
(and still using `feedsub` in other projects that luckily only work with valid feeds)
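The guid-only approach mentioned above could look roughly like the sketch below. It is an assumption about the general technique, not the commenter's actual code and not the `htmlparser2` API: `FeedItem` and `GuidTracker` are hypothetical names, and the items are assumed to be already parsed.

```typescript
interface FeedItem {
  guid: string;
  title: string;
}

// Remembers every guid seen so far and returns only the items not seen before.
class GuidTracker {
  private seen = new Set<string>();

  filterNew(items: FeedItem[]): FeedItem[] {
    const fresh = items.filter((item) => !this.seen.has(item.guid));
    for (const item of fresh) this.seen.add(item.guid);
    return fresh;
  }
}

const tracker = new GuidTracker();
const firstBatch = tracker.filterNew([
  { guid: '1', title: 'one' },
  { guid: '2', title: 'two' },
]);
const secondBatch = tracker.filterNew([
  { guid: '2', title: 'two' },
  { guid: '3', title: 'three' },
]);

console.log(firstBatch.length);               // 2
console.log(secondBatch.map((i) => i.guid));  // [ '3' ]
```

This ignores dates entirely, so invalid `pubDate` or stale `lastBuildDate` values cannot affect the result. The cost is that the set of seen guids grows without bound unless it is pruned.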
when i google "rss lastBuildDate" i get a bunch of results about rss libraries implementing this incorrectly as per the rss spec, including wordpress. i think it's safe to ignore this field
Hi guys, I've created a pull request to make this library more customizable. I've got exactly the same problem as you described here, and for my use case, I needed to get a value from the item by a unique key.
@fent This change is backward compatible. I'd appreciate it if someone could review, merge, and release it: https://github.com/fent/node-feedsub/pull/65
On a Raspberry Pi, I run a Node.js script that uses multiple node-feedsub instances, to fetch new items for several RSS feeds (every hour).
This is how I create instances for every RSS feed:
For each feed, node-feedsub fetches all current articles when I start the Node.js script. For some feeds, it then also fetches updates every hour. For others, it does not fetch any updates: it reports 0 new items every hour even though there are new items (checked by looking at the affected RSS feeds manually). If I then restart the script, it fetches all the missing articles at start, but again no updates after that.
Example of an affected feed: https://seths.blog/rss
What can I be doing wrong?
(Is there a limit to the number of instances, since some work and others don't?)