Closed: Sxderp closed this pull request 4 years ago.
Also fixes #428
Thank you for this PR. I have to find some time and recheck the corresponding code paths as I'm not sure whether that's the correct place for this fix.
Ping.
Thank you for reminding me :-)
I think this is a correct fix. However, I'd prefer to instead integrate it into `_pushFeedEntries` itself: replace the `found` set with a `seenIds` set.

> Future work: only include the entry with the newest updated timestamp. Although, feeds may already be ordered in this manner and as such further processing is not required (status: tbd).

At this point the feed is ordered "older" to "newer", according to the entry order in the RSS document (which is traditionally new on top), as this is the optimal insertion order if the feed has no timestamps. As the `new Map(...)` constructor keeps the latest item with a given key, this is expected to be the most recent item - or, at least, some good enough substitute. Honestly, I'm not sure whether we should expect a feed with duplicate item IDs to have reasonable timestamps on them... so I'd say this is good enough.
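For reference, a minimal sketch of that `Map` behavior (illustrative data, not Brief code):

```js
// With duplicate keys, the Map constructor keeps the value from the *last*
// pair, so in "older"-to-"newer" insertion order the newest entry wins.
const entries = [
  {providedID: 'a', title: 'old version'},
  {providedID: 'b', title: 'only version'},
  {providedID: 'a', title: 'new version'},
];
const deduped = [...new Map(entries.map(e => [e.providedID, e])).values()];
console.log(deduped.map(e => e.title)); // ['new version', 'only version']
```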
> At this point the feed is ordered "older" to "newer", according to the entry order in the RSS document

Correction: here I'm speaking about `_pushFeedEntries`. For your version in `_feedToEntries` it's still in document order, "newer" to "older", so keeping only the first one as you do also works.
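For clarity, a first-wins dedup along the lines of the `_feedToEntries` fix might look like this (a sketch with illustrative names, not the actual patch):

```js
// Keep only the first ("newest") entry for each providedID; entries
// without an ID pass through to URL-based matching.
function dedupeByProvidedID(entries) {
  const seen = new Set();
  return entries.filter(e => {
    if (!e.providedID) return true;            // no ID: handled by URL matching
    if (seen.has(e.providedID)) return false;  // duplicate ID: drop it
    seen.add(e.providedID);
    return true;
  });
}
```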
> I think this is a correct fix. However, I'd prefer to instead integrate it into `_pushFeedEntries` itself:

When debugging I thought about adding the change within `_pushFeedEntries` but decided against it since it's a significantly more complex section of code. And within `_feedToEntries` there was already a duplicate ID check, so it was an easier add-in.
If I'm understanding the suggested flow (see the sketch below):

* Replace `found` with a `seenIds` set (maybe `knownIds` conveys better meaning as a variable name).
* Filter Scan 2 with `e => !e.providedID`, as anything with a providedID has already been processed.
* Compute the leftovers with `!seenIds.has(e.providedID) && !found.has(e)`.

This should work, though no code has been written yet.

With this suggestion `found` is only used for "Scan 2 (URL Only)". Which is okay, but seems weird. A different variable name might be appropriate.
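A rough sketch of that flow (only `found`, `seenIds`, and the two filter expressions come from the discussion; `storedIds`, `storedUrls`, and the loop shape are stand-ins for illustration):

```js
function pushFeedEntriesSketch(entries, storedIds, storedUrls) {
  const seenIds = new Set(); // maybe `knownIds` conveys the meaning better
  const found = new Set();   // with this flow, only Scan 2 populates it

  // Scan 1: every entry with a providedID.
  for (const entry of entries.filter(e => e.providedID)) {
    if (storedIds.has(entry.providedID)) {
      seenIds.add(entry.providedID); // remember the ID, not the object
      // ...update the matching stored entry...
    }
  }

  // Scan 2 (URL only): anything with a providedID was already processed.
  for (const entry of entries.filter(e => !e.providedID)) {
    if (storedUrls.has(entry.entryURL)) {
      found.add(entry);
      // ...update the matching stored entry...
    }
  }

  // Leftovers become new items. A duplicate providedID can no longer leak
  // through, because membership is checked on the ID, not on the object.
  return entries.filter(e => !seenIds.has(e.providedID) && !found.has(e));
}
```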
> When debugging I thought about adding the change within `_pushFeedEntries` but decided against it since it's a significantly more complex section of code.
Yeah, I see, it's underdocumented. On the bright side, it used to be worse.
Let's merge it as it is and fix later.
> * When deciding which entries need to be processed for URL matching (Scan 2) change the [filter](https://github.com/brief-rss/brief/blob/490d0bbfb817b634d1757474aeb130eaa1dc483f/modules/database.js#L678) to be something like `e => !e.providedID`, as anything with a providedID has already been processed.
A weird undocumented special case: Brief has historical support for finding and updating an entry with a matching URL even if one of the entries (either the stored or the new one) has a `providedID` while the other one has none. When I started working on Brief this was documented to be a workaround for an unknown bug. The bug may be long gone, as the parsing backend has been changed radically, but so far I've kept this for special cases like feed generator configuration changes that cause IDs to appear and/or disappear.
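One way to express that special case as a predicate (a hypothetical helper, not Brief's actual code):

```js
// Fall back to URL matching unless *both* entries carry a providedID,
// in which case ID matching is authoritative.
function urlFallbackMatch(storedEntry, newEntry) {
  const bothHaveIds = Boolean(storedEntry.providedID && newEntry.providedID);
  return !bothHaveIds && storedEntry.entryURL === newEntry.entryURL;
}
```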
> * Aside: I think [this filter](https://github.com/brief-rss/brief/blob/490d0bbfb817b634d1757474aeb130eaa1dc483f/modules/database.js#L699) is redundant. Weren't the entries already filtered in the above for loop (see above link)?
Another weird special case: if a feed has no IDs and multiple entries with the same URL (which is a poor but technically legal combination), Brief tries to do something more or less reasonable, which in this case is "match stored entries with new entries by reverse file / database order and apply each updated entry to the next stored one". The inner filter stops being redundant if, for some URL, we have more than one entry already stored in the database.
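One way to picture that pairing (purely illustrative, hypothetical code):

```js
// Pair stored and incoming entries that share a URL in reverse order,
// so each updated entry is applied to the "next" stored one.
function pairByUrl(storedEntries, newEntries) {
  const stored = [...storedEntries].reverse();
  const fresh = [...newEntries].reverse();
  const pairs = [];
  for (let i = 0; i < Math.min(stored.length, fresh.length); i++) {
    pairs.push([stored[i], fresh[i]]); // apply fresh[i] to stored[i]
  }
  return pairs;
}
```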
Aside: I did this against master. Not sure if it should be against 2.6?
fixes #344
Previously, if there were duplicate providedIDs in a feed's refresh entry list, then a new item would be created in Brief. This behavior could still occur on fallback to using entryURL.
Future work: only include the entry with the newest updated timestamp. Although, feeds may already be ordered in this manner and as such further processing is not required (status: tbd).
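A hypothetical sketch of that future work, in case it helps (the `updated` field and the function name are illustrative, not part of this patch):

```js
// When providedIDs collide, keep the entry with the newest `updated`
// timestamp instead of relying on feed order.
function dedupeByNewestUpdated(entries) {
  const byId = new Map();
  const noId = [];
  for (const entry of entries) {
    if (!entry.providedID) {
      noId.push(entry); // ID-less entries are left for URL matching
      continue;
    }
    const prev = byId.get(entry.providedID);
    if (!prev || entry.updated > prev.updated) {
      byId.set(entry.providedID, entry);
    }
  }
  return [...byId.values(), ...noId];
}
```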
==
A little more detail on why the duplicates were occurring and why this fix works.
During "Scan 1: every entry with IDs provided" a map of providedID to entry is created. Which, inevitably, squashes the duplicate providedID. A lookup on all the providedIDs is made and if found the entry from the map is added to the "found" set.
Since this "found" set will only ever have one of the duplicated entries, when a difference on the "found" set and "entries" (everything) is made the duplicates appear as "leftovers" or new results.
By squashing the duplicates in the "entries" list (what this patch does) we avoid the leftover situation.
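A tiny reproduction of the pre-fix behavior (illustrative, not the actual database.js code):

```js
// Two feed entries share a providedID.
const entries = [
  {providedID: 'dup', title: 'first copy'},
  {providedID: 'dup', title: 'second copy'},
];

// Scan 1 builds a map keyed on providedID, which squashes the duplicate...
const byId = new Map(entries.map(e => [e.providedID, e]));
const found = new Set([byId.get('dup')]); // the stored entry matched 'dup'

// ...so the difference against `entries` leaves the other copy behind,
// and it would be inserted as a brand-new item.
const leftovers = entries.filter(e => !found.has(e));
console.log(leftovers); // [{providedID: 'dup', title: 'first copy'}]
```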