brief-rss / brief

RSS reader extension for Firefox

Squash entries with duplicate providedID on fetch #491

Closed Sxderp closed 4 years ago

Sxderp commented 4 years ago

aside: I did this against master. Not sure if it should be against 2.6?

fixes #344

Previously, if there were duplicate providedIDs in a feed's refresh entry list, a new item would be created in Brief for each duplicate. This behavior could still occur on fallback to using entryURL.

Future work: only include the entry with the newest updated timestamp. Although, feeds may already be ordered in this manner and as such further processing is not required (status: TBD).

==

A little more detail on why the duplicates were occurring and why this fix works.

During "Scan 1: every entry with IDs provided", a map of providedID to entry is created, which inevitably squashes the duplicate providedIDs. A lookup on all the providedIDs is made, and if an entry is found, the entry from the map is added to the "found" set.

Since this "found" set will only ever contain one of the duplicated entries, when the difference between the "found" set and "entries" (everything) is computed, the duplicates appear as "leftovers", i.e. new results.

By squashing the duplicates in the "entries" list (what this patch does) we avoid the leftover situation.
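The squashing described above can be sketched roughly as follows (a hypothetical standalone function and illustrative entry shape, not the literal patch in _feedToEntries):

```javascript
// Keep only the first entry for each providedID; entries without an ID
// are kept as-is and fall through to the entryURL-based matching.
function squashDuplicateIDs(entries) {
    const seen = new Set();
    return entries.filter(entry => {
        if (!entry.providedID) {
            return true; // No ID to compare on; keep the entry.
        }
        if (seen.has(entry.providedID)) {
            return false; // Duplicate providedID: squash it.
        }
        seen.add(entry.providedID);
        return true;
    });
}
```

With duplicates removed up front, the "found" vs "entries" difference can no longer surface a squashed duplicate as a leftover.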

Sxderp commented 4 years ago

Also fixes #428

tanriol commented 4 years ago

Thank you for this PR. I have to find some time and recheck the corresponding code paths as I'm not sure whether that's the correct place for this fix.

Sxderp commented 4 years ago

Ping.

tanriol commented 4 years ago

Thank you for reminding me :-)

I think this is a correct fix. However, I'd prefer to instead integrate it into _pushFeedEntries itself:

Future work: only include the entry with the newest updated timestamp. Although, feeds may already be ordered in this manner and as such further processing is not required (status: TBD).

At this point the feed is ordered "older" to "newer", according to the entry order in the RSS document (which is traditionally new on top) as this is the optimal insertion order if the feed has no timestamps. As the new Map(...) constructor keeps the latest item with the given key, this is expected to be the most recent item - or, at least, some good enough substitute. Honestly, I'm not sure whether we should expect a feed with duplicate item IDs to have reasonable timestamps on them... so I'd say this is good enough.
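The Map behavior relied on here is easy to demonstrate in isolation (the entry shapes below are illustrative, not Brief's actual objects): the Map constructor keeps the last value seen for a duplicate key.

```javascript
// With entries ordered "older" to "newer", the surviving value for a
// duplicated key is the newest one, since Map keeps the last write.
const feedEntriesOldToNew = [
    { providedID: 'id-1', title: 'older revision' },
    { providedID: 'id-1', title: 'newer revision' },
    { providedID: 'id-2', title: 'unrelated entry' },
];
const entriesByID = new Map(feedEntriesOldToNew.map(e => [e.providedID, e]));
// entriesByID has two keys; 'id-1' maps to the newer revision.
```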

tanriol commented 4 years ago

At this point the feed is ordered "older" to "newer", according to the entry order in the RSS document

Correction: here I'm speaking about _pushFeedEntries. For your version in _feedToEntries it's still in document order, "newer" to "older", so keeping only the first one as you do also works.

Sxderp commented 4 years ago

I think this is a correct fix. However, I'd prefer to instead integrate it into _pushFeedEntries itself:

When debugging I thought about adding the change within _pushFeedEntries but decided against it since it's a significantly more complex section of code. And within _feedToEntries there was already a duplicate ID check so it was an easier add-in.

If I'm understanding the suggested flow:

This should work, though no code has been written yet.

With this suggestion, found is only used for "Scan 2 (URL only)", which is okay but seems weird. A different variable name might be appropriate.

tanriol commented 4 years ago

When debugging I thought about adding the change within _pushFeedEntries but decided against it since it's a significantly more complex section of code.

Yeah, I see, it's underdocumented. On the bright side, it used to be worse.

Let's merge it as it is and fix later.

* When deciding which entries need to be processed for URL matching (Scan 2) change the [filter](https://github.com/brief-rss/brief/blob/490d0bbfb817b634d1757474aeb130eaa1dc483f/modules/database.js#L678) to be something like `e => !e.providedID`, as anything with a providedID has already been processed.

A weird undocumented special case: Brief has historical support for finding and updating an entry with a matching URL even if one of the entries (either the stored or the new one) has a providedID while the other one has none. When I started working on Brief this was documented to be a workaround for an unknown bug. The bug may be long gone as the parsing backend has been changed radically, but so far I've kept this for special cases like feed generator configuration changes that cause IDs to appear and/or disappear.
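The filter change from the quoted bullet could look roughly like this (a hypothetical standalone helper for illustration, not the actual database.js code, which also has to account for the special case just described):

```javascript
// Select only the entries that still need URL-based matching in Scan 2:
// anything carrying a providedID was already processed in Scan 1.
function entriesForURLScan(entries) {
    return entries.filter(e => !e.providedID);
}
```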

* Aside: I think [this filter](https://github.com/brief-rss/brief/blob/490d0bbfb817b634d1757474aeb130eaa1dc483f/modules/database.js#L699) is redundant. Weren't the entries already filtered in the above for loop (see above link)?

Another weird special case: if a feed has no IDs and multiple entries with the same URL (which is a poor but technically legal combination), Brief tries to do something more or less reasonable... which in this case is "match stored entries with new entries by reverse file / database order and apply each updated entry to the next stored one". The inner filter stops being redundant if for some URL we have more than one entry already stored in the database.
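A rough sketch of the "reverse order" pairing described here, under the assumption that it amounts to zipping the two lists back to front (hypothetical helper; Brief's real code works against the database, not plain arrays):

```javascript
// Pair stored entries sharing a URL with incoming updated entries,
// walking both lists in reverse order, so each update is applied to
// the "next" stored entry rather than repeatedly to the same one.
function pairByReverseOrder(storedEntries, newEntries) {
    const stored = [...storedEntries].reverse();
    const updates = [...newEntries].reverse();
    const pairs = [];
    for (let i = 0; i < Math.min(stored.length, updates.length); i++) {
        pairs.push([stored[i], updates[i]]);
    }
    return pairs;
}
```

If more than one stored entry shares the URL, each incoming entry lands on a distinct stored one, which is why the inner filter stops being redundant in that case.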