alexpdraper / reading-list

A Chrome/Firefox extension for saving pages to read later.
https://chrome.google.com/webstore/detail/lloccabjgblebdmncjndmiibianflabo

Unify links - auto discard duplicates #53

Open hklene opened 6 years ago

hklene commented 6 years ago

Some news sites track users by adding clutter around their links:

All open the same article; in fact, http://techstage.de/-3901561 alone will do. The same applies to YouTube, which sometimes appends playlist information or the player's pause position.

I'd like to suggest a way to detect such duplicates first. I imagine applying a custom regex to extract the article number: (techstage.de|heise.de)/.*-(?<id>\d+)(\.html)?

The tricky part is that you cannot just take the first or last group; you need a named capturing group: https://github.com/tc39/proposal-regexp-named-groups ... no idea if Firefox supports them or what would be necessary to make it support them. As an intermediate workaround, the group number to use could be recorded along with the regex.
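The extraction idea above could be sketched like this, assuming named capture groups are available (they are part of ES2018); the pattern list and the group name `id` are illustrative, not anything the extension currently has:

```javascript
// Configurable list of patterns; each must expose a named group "id".
// The single pattern below is just the example from this issue.
const patterns = [
  /(?:techstage\.de|heise\.de)\/.*-(?<id>\d+)(?:\.html)?/,
];

// Return the extracted ID for a URL, or null if no pattern matches.
function extractId(url) {
  for (const pattern of patterns) {
    const match = pattern.exec(url);
    if (match && match.groups && match.groups.id) {
      return match.groups.id; // non-empty ID found
    }
  }
  return null;
}
```

If named groups turn out to be unavailable, the same shape works with a plain numbered group plus a stored group index, as suggested above.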

This list of regexes would be configured in the options. The workflow would be: collect a few links, go to the options and add a new regex to the list, and ideally preview it against the links currently in the list (that is, show a table mapping URL to ID, sorted by ID, plus a summary of how many duplicate IDs were identified).
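The preview step could be a small pure function over the saved URLs; everything here (function name, row shape) is a hypothetical sketch, assuming the candidate regex uses a named group `id`:

```javascript
// Map each URL to its extracted ID under a candidate pattern and
// count how many IDs would be flagged as duplicates.
function previewPattern(urls, pattern) {
  const rows = urls
    .map((url) => {
      const match = pattern.exec(url);
      return { url, id: match && match.groups ? match.groups.id : null };
    })
    .filter((row) => row.id !== null)
    .sort((a, b) => a.id.localeCompare(b.id)); // table sorted by ID

  // Count occurrences per ID; any ID seen more than once is a duplicate.
  const counts = new Map();
  for (const row of rows) {
    counts.set(row.id, (counts.get(row.id) || 0) + 1);
  }
  const duplicates = [...counts.values()].filter((n) => n > 1).length;
  return { rows, duplicates };
}
```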

Once the regex is accepted and added to the list, the extension would automatically discard older links until all are unique, keeping the most recently added instance. When a new link is added, it would be checked against each regex in turn to see whether it yields a non-empty ID. The IDs could be held in an associative array mapping each ID to exactly one link; if an entry already exists for that ID, the old link is automatically discarded when the new one is added.
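The "newest instance wins" rule above amounts to a few lines around a `Map`; the entry shape and the extractor callback are assumptions for this sketch:

```javascript
// Keep at most one entry per extracted ID, newest wins.
// byId: Map from ID to the currently kept entry.
// Returns the displaced older entry (to be discarded), or null.
function addLink(byId, entry, extractId) {
  const id = extractId(entry.url);
  if (id === null) {
    return null; // no pattern matched; entry is kept as-is
  }
  const previous = byId.get(id) || null; // older duplicate, if any
  byId.set(id, entry);                   // newest instance replaces it
  return previous;                       // caller removes this from the list
}
```

The caller would then delete the returned older entry from the stored reading list.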


Once this is up and running, a follow-up could bring all links into a canonical form, so I can again check them against the history (#52). But maybe this is asking too much, and I'd be better off writing a Greasemonkey script to tame each news site individually so it never offers anything other than canonical links. The only problem with that is that YouTube and its ilk constantly rework their UI, breaking any Greasemonkey script that tries to keep up with them.

alexpdraper commented 6 years ago

I would recommend using the search and then manually deleting duplicates yourself. At this point I don’t think this is a big or common enough issue to justify an automated fix.