extractus / feed-extractor

Simplest way to read & normalize RSS/ATOM/JSON feed data
https://extractor-demos.pages.dev/feed-extractor
MIT License
163 stars 33 forks

Query and local targets of links missing #136

Closed WetHat closed 4 months ago

WetHat commented 4 months ago

The extractor removes the query part of links (see the attached feed). For example:

<link>
   https://jdhitsolutions.com/blog/books/9389/powershell-scripting-and-toolmaking/?utm_source=rss&amp;utm_medium=rss&amp;utm_campaign=powershell-scripting-and-toolmaking
</link>

is returned as

https://jdhitsolutions.com/blog/books/9389/powershell-scripting-and-toolmaking/

and also local link targets:

<link>
    https://gist.github.com/aeveltstra/94806a1230b8165f43e9b4e4dec9bacc#file-powershell-gui-aws-lambda-start-functions-ps1
</link>

is returned as

https://gist.github.com/aeveltstra/94806a1230b8165f43e9b4e4dec9bacc

While in this case the query is not essential, for other feeds it is. The missing local target may be a usability issue for large articles. Hence the link should always be returned in its complete form.

feed.zip

ndaidong commented 4 months ago

This library removes advertising parameters from URLs, while preserving all other parameters intact. If you are not a marketer, it should be fine :dancers:

Here's a list of parameters targeted for removal:

https://github.com/extractus/feed-extractor/blob/main/src/utils/linker.js#L21
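To illustrate the idea, here is a minimal sketch of how tracking parameters could be stripped with the standard `URL` API while keeping every other query parameter. The helper name `stripTrackingParams` and the short `TRACKING_PARAMS` list are assumptions made for this example; the library's actual list is the one in the linked `linker.js`, and its implementation may differ.

```javascript
// Hypothetical subset of tracking parameters, taken from the example in this issue.
// The real list used by the library lives in src/utils/linker.js.
const TRACKING_PARAMS = ['utm_source', 'utm_medium', 'utm_campaign']

// Hypothetical helper, not the library's own code:
// removes only the listed parameters and leaves all others untouched.
function stripTrackingParams (link) {
  const url = new URL(link)
  for (const name of TRACKING_PARAMS) {
    url.searchParams.delete(name)
  }
  return url.toString()
}

// A non-tracking parameter such as `id` survives the cleanup.
const cleaned = stripTrackingParams(
  'https://example.com/post/?utm_source=rss&utm_medium=rss&id=42'
)
console.log(cleaned)
```

With this approach a link whose query consists only of tracking parameters ends up with no query string at all, which matches the behavior reported in the first example of this issue.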

WetHat commented 4 months ago

@ndaidong Thanks for the feedback. No, I'm not a marketer, so removing advertising parameters is fine with me :-) However, removing local link targets (see second example) seems a bit too much.

ndaidong commented 4 months ago

@WetHat I see, the local link targets you mentioned are the hash portion of URLs. While they can be useful for scrolling to specific sections, we're removing them with the purify() method. This is because feed entries should ideally link to entire articles, not just a specific part.
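For readers unfamiliar with the hash part of a URL, the following sketch shows one way it could be dropped using the standard `URL` API. The function name `stripHash` is hypothetical and exists only for this illustration; it is not the library's `purify()` method, whose actual implementation may work differently.

```javascript
// Hypothetical helper for illustration only, not the library's purify().
// Clearing `hash` removes the fragment, including the leading '#'.
function stripHash (link) {
  const url = new URL(link)
  url.hash = ''
  return url.toString()
}

const withoutHash = stripHash('https://example.com/article#section-2')
console.log(withoutHash)
```

A caller who needs the fragment can still read it from the original feed XML before the link is normalized.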

WetHat commented 4 months ago

@ndaidong I completely agree with your point that feed entries should ideally link to entire articles, not just to specific sections. I've seen a fair share of really weird interpretations of the RSS idea where feed items were pointing to sections inside a large log-like article. However, these are corner cases. Sticking with purify() sounds acceptable.