Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

URLNormalizer - removeFragments should ignore SPA pages #1061

Open stejacob opened 2 months ago

stejacob commented 2 months ago

https://opensource.norconex.com/commons/lang/v2/apidocs/com/norconex/commons/lang/url/URLNormalizer.html?is-external=true#removeFragment--

The removeFragments option in the URL normalizer seems to be removing the pound sign (#) in URLs used in Single Page Application (SPA) schemes.

For example, the URL https://forces.ca/en/events/#/details/14742 should remain intact, even when the removeFragments option is applied.

I believe the fragment should only be removed if it appears at the end of the URL, after the last forward slash, like in https://forces.ca/en/career/emergency-medicine/#sec-training.
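To illustrate why a generic fragment rule catches both URLs, here is a minimal sketch using the JDK's `java.net.URI` rather than the Norconex `URLNormalizer`. The class name `FragmentDemo` and the helper `fragmentOf` are made up for this example. It only shows that, under standard URI parsing, everything after the first `#` counts as the fragment, including an SPA route.

```java
import java.net.URI;

public class FragmentDemo {

    // Hypothetical helper: returns the fragment as parsed by java.net.URI,
    // or null when the URL has no '#'.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        // SPA-style URL: the whole client-side route is the fragment.
        System.out.println(fragmentOf("https://forces.ca/en/events/#/details/14742"));
        // prints "/details/14742"

        // Classic in-page anchor.
        System.out.println(fragmentOf("https://forces.ca/en/career/emergency-medicine/#sec-training"));
        // prints "sec-training"
    }
}
```

Since both values are ordinary fragments to a URI parser, distinguishing them requires an extra rule such as "no forward slash after the `#`".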

Thank you.

essiembre commented 2 months ago

Hello Stephen!

I am marking this as a feature request to add it as an extra normalization option. In the meantime, you can achieve an equivalent result with replacements and a regular expression. Here is an example (untested):

<urlNormalizer class="GenericURLNormalizer">
  <normalizations>
    <!-- Your current normalizations here -->
  </normalizations>
  <replacements>
    <replace>
      <match>(.*?)(#[^\/]*)$</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>
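The pattern can be checked outside the crawler with plain `java.util.regex`. This is only a sketch: the class name `TrailingFragmentDemo` and the method `stripTrailingFragment` are invented for the example, and it assumes the replacement is applied in the manner of `replaceAll`, which may not match exactly how GenericURLNormalizer applies its replacements.

```java
import java.util.regex.Pattern;

public class TrailingFragmentDemo {

    // Same expression as in the <match> element above:
    // a '#' followed only by non-slash characters up to the end of the URL.
    private static final Pattern TRAILING_FRAGMENT =
            Pattern.compile("(.*?)(#[^\\/]*)$");

    // Hypothetical helper: keeps group 1 (everything before the trailing fragment).
    static String stripTrailingFragment(String url) {
        return TRAILING_FRAGMENT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // SPA route: a '/' follows the '#', so the pattern does not match.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/events/#/details/14742"));
        // prints "https://forces.ca/en/events/#/details/14742"

        // In-page anchor: no '/' after the '#', so the fragment is dropped.
        System.out.println(stripTrailingFragment(
                "https://forces.ca/en/career/emergency-medicine/#sec-training"));
        // prints "https://forces.ca/en/career/emergency-medicine/"
    }
}
```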

Does that work for you?

stejacob commented 1 month ago

Hello Pascal,

That approach should work.

We previously used a similar pattern, but we’ll give yours a try—it might be safer than the one we developed since we are not experts in Java regex. :)

Here’s the pattern we used:

`(.*)(#[^/]*$) $1`

Thank you very much, Pascal.

essiembre commented 4 weeks ago

3.1.0-SNAPSHOT was just released and now supports a new normalization rule: removeTrailingFragment. It behaves the same as removeFragment, except that it only treats a hash sign (#) as the start of a fragment if it appears after the last URL segment (/...).

Please give it a try and confirm.

stejacob commented 3 weeks ago

Excellent, Pascal. We will give it a try and let you know. Thank you very much.