Open stejacob opened 2 months ago
Hello Stephen!
I am marking this as a feature request to add it as an extra normalization option. In the meantime, you can achieve an equivalent with replacements and a regular expression. Here is an example (untested):
<urlNormalizer class="GenericURLNormalizer">
  <normalizations>
    <!-- Your current normalizations here -->
  </normalizations>
  <replacements>
    <replace>
      <match>(.*?)(#[^\/]*)$</match>
      <replacement>$1</replacement>
    </replace>
  </replacements>
</urlNormalizer>
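To sanity-check the expression outside the crawler, here is a minimal Java sketch that applies the same pattern to the two URLs from this issue. The class and method names are hypothetical, and it assumes the replacement is applied the way `Matcher.replaceAll` would apply it, which may not match exactly how the normalizer runs replacements.

```java
import java.util.regex.Pattern;

class TrailingFragmentRegexDemo {
    // Same expression as the <match> element above, escaped for a Java string literal.
    private static final Pattern TRAILING_FRAGMENT = Pattern.compile("(.*?)(#[^\\/]*)$");

    // Hypothetical helper: keeps group 1, i.e. everything before a trailing "#...".
    static String stripTrailingFragment(String url) {
        return TRAILING_FRAGMENT.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        // A "/" follows the "#", so the pattern does not match and the URL is unchanged.
        System.out.println(stripTrailingFragment("https://forces.ca/en/events/#/details/14742"));
        // prints https://forces.ca/en/events/#/details/14742

        // No "/" follows the "#", so the fragment is dropped.
        System.out.println(stripTrailingFragment("https://forces.ca/en/career/emergency-medicine/#sec-training"));
        // prints https://forces.ca/en/career/emergency-medicine/
    }
}
```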
Does that work for you?
Hello Pascal,
That approach should work.
We previously used a similar pattern, but we’ll give yours a try—it might be safer than the one we developed since we are not experts in Java regex. :)
Here’s the pattern we used:
`
`
Thank you very much, Pascal.
3.1.0-SNAPSHOT was just released and now supports a new normalization rule: removeTrailingFragment. It behaves the same as removeFragment, except that it only treats a hash sign (#) as the start of a fragment when it appears after the last URL segment (/...).
Please give it a try and confirm.
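For readers who want to see the rule spelled out, here is a rough Java sketch of the behavior as described in this thread, written without a regular expression. It is an illustration only, not the Norconex implementation; the class and method names are hypothetical, and it assumes "after the last URL segment" means "after the last forward slash".

```java
class TrailingFragmentSketch {
    // Hypothetical helper, NOT the library's code: drops "#..." only when the
    // hash sign appears after the last "/" in the URL.
    static String removeTrailingFragment(String url) {
        int lastSlash = url.lastIndexOf('/');
        // Look for a hash sign only in the part that follows the last slash.
        int hash = url.indexOf('#', lastSlash + 1);
        return hash >= 0 ? url.substring(0, hash) : url;
    }

    public static void main(String[] args) {
        // The hash sign sits before later path segments, so it is kept.
        System.out.println(removeTrailingFragment("https://forces.ca/en/events/#/details/14742"));
        // The hash sign follows the last slash, so the fragment is removed.
        System.out.println(removeTrailingFragment("https://forces.ca/en/career/emergency-medicine/#sec-training"));
    }
}
```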
Excellent, Pascal. We will give it a try and let you know. Thank you very much.
https://opensource.norconex.com/commons/lang/v2/apidocs/com/norconex/commons/lang/url/URLNormalizer.html?is-external=true#removeFragment--
The removeFragment option in the URL normalizer seems to be removing the pound sign (#) in URLs used in Single Page Application (SPA) schemes.
For example, the URL https://forces.ca/en/events/#/details/14742 should remain intact, even when the removeFragment option is applied.
I believe the fragment should only be removed if it appears at the end of the URL, after the last forward slash, like in https://forces.ca/en/career/emergency-medicine/#sec-training.
Thank you.