Closed SolSearch closed 5 years ago
Thank you for your feedback.
If you know all your URLs are linked somewhere in their lowercase form, then the simplest is to exclude those with any uppercase letter in them. You can use a filter to this effect. Add this to your crawler config:
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
.*\p{Upper}.*
</filter>
If you have non-ASCII uppercase characters, you can use this regex instead: .*\p{javaUpperCase}.*
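To see how the two patterns differ, here is a small standalone Java sketch. It is not Norconex code: the class and method names are made up for illustration, and it assumes the filter matches the whole reference, as the surrounding `.*` suggests.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Norconex: it only exercises the two regexes.
final class UppercaseUrlCheck {
    // \p{Upper} is a POSIX class; by default it only matches ASCII A-Z.
    static final Pattern ASCII_UPPER = Pattern.compile(".*\\p{Upper}.*");
    // \p{javaUpperCase} follows Character.isUpperCase, so accented capitals match too.
    static final Pattern ANY_UPPER = Pattern.compile(".*\\p{javaUpperCase}.*");

    static boolean matchesAsciiRule(String url) {
        return ASCII_UPPER.matcher(url).matches();
    }

    static boolean matchesUnicodeRule(String url) {
        return ANY_UPPER.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/about.html",
            "http://example.com/About.html",
            "http://example.com/\u00c9cole.html" // capital E with acute accent
        };
        for (String url : urls) {
            System.out.println(url + " ascii=" + matchesAsciiRule(url)
                    + " unicode=" + matchesUnicodeRule(url));
        }
    }
}
```

With `onMatch="exclude"`, any URL for which the chosen pattern matches would be rejected.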
If some pages are only referenced by their uppercase version on your site, the above will reject pages you normally want. Another approach would be to normalize the URL to be lowercase. Unfortunately, the GenericURLNormalizer does not provide normalization rules for changing the case past the hostname portion of a URL. We can make it a feature request to add such a rule if you want. In the meantime, you can write your own IURLNormalizer implementation, or you can use this inelegant approach:
<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
<replacements>
<replace><match>A</match><replacement>a</replacement></replace>
<replace><match>B</match><replacement>b</replacement></replace>
<replace><match>C</match><replacement>c</replacement></replace>
<replace><match>D</match><replacement>d</replacement></replace>
<replace><match>E</match><replacement>e</replacement></replace>
<replace><match>F</match><replacement>f</replacement></replace>
...
</replacements>
</urlNormalizer>
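If you go the custom-implementation route instead, the core logic could look roughly like the sketch below. It is a standalone illustration using only `java.net.URI`; the class and method names are hypothetical, and wiring it into an IURLNormalizer implementation is left out.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper: lowercases only the path, leaving host, query and fragment as-is.
final class LowercasePathSketch {
    static String lowerCasePath(String url) {
        // URI.create throws IllegalArgumentException on malformed input (e.g. spaces).
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String authority = uri.getRawAuthority();
        String path = uri.getRawPath();
        // Relative or opaque URIs (e.g. mailto:) are returned untouched.
        if (scheme == null || authority == null || path == null) {
            return url;
        }
        StringBuilder out = new StringBuilder();
        out.append(scheme).append("://").append(authority);
        out.append(path.toLowerCase(Locale.ROOT));
        if (uri.getRawQuery() != null) {
            out.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(lowerCasePath(
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp"));
    }
}
```

Unlike the letter-by-letter replacements, this leaves the query string alone, so parameter values that are case-sensitive are not altered.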
In any case, be careful about changing the character case of URLs. According to URL specifications, characters after the hostname are case-sensitive. Forcing everything to lowercase will not give expected results on all sites and/or pages. I encourage you to contact the site owner so URLs are consistently referenced.
Let me know how that goes.
I have now tried using the filter to exclude the uppercase URLs and, as you mentioned above, it rejected some of the uppercase URLs that I would normally want. Can I make a feature request to add a rule to the GenericURLNormalizer to change the case past the hostname? Can you also provide any rules that I could add to ignore the URLs with parameters? If available, I will see whether it also excludes URLs that have no corresponding parameter-free link anywhere. Meanwhile, as you suggested, I am going to ask the site owner to make the URLs consistently lowercase.
Thanks.
One way to ignore URLs with parameters is to filter them out like this:
<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
.*\?.*
</filter>
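Keep in mind that this filter drops matching URLs entirely; it does not rewrite them to their parameter-free form. The standalone Java sketch below contrasts the two behaviours. The names are hypothetical and the stripping logic is a simplification, not how Norconex implements anything.

```java
import java.util.regex.Pattern;

// Hypothetical helper contrasting "reject URLs with a query" and "strip the query".
final class QueryStringSketch {
    // Same regex as in the filter: matches any URL containing a '?'.
    static final Pattern HAS_QUERY = Pattern.compile(".*\\?.*");

    static boolean hasQuery(String url) {
        return HAS_QUERY.matcher(url).matches();
    }

    // Cuts everything from the first '?' onward (this also drops any fragment after it).
    static String stripQuery(String url) {
        int i = url.indexOf('?');
        return i < 0 ? url : url.substring(0, i);
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.asp?utm_source=web";
        System.out.println("rejected by filter: " + hasQuery(url));
        System.out.println("stripped form: " + stripQuery(url));
    }
}
```

If a page is only ever linked with parameters, the filter would leave it uncrawled, whereas stripping would still reach it.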
I am marking it as a feature request to add a rule to convert URL paths to lowercase.
Thanks, very much appreciated. Can I also ask you if the crawler has an option to generate the summary / abstract of the document?
There is no built-in summary generator. There is TitleGeneratorTagger, which may help, but it is quite limited.
You can create your own by implementing an IDocumentTransformer. You can also use an ExternalTagger if you have an external process that can do it (both are found in the Importer module).
Are you using the collector with a search engine? If so, search engines usually offer runtime summary generation, which can be influenced by the user query and highlight matching terms.
For some of the sites, I am able to cherry pick the contents within a tag and use them as summary but for other sites I would have to ask the developer to add some tags around the contents.
Yes, I am using the collector with Solr. Does it offer runtime summary generation? I have been searching for documentation but haven't found any yet. Do you have any other information on how to do this with Solr?
Thanks again.
Solr highlighting can pull the most relevant fragments for you. Have a look here: https://lucene.apache.org/solr/guide/7_2/highlighting.html
Thanks very much. I have looked at it and have now implemented it successfully.
No problem!
Forgot about the feature request... re-opening!
Hi Pascal,
We need the same feature too: converting the whole URL to lowercase. Has that been added?
Thanks
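The latest snapshot now offers new rules for URL lowercase conversion as well as getting rid of query strings. The following are the new normalization rules added to GenericURLNormalizer: removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues.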
Hi Pascal,
Appreciate this. Will test this out.
Thanks -yogesh
On Mar 31, 2019, at 12:39 AM, Pascal Essiembre notifications@github.com wrote:
The latest snapshot now offers new rules for URL lowercase conversion as well as getting rid of query strings. The following are the new normalization rules added to GenericURLNormalizer:
removeQueryString, lowerCase, lowerCasePath, lowerCaseQuery, lowerCaseQueryParameterNames, lowerCaseQueryParameterValues
Hi Pascal, I am using 2.8.2-SNAPSHOT and not getting the right result for removeQueryString. I am getting duplicates for:
http://intranet.corp.internal.mycompany.com/about/company_helping_company.asp?credocelebration
http://intranet.corp.internal.mycompany.com/about/company_helping_company.asp?credoweek2016
The last comment is a duplicate of #594. Closing since the original issue has been resolved for some time.
Hello,
I am using the Norconex collector 2.8.0 to crawl my web sites. It is a great product, and thank you for making it available as open source.
I want to have just one case-insensitive entry with no parameters in my core. But I am seeing the following multiple URLs in the core:
http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp?utm_campaign=canadaen&utm_medium=web&utm_source=ccoweb&utm_content=&utm_term
http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp?utm_campaign=canadaen&utm_medium=web&utm_source=ccoweb&utm_content=&utm_term
http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp
http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp
How can I force the crawler to crawl just http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp?
Thank You.