Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

multiple index entries for the same url #479

Closed SolSearch closed 5 years ago

SolSearch commented 6 years ago

Hello,

I am using the Norconex collector 2.8.0 to crawl my web sites. It is a great product and thank you for making it available open source.

I want to have just one case-insensitive entry with no parameters in my core. But I am seeing the following multiple URLs in the core:

1. http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp?utm_campaign=canadaen&utm_medium=web&utm_source=ccoweb&utm_content=&utm_term

2. http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp?utm_campaign=canadaen&utm_medium=web&utm_source=ccoweb&utm_content=&utm_term

3. http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp

4. http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp

How can I force the crawler to crawl just http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp?

Thank You.

essiembre commented 6 years ago

Thank you for your feedback.

If you know all your URLs are linked somewhere in their lowercase form, then the simplest approach is to exclude those with any uppercase letter in them. You can use a filter to this effect. Add this to your crawler config:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
.*\p{Upper}.*
</filter>

If you have non-ASCII uppercase characters, you can use this regex instead: .*\p{javaUpperCase}.*.
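To see which references those two expressions would reject, they can be tried outside the crawler. This is only a sketch: it assumes the filter requires the whole reference to match (which the leading and trailing `.*` suggest), and the class name, the method name, and the `example.com` URL are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of the Norconex API.
class UpperCaseFilterDemo {
    // The two expressions from the comment above.
    static final Pattern ASCII_UPPER = Pattern.compile(".*\\p{Upper}.*");
    static final Pattern ANY_UPPER = Pattern.compile(".*\\p{javaUpperCase}.*");

    // True when the whole URL matches, i.e. when a filter with
    // onMatch="exclude" would be expected to reject it.
    static boolean isExcluded(Pattern pattern, String url) {
        return pattern.matcher(url).matches();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp",
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp",
            // Made-up URL whose only uppercase letter is non-ASCII (U+00C9).
            "http://example.com/\u00C9conomie");
        for (String url : urls) {
            System.out.println("ascii=" + isExcluded(ASCII_UPPER, url)
                    + " any=" + isExcluded(ANY_UPPER, url) + "  " + url);
        }
    }
}
```

In Java, `\p{Upper}` only covers A to Z unless Unicode character classes are enabled, which is why the third URL needs `\p{javaUpperCase}` to be caught.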

If some pages are only referenced with their uppercase version on your site, the above will reject pages you normally want. Another approach would be to normalize the URL to be lowercase. Unfortunately, the GenericURLNormalizer does not provide normalization rules for changing the case past the hostname portion of a URL. We can make it a feature request to add such a rule if you want, you can write your own IURLNormalizer implementation, or you can use this non-elegant approach:

<urlNormalizer class="com.norconex.collector.http.url.impl.GenericURLNormalizer">
  <replacements>
    <replace><match>A</match><replacement>a</replacement></replace>
    <replace><match>B</match><replacement>b</replacement></replace>
    <replace><match>C</match><replacement>c</replacement></replace>
    <replace><match>D</match><replacement>d</replacement></replace>
    <replace><match>E</match><replacement>e</replacement></replace>
    <replace><match>F</match><replacement>f</replacement></replace>
    ...
  </replacements>
</urlNormalizer>
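For the "write your own" option, the core logic could look roughly like the sketch below. It is plain Java with no Norconex dependency: the class and method names are hypothetical, wiring it into an actual IURLNormalizer implementation is left out, and it assumes you want the query string (if any) left untouched.

```java
import java.util.Locale;

// Hypothetical helper; the lowercasing logic only, not a Norconex class.
class LowerCaseUrlSketch {
    // Lowercases everything before the first '?' and keeps the rest as-is.
    // Locale.ROOT avoids locale-specific case mappings (e.g. Turkish i).
    static String lowerCaseBeforeQuery(String url) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url.toLowerCase(Locale.ROOT);
        }
        return url.substring(0, q).toLowerCase(Locale.ROOT) + url.substring(q);
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseBeforeQuery(
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_Part1Canada_Partie1AuCanada-eng.asp"));
    }
}
```

Both mixed-case and lowercase variants of a page then collapse to the same reference, provided the server really does treat them as the same resource.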

In any case, be careful about changing the character case of URLs. According to URL specifications, characters after the hostname are case-sensitive. Forcing everything to lowercase will not give expected results on all sites and/or pages. I encourage you to contact the site owner so URLs are consistently referenced.

Let me know how that goes.

SolSearch commented 6 years ago

I have now tried using the filter to exclude the uppercase URLs and, as you mentioned above, it rejected some of the uppercase URLs that I would normally want. Can I make a feature request to add a rule to the GenericURLNormalizer to change the case past the hostname? Can you also provide any rules that I could add to ignore the URLs with parameters? If available, I will see if it also excludes some URLs that do not have a corresponding link anywhere without the parameter. Meanwhile, as you suggested, I am going to ask the site owner to make the URLs consistently lowercase.

Thanks.

essiembre commented 6 years ago

One way to ignore URLs with parameters is to filter them out like this:

<filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="exclude">
.*\?.*
</filter>
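Applied to the URLs from the original question, that expression behaves as sketched below. As before, this is a stand-alone check under the assumption that the filter needs the whole reference to match; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone check; not part of the Norconex API.
class QueryStringFilterDemo {
    // Matches any reference containing a literal question mark.
    static final Pattern HAS_QUERY = Pattern.compile(".*\\?.*");

    // True when a filter with onMatch="exclude" would be expected to
    // reject the URL.
    static boolean isExcluded(String url) {
        return HAS_QUERY.matcher(url).matches();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp"
                + "?utm_campaign=canadaen&utm_medium=web&utm_source=ccoweb&utm_content=&utm_term",
            "http://www.cco-bcc.gc.ca/news-nouvelles/1044_part1canada_partie1aucanada-eng.asp");
        for (String url : urls) {
            System.out.println(isExcluded(url) + "  " + url);
        }
    }
}
```

Note that this drops the parameterized URL entirely rather than rewriting it, so a page that is only ever linked with parameters would not be crawled.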

I am marking it as a feature request to add a rule to convert URL paths to lowercase.

SolSearch commented 6 years ago

Thanks, very much appreciated. Can I also ask you if the crawler has an option to generate the summary / abstract of the document?

essiembre commented 6 years ago

There is no built-in summary generator. There is TitleGeneratorTagger, which may help, but it is quite limited.

You can create your own by implementing an IDocumentTransformer. You can also use an ExternalTagger if you have an external process that can do it (both found in the Importer module).

Are you using the collector with a search engine? If so, search engines usually offer runtime summary generation, which can be influenced by the user query and highlight matching terms.

SolSearch commented 6 years ago

For some of the sites, I am able to cherry pick the contents within a tag and use them as summary but for other sites I would have to ask the developer to add some tags around the contents.

Yes, I am using the collector with Solr. Does it offer runtime summary generation? I have been searching for documentation but haven't found any yet. Do you have any other info on how to do this with Solr?

Thanks again.

essiembre commented 6 years ago

Solr highlighting can pull the most relevant fragments for you. Have a look here: https://lucene.apache.org/solr/guide/7_2/highlighting.html
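As a rough illustration, highlighting is requested through query parameters on a normal search request. The sketch below only builds such a URL; the host, the core name `mycore`, the field name `content`, and the fragment size are assumptions to adapt to your own setup, and the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a Solr select URL with highlighting on.
class SolrHighlightQuerySketch {
    static String buildUrl(String coreUrl, String userQuery, String field) {
        String q = URLEncoder.encode(userQuery, StandardCharsets.UTF_8);
        String fl = URLEncoder.encode(field, StandardCharsets.UTF_8);
        return coreUrl + "/select?q=" + q
                + "&hl=true"            // turn highlighting on
                + "&hl.fl=" + fl        // field to take fragments from
                + "&hl.snippets=1"      // fragments per field
                + "&hl.fragsize=150";   // approximate fragment length
    }

    public static void main(String[] args) {
        System.out.println(buildUrl(
            "http://localhost:8983/solr/mycore", "part 1 canada", "content"));
    }
}
```

The fragments come back in a separate highlighting section of the response and can be shown as the result summary. The field generally needs to be stored for this to work.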

SolSearch commented 6 years ago

Thanks very much. I have looked at it and have now implemented it successfully.

essiembre commented 6 years ago

No problem!

essiembre commented 6 years ago

Forgot about the feature request... re-opening!

dtcyad1 commented 5 years ago

Hi Pascal,

We need the same feature too: converting the whole URL to lowercase. Has that been added?

Thanks

essiembre commented 5 years ago

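The latest snapshot now offers new rules for URL lowercase conversion as well as getting rid of query strings. The following are the new normalization rules added to GenericURLNormalizer:

removeQueryString
lowerCase
lowerCasePath
lowerCaseQuery
lowerCaseQueryParameterNames
lowerCaseQueryParameterValues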

dtcyad1 commented 5 years ago

Hi Pascal,

Appreciate this. Will test this out.

Thanks -yogesh

On Mar 31, 2019, at 12:39 AM, Pascal Essiembre notifications@github.com wrote:

The latest snapshot now offers new rules for URL lowercase conversion as well as getting rid of query strings. The following are the new normalization rules added to GenericURLNormalizer:

removeQueryString
lowerCase
lowerCasePath
lowerCaseQuery
lowerCaseQueryParameterNames
lowerCaseQueryParameterValues

shradhatx commented 5 years ago

Hi Pascal, I am using 2.8.2-SNAPSHOT and not getting the right result for removeQueryString. I am getting duplicates for:

http://intranet.corp.internal.mycompany.com/about/company_helping_company.asp?credocelebration

http://intranet.corp.internal.mycompany.com/about/company_helping_company.asp?credoweek2016

essiembre commented 5 years ago

The last comment is a duplicate of #594. Closing since the original issue has been resolved for some time.