Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Multiple start URLs #541

Closed dtcyad1 closed 5 years ago

dtcyad1 commented 5 years ago

Hi,

From my main site, http://www.test.com, I only need to crawl certain parts, so I configured the start URLs like this:

<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
  <url>https://www.test.com/deptA</url>
  <url>https://www.test.com/deptD</url>
</startURLs>

To get all the documents that follow this pattern, I added:

<preParseHandlers>
  <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
          onMatch="include" field="document.reference">
    https://www.test.com/deptA/.*
  </filter>
  <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
          onMatch="include" field="document.reference">
    https://inside.qad.com/deptD/.*
  </filter>
</preParseHandlers>

However, with this configuration, I only get documents matching the first pattern, i.e. https://www.test.com/deptA/xyz/..., and nothing for deptD.

If I use just one URL at a time in the startURLs section, I see results for both sites (one at a time, depending on which URL is in the start URL section). But the moment I have both of them in the start URL section, only documents for the first URL pattern show up.

Now the interesting part is that if I add this to the filter section,

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.reference">
  https://www.test.com/deptD
</filter>

then I see just the https://www.test.com/deptD page. But when I look at the metadata for this page, I do not see any collector.referenced-urls field (it appears the moment I switch to a single start URL for deptD).

Am I using the filtering incorrectly?

I also tried defining them as referenceFilters outside the preParseHandlers section, but I get the same issue.
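For comparison, a crawler-level filter covering both departments might look like the sketch below. It assumes HTTP Collector 2.x, where `RegexReferenceFilter` lives in `com.norconex.collector.core.filter.impl`, and it assumes the regular expression has to match the whole URL, which is why the optional `(/.*)?` group is there: it lets the start URLs themselves (no trailing slash) pass along with everything beneath them. The host and paths are the placeholders used above.

```xml
<!-- Sketch only: assumed to sit inside the <crawler> element, outside <importer>. -->
<referenceFilters>
  <!-- One include pattern for both start URLs and all pages under them. -->
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="include">
    https://www\.test\.com/dept[AD](/.*)?
  </filter>
</referenceFilters>
```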

Thanks

essiembre commented 5 years ago

Which version of HTTP Collector are you using? What are the logs saying about the deptD? Do they get rejected? You may have to adjust the log level in the log4j.properties file to find out.

Given that it should definitely work, I would need a config (with a working URL) that can reproduce the issue.

dtcyad1 commented 5 years ago

Hi Pascal,

I tested this with another site, and adding multiple URLs does work as it should! I did find some issues with my website: the links don't play nicely with each other. That being said, I think my filters are causing an issue. While I understand the effect of placing the filters either inside or outside the importer, I am not sure what the difference is between adding a filter in the pre vs. post section of the importer.

Adding this

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"..

to the postParseHandlers section caused all results to be rejected, but adding it to the preParseHandlers section worked.

From the flow chart you provided (which is really awesome!), I am a little confused. There should not really be any difference, right?

Thanks

essiembre commented 5 years ago

Without your full config, it is hard to say. The important difference between "preParseHandlers" and "postParseHandlers" is that you usually have more fields available after parsing (since parsing extracts the metadata fields it finds).

But if you filter on document.reference, that field should be there both before and after parsing. Unless maybe you have a KeepOnlyTagger that does not include it?

For troubleshooting, if you want to know what fields are available at any point during the execution of your handlers, you can insert instances of DebugTagger in your importer configuration.
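As a rough sketch of that approach, assuming the Importer 2.x XML syntax in which `DebugTagger` accepts `logFields` and `logLevel` attributes, one instance could go before the filter and another after parsing, to check whether `document.reference` is present at each stage. The regular expression and host are placeholders, not taken from a real configuration.

```xml
<importer>
  <preParseHandlers>
    <!-- Log the reference as the handlers see it before parsing. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="document.reference" logLevel="INFO" />
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
            onMatch="include" field="document.reference">
      https://www\.test\.com/dept[AD](/.*)?
    </filter>
  </preParseHandlers>
  <postParseHandlers>
    <!-- Log it again after parsing, to see whether an earlier handler dropped it. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
            logFields="document.reference" logLevel="INFO" />
  </postParseHandlers>
</importer>
```

If the field shows up in the first log entry but not the second, a handler in between (for example a KeepOnlyTagger) is the likely cause.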

Please share your full config if you would like a deeper look.

Pittiplatsch commented 5 years ago

Adding this

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"..

to the postParseHandlers section caused all results to be rejected, but adding it to the preParseHandlers section worked.

Might the preParseHandler vs. postParseHandler filter issue be related to #86 (same symptom, different filter)?

dtcyad1 commented 5 years ago

Hi Pascal,

Thanks for your reply. I have it working great in the preParseHandlers section. Since the website is not public, I cannot send you the config file. I will close this for now.

Thanks