Closed dhildreth closed 7 years ago
I added a new wildcard dynamic field in Solr with type 'string'. Changing it to type 'text_en' seemed to resolve the error. However, the crawler still isn't staying on 'myredacteddomain.com' and is indexing electronics-lab.com documents.
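For reference, here is a sketch of the kind of schema change that was made, assuming a managed-schema and a hypothetical `*_txt` wildcard pattern (the actual field name was not shown):

```xml
<!-- Hypothetical Solr managed-schema fragment: the wildcard dynamic
     field is mapped to text_en (analyzed/tokenized) instead of string
     (a single untokenized term), which avoids the "immense term" error
     on large content values. -->
<dynamicField name="*_txt" type="text_en" indexed="true" stored="true"/>
```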
Also, I realized there is an extension referenceFilter type, so I changed the filter to:
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
      onMatch="exclude" caseSensitive="false">
    zip
  </filter>
</referenceFilters>
INFO [CrawlerEventManager] DOCUMENT_COMMITTED_ADD: http://www.electronics-lab.com/badgerboard-lora-future-iot-development-board/
stayOnDomain will not process subdomains. If you want to do so, set it to false and use reference filters instead for more flexibility.
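A minimal sketch of the reference-filter approach, assuming the collector core's RegexReferenceFilter and using the redacted domain as a placeholder:

```xml
<!-- Sketch: include the apex domain and all of its subdomains with a
     regex reference filter; everything else is rejected because an
     "include" filter makes non-matching references excluded. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
      onMatch="include">
    https?://([^/]+\.)?myredacteddomain\.com/.*
  </filter>
</referenceFilters>
```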
For www.electronics-lab.com, it should not be picked up. If you crawled it once, it will remain as an orphan and be processed again by default. I suggest you clear your workdir and try again. You can change it to have orphans deleted instead with the following in your collector config:
<orphansStrategy>DELETE</orphansStrategy>
Thanks for the response. That was helpful. I deleted all files within the workdir and it seems to have worked. :-) I also added the orphansStrategy line for the future.
Okay, so you're saying stayOnDomain doesn't work with subdomains then. What about this?
<startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
<sitemap>https://www.myredacteddomain.com/sitemap.xml</sitemap>
<url>https://wiki.myredacteddomain.com</url>
<url>https://support.myredacteddomain.com</url>
</startURLs>
I would expect it to crawl all of these subdomains and stay on the myredacteddomain.com domain. Sound right, or do I need to use reference filters?
The "stayOnDomain" is per start URL, so yes, it will cover both "wiki" and "support" subdomains. From memory, I suspect the sitemap will also be just fine, but let me know if you suspect any issues.
Note though, that unless you disabled sitemap support, you can also put a start URL for "https://www.myredacteddomain.com" and it will automatically look for a "sitemap.xml" file and use it if present.
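In other words, with sitemap support left enabled (the default), a plain start URL is enough; a minimal sketch:

```xml
<!-- Sketch: no explicit <sitemap> entry needed; the crawler will look
     for /sitemap.xml at the standard location and use it if present. -->
<startURLs stayOnDomain="true">
  <url>https://www.myredacteddomain.com/</url>
</startURLs>
```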
While it works both ways, using <sitemap> as a start URL is particularly useful when the sitemap location is not standard, or when you want to crawl ONLY the URLs in the sitemap (which requires setting "maxDepth" to zero).
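The sitemap-only case can be sketched like this, assuming the same redacted domain and the crawler-level maxDepth element:

```xml
<!-- Sketch: crawl only the URLs listed in the sitemap. With maxDepth
     set to zero, links discovered inside those pages are not followed. -->
<startURLs stayOnDomain="true">
  <sitemap>https://www.myredacteddomain.com/sitemap.xml</sitemap>
</startURLs>
<maxDepth>0</maxDepth>
```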
Perfect! Thank you.
You are welcome!
I'm attempting to resolve an error I see when doing an initial test crawl and seeing some strange behavior. First, here's the relevant parts of my config file:
So, I'd like the crawler to stay on 'myredacteddomain.com' and I'd like it to ignore any .zip files it comes across.
Here's what happens when I run the crawler; the key error is: "Error from server at http://solr.myredacteddomain.com:98765/solr/norconex_crawler: Exception writing document id http://www.electronics-lab.com/wp-content/uploads/2017/04/GERBERs.zip to the index; possible analysis error: Document contains at least one immense term in field="content"…"
This is strange because 1) we should be staying on the 'myredacteddomain.com' domain, and 2) the .zip file should be ignored and never committed to the index. Unless I'm reading this message wrong?
Thanks for any help in advance.