Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Messy code in 'collector.referrer-link-title' #229

Closed: bruce-genhot closed this issue 8 years ago

bruce-genhot commented 8 years ago

Most of our garbled-text ("messy code") issues are now resolved, but one remains. Here is my configuration:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Minimum Config HTTP Collector">
    <progressDir>./www.hngzzx.com/progress</progressDir>
    <logsDir>./www.hngzzx.com/logs</logsDir>
    <crawlers>
        <crawler id="www.hngzzx.com">
            <startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
                <url>
                    <![CDATA[http://www.hngzzx.com/HomePage/ShowList.aspx?tbid=11]]>
                </url>
            </startURLs>
            <workDir>./www.hngzzx.com</workDir>
            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"
                        caseSensitive="false">
                    <![CDATA[(http://www\.hngzzx\.com/HomePage/ShowList\.aspx\?tbid=11)|(http://www\.hngzzx\.com/HomePage/ShowInfoDetail\.aspx\?Id=3997&TableID=11)]]>
                </filter>
            </referenceFilters>
            <importer>
            </importer>
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./examples-output/minimum/crawledFiles</directory>
            </committer>
            <linkExtractors>
                <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" maxURLLength="2048" ignoreNofollow="true" keepReferrerData="true">
                    <contentTypes>text/html</contentTypes>
                    <tags>
                        <tag name="a" attribute="href"/>
                    </tags>
                </extractor>
            </linkExtractors>
        </crawler>
    </crawlers>
    <crawlerDefaults>
        <maxDepth>1</maxDepth>
        <robotsTxt ignore="true"/>
        <robotsMeta ignore="true"/>
        <sitemap ignore="true"/>
        <sitemapResolverFactory ignore="true"/>
        <delay default="5000"/>
    </crawlerDefaults>
</httpcollector>

The value of 'collector.referrer-link-title' contains spaces that should not be there, so the title text comes out garbled. See below.

Thanks.

essiembre commented 8 years ago

The way text files were read for link extraction could cause some characters to be broken in specific circumstances, resulting in spaces in this case. This has been fixed in the latest snapshot release. Please confirm.
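
The thread does not say exactly where the reading went wrong, so the following is only an illustrative assumption, not Norconex's actual code. One common way this class of bug arises is decoding a byte stream in fixed-size chunks, each on its own: a multi-byte character that straddles a chunk boundary is then decoded as broken pieces. A minimal Python sketch, with hypothetical function names and a made-up link title:

```python
import codecs


def decode_per_chunk(data: bytes, chunk_size: int, encoding: str = "utf-8") -> str:
    """Buggy approach (assumed for illustration): decode every chunk
    independently, so a character split across two chunks is damaged."""
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(chunk.decode(encoding, errors="replace"))
    return "".join(parts)


def decode_incrementally(data: bytes, chunk_size: int, encoding: str = "utf-8") -> str:
    """Safer approach: an incremental decoder holds back the bytes of an
    incomplete character until the next chunk completes it."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        parts.append(decoder.decode(chunk))
    # Flush anything still buffered at the end of the stream.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Hypothetical link title; each character takes 3 bytes in UTF-8,
# so a 4-byte chunk size is guaranteed to cut through characters.
title = "学校新闻"
raw = title.encode("utf-8")

broken = decode_per_chunk(raw, chunk_size=4)
intact = decode_incrementally(raw, chunk_size=4)
```

In this sketch the damaged characters appear as U+FFFD replacement characters; what they turn into in practice (spaces, in the case reported here) depends on how the reading code handles undecodable bytes.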

bruce-genhot commented 8 years ago

OK, thanks.

bruce-genhot commented 8 years ago

Yes, fixed.