Messy code in 'collector.referrer-link-title'

bruce-genhot commented 8 years ago

Most of our issues with garbled text ("messy code") have already been resolved, but one still remains. Here is my configuration:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Minimum Config HTTP Collector">
    <progressDir>./www.hngzzx.com/progress</progressDir>
    <logsDir>./www.hngzzx.com/logs</logsDir>
    <crawlers>
        <crawler id="www.hngzzx.com">
            <startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
                <url>
                    <![CDATA[http://www.hngzzx.com/HomePage/ShowList.aspx?tbid=11]]>
                </url>
            </startURLs>
            <workDir>./www.hngzzx.com</workDir>
            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"
                        caseSensitive="false">
                    <![CDATA[(http://www\.hngzzx\.com/HomePage/ShowList\.aspx\?tbid=11)|(http://www\.hngzzx\.com/HomePage/ShowInfoDetail\.aspx\?Id=3997&TableID=11)]]>
                </filter>
            </referenceFilters>
            <importer>
            </importer>
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./examples-output/minimum/crawledFiles</directory>
            </committer>
            <linkExtractors>
                <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" maxURLLength="2048" ignoreNofollow="true" keepReferrerData="true">
                    <contentTypes>text/html</contentTypes>
                    <tags>
                        <tag name="a" attribute="href"/>
                    </tags>
                </extractor>
            </linkExtractors>
        </crawler>
    </crawlers>
    <crawlerDefaults>
        <maxDepth>1</maxDepth>
        <robotsTxt ignore="true"/>
        <robotsMeta ignore="true"/>
        <sitemap ignore="true"/>
        <sitemapResolverFactory ignore="true"/>
        <delay default="5000"/>
    </crawlerDefaults>
</httpcollector>

The value of 'collector.referrer-link-title' contains spaces that should not be there, so the title text comes out garbled. See below.

Thanks.

essiembre commented 8 years ago

The way text files were read for link extraction could cause some characters to be broken in specific circumstances, resulting in spaces in this case. This has been fixed in the latest snapshot release. Please confirm.
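
To illustrate the general class of problem (not the actual Norconex code, which this thread does not show), here is a minimal Java sketch of one common way characters get broken when text is read in pieces: decoding each byte chunk on its own, so that a multi-byte character split across a chunk boundary is corrupted. The class name `ChunkedDecodeDemo`, the method names, the chunk size, the sample title, and the use of UTF-8 are all assumptions made for the example. The corruption here shows up as replacement characters rather than the spaces reported above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Norconex.
final class ChunkedDecodeDemo {

    // Fragile approach: every chunk is decoded independently, so a
    // multi-byte character that straddles two chunks cannot be decoded
    // and is replaced with U+FFFD.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += chunkSize) {
            int len = Math.min(chunkSize, data.length - i);
            sb.append(new String(data, i, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Safer approach: a single Reader keeps decoder state between reads,
    // so incomplete byte sequences are carried over to the next read.
    static String decodeWithReader(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample link title (assumed): four CJK characters, 3 bytes each in UTF-8.
        String title = "\u65b0\u95fb\u4e2d\u5fc3";
        byte[] bytes = title.getBytes(StandardCharsets.UTF_8);

        // A chunk size of 4 does not align with the 3-byte characters.
        System.out.println("per chunk matches: "
                + decodePerChunk(bytes, 4).equals(title));
        System.out.println("reader matches:    "
                + decodeWithReader(bytes, 4).equals(title));
    }
}
```

The point of the sketch is only that decoding must keep state across read boundaries. Whether the fix in the snapshot release works this way is not stated in this thread.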

bruce-genhot commented 8 years ago

OK, thanks.

bruce-genhot commented 8 years ago

Yes, fixed.

Norconex / crawlers

Messy code in 'collector.referrer-link-title' #229