Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

collector.referrer-link-title is not extracted when using GenericLinkExtractor #204

Closed bruce-genhot closed 8 years ago

bruce-genhot commented 8 years ago

collector.referrer-link-text is extracted correctly, but collector.referrer-link-title is not.

essiembre commented 8 years ago

Do you mean it is not extracted at all, or the charset is not correct (related to #194)?

If the first, please mention the URL. If the second, can you try by adding the attribute charset="utf-8" (or whatever encoding you know your page is in)?

bruce-genhot commented 8 years ago

@essiembre , Merry Christmas.

It was not extracted at all. Here is my configuration file. By the way, I am trying to use HtmlUnit to handle JavaScript/AJAX content; it works well for some pages but not for others, so it is unstable.

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Minimum Config HTTP Collector">
    <progressDir>./www.spprec.com/progress</progressDir>
    <logsDir>./www.spprec.com/logs</logsDir>
    <crawlers>
        <crawler id="www.spprec.com">
            <startURLs stayOnDomain="false" stayOnPort="false" stayOnProtocol="false">
                <url>
                    <![CDATA[http://www.spprec.com/sczw/jyfwpt/005003/005003001/MoreInfo.aspx?CategoryNum=005003001]]>
                </url>
            </startURLs>
            <workDir>./www.spprec.com</workDir>
            <referenceFilters>
                <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" onMatch="include"
                        caseSensitive="false">
                    <![CDATA[(http://www\.spprec\.com/sczw/jyfwpt/005003/005003001/MoreInfo\.aspx\?CategoryNum=005003001)|(http://www\.spprec\.com/sczw/InfoDetail/Default\.aspx\?InfoID=.*)]]>
                </filter>
            </referenceFilters>
            <importer>
                <preParseHandlers>
                    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
                        <replace>
                            <fromValue>content="text/html; charset=gb2312"</fromValue>
                            <toValue>content="text/html; charset=utf-8"</toValue>
                        </replace>
                        <restrictTo field="document.contentType">text/html</restrictTo>
                    </transformer>
                </preParseHandlers>
                <postParseHandlers>
                    <tagger class="com.norconex.importer.handler.tagger.impl.ConstantTagger">
                        <constant name="Content-Encoding">UTF-8</constant>
                        <constant name="type">采购公告</constant>
                        <constant name="location">中国.四川</constant>
                        <constant name="source">四川公共资源交易网</constant>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
                        <fields>
                            collector.referrer-link-title,document.reference,content,type,location,source,collector.referenced-urls,collector.referrer-reference,timestamp
                        </fields>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
                        <rename fromField="collector.referrer-link-title" toField="title" overwrite="true"/>
                        <rename fromField="document.reference" toField="link" overwrite="true"/>
                        <rename fromField="collector.referrer-reference" toField="reference" overwrite="true"/>
                        <rename fromField="collector.referenced-urls" toField="sourceLink" overwrite="true"/>
                    </tagger>
                    <tagger class="com.norconex.importer.handler.tagger.impl.CurrentDateTagger" field="timestamp"
                            format="yyyy-MM-dd hh:mm:ss" overwrite="true"></tagger>
                </postParseHandlers>
            </importer>
            <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
                <directory>./examples-output/minimum/crawledFiles</directory>
            </committer>
        </crawler>
    </crawlers>
    <crawlerDefaults>
        <maxDepth>1</maxDepth>
        <robotsTxt ignore="true"/>
        <robotsMeta ignore="true"/>
        <sitemap ignore="true"/>
        <sitemapResolverFactory ignore="true"/>
        <delay default="5000"/>
    </crawlerDefaults>
</httpcollector>

bruce-genhot commented 8 years ago

Actually, if I use collector.referrer-link-text in the above configuration, the title is extracted as expected. However, some of the other sites I fetch have links like the one below. As you can see, there is HTML code in the text of the <a> (I have very special cases), so collector.referrer-link-text cannot be used to extract the title. I was therefore thinking of using collector.referrer-link-title, but found that it does not work at all.

<a href="/hubeizxwz/InfoDetail/Default.aspx?InfoID=93c4c068-1618-4687-acfc-0f6cf11fa27c&CategoryNum=004001001001" target="_blank" title="<font color=red>[新系统]</font>武汉武船投资控股有限公司公共租赁住房项目项目报建公告"><font color=red>[新系统]</font>武汉武船投资控股有限公司公共租赁住房项目项目报建公告</a>

So if the referrer link text contains HTML code, collector.referrer-link-text cannot be extracted.

essiembre commented 8 years ago

Are you saying it does not work in your case because of the HTML, or the feature is broken?

Because maybe you can use the ReplaceTagger (Importer module) to strip the HTML from your title?
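As a rough illustration of what such a replacement would do to the value, here is a stand-alone Python sketch. It is not Norconex configuration: the `TAG_PATTERN` regex and the `strip_tags` helper are assumptions made for this example, and the input is the link title quoted earlier in this thread.

```python
import re

# Hypothetical pattern (an assumption for this sketch): anything that looks
# like an HTML tag, i.e. "<", one or more non-">" characters, then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(value: str) -> str:
    """Remove tag-like substrings and trim surrounding whitespace."""
    return TAG_PATTERN.sub("", value).strip()


# Title value taken from the sample link posted above.
title = "<font color=red>[新系统]</font>武汉武船投资控股有限公司公共租赁住房项目项目报建公告"
print(strip_tags(title))
```

A regex like this is only adequate for simple inline markup such as the `<font>` wrapper shown; it is not a general HTML sanitizer.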

bruce-genhot commented 8 years ago

It does not work because of the HTML code in the text. I think the extractor runs before the importer, so ReplaceTagger in the importer may not be a solution. I will give it a try, thanks.

essiembre commented 8 years ago

You are right, it will likely not work. I will try to extract the text as it is then, including the HTML. I did not think putting HTML in the title attribute was a supported HTML practice though. In any case, I will mark this as a feature request to support HTML in href titles.

essiembre commented 8 years ago

Can you provide me with a URL that has those special href titles? I checked the source for http://www.spprec.com/sczw/jyfwpt/005003/005003001/MoreInfo.aspx?CategoryNum=005003001 and I could not find any.

bruce-genhot commented 8 years ago

Yes, please take a look at http://ggzy.jiangxi.gov.cn/jxzbw/jyxx/002001/002001002/MoreInfo.aspx?CategoryNum=002001002 and check the link with the text '[宁都县]宁都县东山坝卫生院保障性住房工程'.

essiembre commented 8 years ago

Thanks for the sample, I now understand what you mean with the GenericLinkExtractor and I am able to reproduce. There are two problems:

1) The title="whatever" attribute does not get picked up.
2) The HTML tags you have are in the <a href...> body, and not the title attribute.

Those are definitely things to fix.
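To make the two cases concrete, here is a minimal stand-alone sketch in Python using the standard `html.parser` module. It is not the GenericLinkExtractor implementation or the actual fix; the `AnchorCollector` class and the sample markup are hypothetical. It only shows the intended result: the `title` attribute is captured, and text nested inside tags in the anchor body is still collected.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href, title, and body text for each <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []        # one dict per completed anchor
        self._current = None   # anchor currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            self._current = {
                "href": attr_map.get("href"),
                "title": attr_map.get("title"),  # case 1: title attribute
                "text": "",
            }

    def handle_data(self, data):
        # Case 2: text is accumulated even when wrapped in nested tags.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


# Hypothetical markup with a title attribute and HTML inside the anchor body.
sample = ('<a href="/detail?id=1" title="Plain title">'
          '<font color=red>[New]</font> Plain title</a>')

collector = AnchorCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

With this input, the single collected link carries the title "Plain title" and the body text "[New] Plain title", with the `<font>` markup dropped.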

essiembre commented 8 years ago

I made a new snapshot release which should fix both issues. Titles are now extracted with either TikaLinkExtractor or GenericLinkExtractor. Same with anchor body text if it contains HTML. Please test and keep an eye open for any regression issues.

bruce-genhot commented 8 years ago

OK, I will try the latest snapshot release, thank you.

bruce-genhot commented 8 years ago

@essiembre , it works now.

essiembre commented 8 years ago

@bruce-genhot, that's great! Thanks for confirming and I almost forgot: happy holidays to you too! :-)