Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Import only pages with url matched regexp #13

Closed AntonioAmore closed 9 years ago

AntonioAmore commented 9 years ago

I want the importer to accept only pages whose URL matches a regexp from the config. I believe the Java class RegexMetadataFilter does that. The question is: which metadata field holds the page's URL, and does it exist at all? I use the default importer configuration and don't remap any fields.

essiembre commented 9 years ago

Note that a bug has been fixed recently when using onMatch="include" with the filters (see https://github.com/Norconex/collector-http/issues/108) so make sure you use the latest snapshot release.

The answer to your question: document.reference. Like this:

<importer>
    ...
    <preParseHandlers>
        <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
                onMatch="include" field="document.reference" >
            .*blah.*
        </filter>
    </preParseHandlers>
    ...
</importer>

If you also use a collector and you would like to filter URLs before they are downloaded, you can check the filtering options in your collector. For instance, the HTTP Collector offers this:

<crawler id="blah">
    ...
    <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter" 
               onMatch="include">
          .*blah.*
        </filter>
    </referenceFilters>
    ...
</crawler>

This would prevent a file from being downloaded. If you are not using a collector, or you want links to be extracted from a page before it is rejected, then using a filter in the importer like you want to do is the best approach.

If you want to see what fields are available to you for a document without having to explicitly save them all somewhere, have a look at the DebugTagger. You can add this tagger multiple times if you want to see what changed at different times in the import flow.

AntonioAmore commented 9 years ago

May I assume the filter works on absolute URLs? Or does it work with relative URLs?

One more question: is

<filter ...>~api.*$~is</filter>

a correct regexp usage for a filter (in preParseHandlers)?

essiembre commented 9 years ago

URLs are always absolute. Relative URLs in a page are converted to absolute right when they are read/extracted.
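To picture what "converted to absolute" means for a filter, here is a minimal sketch using the standard java.net.URI class. It is not the collector's own code; the class name AbsoluteUrlDemo, the helper toAbsolute, and the /docs/index.html page path are made up for illustration.

```java
import java.net.URI;

final class AbsoluteUrlDemo {

    // Resolves a link found in a page against that page's own URL.
    // Hypothetical helper; the collector has its own URL handling.
    static String toAbsolute(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.site.com/docs/index.html";

        // Relative links become absolute, so a filter never sees "../api-arc".
        System.out.println(toAbsolute(page, "../api-arc")); // http://www.site.com/api-arc
        System.out.println(toAbsolute(page, "/api-arc"));   // http://www.site.com/api-arc

        // Links that are already absolute are left unchanged.
        System.out.println(toAbsolute(page, "http://en.wikipedia.org/wiki/JSON"));
    }
}
```

The practical consequence is that a filter regex should be written against the full URL, scheme and host included.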

About your regex, yes, it is where you would put it. Is that what you want to know, or are you asking whether the regex itself is valid? I can't help with the latter as I do not know what you are trying to match. One thing is for sure: if you want to match $ as a character, you will have to escape it (i.e. \$).
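The point about escaping $ can be checked with plain java.util.regex. This is only a sketch: the class name DollarEscapeDemo, the helper fullMatch, and the sample URL are invented for the example, and it assumes the whole string has to match.

```java
import java.util.regex.Pattern;

final class DollarEscapeDemo {

    // True only if the entire input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.site.com/price$list";

        // Unescaped, $ is an end-of-input anchor, so nothing can follow it.
        System.out.println(fullMatch(".*price$list", url));   // false

        // Escaped, \$ matches a literal dollar sign. The backslash is doubled
        // only because this is a Java string literal; in the XML config a
        // single backslash is written.
        System.out.println(fullMatch(".*price\\$list", url)); // true
    }
}
```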

AntonioAmore commented 9 years ago

In my question I meant: does the current regexp implementation support ~~ instead of // as delimiters, and does it support flags like i (ignore case) and s (single line)?

In other words: may I put full-scale PCRE (http://www.pcre.org/) there, or only some kind of subset?

Question 2: I have the following config lines:

      <referenceFilters>
            #parse("${configdir}/config-reference-filters.xml")
      </referenceFilters>

config-reference-filter.xml:

<filter class="$filterRegexRef" onMatch="include" caseSensitive="false">~http\://www\.site\.com.*$~is</filter>

config-reference-filter.properties:

filterExtension = com.norconex.collector.core.filter.impl.ExtensionReferenceFilter
filterRegexRef  = com.norconex.collector.core.filter.impl.RegexReferenceFilter

and got the following log entry: REJECTED_FILTER: http://www.site.com

Why does it reject the URL in spite of it matching the regexp? I am using the latest 2.2.0 snapshot, downloaded today.

essiembre commented 9 years ago

I do not think it does match. The regex syntax it uses is the Java one, documented in the Javadoc for java.util.regex.Pattern. It is based on Perl regular expressions, with slight differences (all documented).

When you say use ~~ instead of //, I am assuming you are talking about your regex delimiters. You can't specify delimiters in this case. That's probably what is throwing it off. You probably just need this:

<filter ...>^http\://www\.eventstudytools\.com.*</filter>

The caret is likely optional too, since the odds are you won't find an unescaped URL inside another URL, but you never know.
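Why the delimiters throw the match off can be seen with java.util.regex directly. The sketch below is an illustration, not the filter's implementation: the class name RegexDelimiterDemo and the helper fullMatch are made up, and it assumes the filter needs the whole reference to match. In Java, ~ is an ordinary character and the trailing "is" is literal text, not flags.

```java
import java.util.regex.Pattern;

final class RegexDelimiterDemo {

    // Case-insensitive full match, standing in for caseSensitive="false".
    static boolean fullMatch(String regex, String reference) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(reference)
                .matches();
    }

    public static void main(String[] args) {
        String reference = "http://www.site.com";

        // Perl-style delimiters and flags are taken literally by Java,
        // so this would only match text that starts with a '~' character.
        String perlStyle = "~http\\://www\\.site\\.com.*$~is";

        // The same expression without delimiters or trailing flags.
        String javaStyle = "^http\\://www\\.site\\.com.*";

        System.out.println(fullMatch(perlStyle, reference)); // false
        System.out.println(fullMatch(javaStyle, reference)); // true
    }
}
```

The first pattern fails regardless of full or partial matching, because no URL contains the literal text "~is" after its end.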

The attribute you are already using (caseSensitive="false") should ignore the case.

As for single line, that is how this filter behaves by default. References should always be a single line, so I am not sure the flag makes any difference to characters such as a ^ prefix or $ suffix in that context.

To modify other default regex flags, short of doing it programmatically, you can embed those flags right into your regex. See the Javadoc comments for each flag constant in the link above.
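As a rough illustration of embedded flags, the sketch below uses (?i) and (?s), the inline forms of Pattern.CASE_INSENSITIVE and Pattern.DOTALL. The class name EmbeddedFlagsDemo, the helper fullMatch, and the sample strings are invented for the example.

```java
import java.util.regex.Pattern;

final class EmbeddedFlagsDemo {

    // Compiles with no flags at all, so only embedded flags take effect.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://www.site.com/API-ARC";

        // Without a flag, "api" does not match "API".
        System.out.println(fullMatch("http://www\\.site\\.com/api.*", url));     // false

        // (?i) at the start switches on case-insensitive matching.
        System.out.println(fullMatch("(?i)http://www\\.site\\.com/api.*", url)); // true

        // (?s) is the "single line" (dot-all) flag: '.' also matches newlines.
        System.out.println(fullMatch("a.*b", "a\nb"));     // false
        System.out.println(fullMatch("(?s)a.*b", "a\nb")); // true
    }
}
```

In the XML config the same flags would go at the very start of the filter's regex text, written with single backslashes.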

AntonioAmore commented 9 years ago

Thank you, I removed the unnecessary symbols, but now I get the following issue:

The pre-parse filter looks like:

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter" onMatch="include" field="document.reference" >/api.*$</filter>

The reference filter is:

<filter class="$filterRegexRef" onMatch="include" caseSensitive="false">http\://www\.site\.com.*$</filter>

The log is:

www.site.com: 2015-06-03 20:37:19 INFO -          DOCUMENT_FETCHED: http://www.site.com/api-arc (Subject: com.norconex.collector.http.fetch.impl.GenericDocumentFetcher@1588373)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://en.wikipedia.org/wiki/JSON (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: https://www.purdue.edu/opendigital (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://www.ifb.unisg.ch/en (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: https://www.facebook.com/plugins/like.php?href=http%3A%2F%2Fwww.site.com%2FAPI-ARC&layout=standard&show_faces=hide&width=450px&font=arial&height=80px&action=recommend&colorscheme=light&locale=en_US (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://en.wikipedia.org/wiki/Application_programming_interface (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://www.muon-stat.com/ (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://en.wikipedia.org/wiki/Remote_procedure_call (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://apps.muon-stat.com/EST-VIZ/ (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://www.davinci-invest.ch/ (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: http://nivis.org/ (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_FILTER: https://jobs.allianz.com/sap/bc/bsp/sap/zhcmx_erc_ui_ex/index.html?utm_campaign=AZSE&utm_source=marketing&utm_medium=site (Subject: none)
www.site.com: 2015-06-03 20:37:19 INFO -            URLS_EXTRACTED: http://www.site.com/api-arc
www.site.com: 2015-06-03 20:37:19 DEBUG - Document import rejected. Filter=RegexMetadataFilter[restrictions=[],onMatch=AbstractOnMatchFilter [onMatch=INCLUDE],document.reference,regex=/api.*$,caseSensitive=false]
www.site.com: 2015-06-03 20:37:19 INFO -           REJECTED_IMPORT: http://www.site.com/api-arc

As you can see from the log, the collector extracted the URL http://www.site.com/api-arc, so why was it rejected? It seems to match the regexp. I use the default importer.

essiembre commented 9 years ago

My guess is it does not match because your expression suggests the reference must start with /api. This will probably do it for you:

<filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
    onMatch="include" field="document.reference" >.*/api.*</filter>

Since you end your expression with .*, the $ probably has no effect, so I removed it.

You may want to prefix your expression with the domain once more if you do not want to match this, for instance:

http://whatever.com/somepath/blah/api/blah.html
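The three variants discussed here can be compared side by side with java.util.regex. This is a sketch under one assumption, consistent with the log above: the filter includes a document only when the entire reference matches. The class name ReferenceMatchDemo, the helper include, and the domain-prefixed pattern are illustrative, not taken from the Importer source.

```java
import java.util.regex.Pattern;

final class ReferenceMatchDemo {

    // Assumed behaviour: include only when the whole reference matches,
    // ignoring case (as in the logged filter, caseSensitive=false).
    static boolean include(String regex, String reference) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(reference)
                .matches();
    }

    public static void main(String[] args) {
        String wanted = "http://www.site.com/api-arc";
        String unwanted = "http://whatever.com/somepath/blah/api/blah.html";

        // Original expression: the reference would have to start with "/api".
        System.out.println(include("/api.*$", wanted));     // false

        // Leading .* lets the expression cover the scheme and host.
        System.out.println(include(".*/api.*", wanted));    // true
        // It also lets in other sites that have /api somewhere in the path.
        System.out.println(include(".*/api.*", unwanted));  // true

        // Prefixing the domain keeps only the intended site.
        String withDomain = "http://www\\.site\\.com/api.*";
        System.out.println(include(withDomain, wanted));    // true
        System.out.println(include(withDomain, unwanted));  // false
    }
}
```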

AntonioAmore commented 9 years ago

Thank you for the answer. You're absolutely right: Java RE (regular expressions) differ a bit from the PCRE I tried to apply. And it is logical that an app written in Java uses that syntax.

I propose noting in the documentation that the regular expressions must be Java-compatible, because it may be important for people who program in other languages. Until now I was sure that PCRE would play well in my configs.

essiembre commented 9 years ago

I am assuming it is working for you now. I added your suggestion about clarifying the documentation to the project to-do list. It should be done in the next release or so.