Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Can you help me with basic configuration? #308

Closed PaboukCZ closed 7 years ago

PaboukCZ commented 7 years ago

Hello there! I have been looking for a simple web crawler, found this project, and liked it very much. The problem is that I can't find any useful tutorials for beginners and don't know how to set it up correctly to do what I want (even after reading the documentation). I want to crawl about 10 different websites; here is an example URL: start web page. There, I want to crawl over all the departments (there are 5 of them) and all their courses (the course code and course title have the same URL). For each course I want to save its title and then its content. The content of each course looks like this: course detail. From this URL I want to extract (for simplicity, for now) only the Annotation and the Lecture syllabus text.

So the output I want to achieve (for the first department) is something like:

Dept of Computer Science:
- Advanced Algorithms
-- Annotation: "text"
-- Lecture syllabus: "text"
- Algorithms and Graphs 1
-- Annotation: "text" ....

and so on. Then I want to store this information in Elasticsearch, but that's not part of this question.

Please, can someone give me a clue how to achieve this and how to set up the XML configuration files? Is what I need to achieve even possible? Any advice or link to a tutorial/examples is appreciated. Thanks!

essiembre commented 7 years ago

Look at the example configurations that come with the project. By specifying proper start URLs, it should just crawl your entire site. You can specify a maxDepth to only crawl a few levels deep. I would start with this to see if you get that far. Then, start refining (clearing your "workdir" between each run to start fresh).
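For reference, a minimal starting configuration along those lines might look like this (a rough sketch only; the start URL is a placeholder and the exact element names may vary between collector versions, so check the sample configs shipped with the project):

```xml
<httpcollector id="Courses Collector">
  <crawlers>
    <crawler id="Courses Crawler">
      <startURLs>
        <!-- Placeholder: replace with your department listing page -->
        <url>https://example.com/departments/</url>
      </startURLs>
      <!-- Only follow links one level down from the start page -->
      <maxDepth>1</maxDepth>
      <!-- Clear this directory between runs to start fresh -->
      <workDir>./workdir</workDir>
    </crawler>
  </crawlers>
</httpcollector>
```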

On the pages where you want to extract specific fields, look at the importer configuration options. You can configure the importer section of your config to use something like the TextPatternTagger to store the content of matching patterns in the field(s) of your choice (which will then be sent to your Elasticsearch).

PaboukCZ commented 7 years ago

Thanks for your answer! Well, I did the first step before asking my question: I managed to crawl all the courses available from the start page (I just set maxDepth to 1). The problem is I don't know how to configure the importer (and TextPatternTagger) to extract only the specific information I want. I will read through the documentation again and try to set up the TextPatternTagger.

essiembre commented 7 years ago

You define the <importer> section anywhere in your <crawler ...> section. Then, to apply handlers (taggers, transformers, filters, splitters), you have the option to do so before text has been extracted from files, or after (pre- vs. post-parse handlers). In your case, if you need to rely on HTML tags being present to find the patterns you want, you may want to do it with pre-parse handlers. You should also make sure to limit it to HTML pages, or you may end up trying to match text patterns in binary files (e.g., PDF). For instance, if you know your annotations always have this pattern in your HTML pages (I looked at one of your pages):

    <p> <b>Annotation:</b><br>
      ... the actual text here... 
      <br>
      </p>

Then you can probably try something like this (not tested):

<importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
          <pattern field="annotation" group="1"><![CDATA[
            <p>\s+<b>Annotation:</b><br>\s+(.*?)\s+<br>\s+</p>
          ]]></pattern>
          <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>
    </preParseHandlers>
</importer>

PaboukCZ commented 7 years ago

The regexp you wrote should be correct, thanks for this example! So I tried to run the crawl with this importer setting (pastebin) and am still getting "strange" results. In the CrawledFiles folder, when I open some random crawled page file (I am looking for the files with the .cntnt suffix, right?) I see these results: one of the result content files. Is this right? Am I still missing something, or is the pattern field ("annotation") I set stored somewhere else? The project is not "bound" to Elasticsearch yet; could that be the issue?

essiembre commented 7 years ago

When using the Filesystem Committer, the .cntnt files hold the extracted text. Extracted/created fields are stored in .meta files (Java Properties file format).
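As an illustration, a .meta file for one of the crawled pages could contain something like this (an invented sketch: the annotation value and URL are placeholders, and the exact set of fields depends on your configuration):

```properties
# Illustrative .meta file contents (Java Properties format)
document.reference = https\://example.com/course/advanced-algorithms
document.contentType = text/html
annotation = The extracted annotation text would appear here...
```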

Having a look at these files is a great way to see exactly what gets produced, but I invite you to look at the DebugTagger as another option to make your life easier when you troubleshoot/implement. That tagger will print the field values gathered so far in your log. You can try putting this after your TextPatternTagger:

  <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
          logFields="annotation" logContent="false" logLevel="INFO" />

When triggered, the above example will log the contents of your "annotation" field at the INFO level. You can leave logFields out to print all fields.

PaboukCZ commented 7 years ago

Thanks for your advice! Now I see I'm getting a "null" value for my "annotation" field, so the problem is in the regexp. I will play with it to get all the information I need. Thanks again.

essiembre commented 7 years ago

No problem!

PaboukCZ commented 7 years ago

Well, one more question about regexps after all. When I'm trying to get the Lecture syllabus (on this page, for example), it has a very "ugly" table layout. So I wrote a regexp like this: <![CDATA[<p>\s*<b>Lecture syllabus:</b>\s*<br>\s*<TABLE CELLPADDING="0" CELLSPACING="0">\s*(?:<TR VALIGN="BASELINE"><TD\sALIGN="RIGHT">.*\.</TD><TD>&nbsp;</TD><TD>(.*)</TD></TR>\s*)*]]> I think it works fine. The problem is in the number of groups I need to match. I need to repeat the <tr>...</tr> pattern an unknown number of times, but the regexp is returning only the last matched value (only the last table row). I found this problem on Stack Overflow, and the solution there (in a Java application) was to use the Matcher.find() method. Is there something similar when I'm using only the XML configuration file for my crawl?

essiembre commented 7 years ago

If you have trouble doing one thing with a specific handler, keep in mind you can use more than one to help you out. There might be a few ways to accomplish what you want, but here is an example:

<importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
            <pattern field="lectureSyllabusTable" group="1"><![CDATA[
              <p>\s+<b>Lecture syllabus:</b><br>\s+<TABLE.*?>(.*?)</TABLE>
            ]]></pattern>
            <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>

        <tagger class="com.norconex.importer.handler.tagger.impl.SplitTagger">
            <split fromField="lectureSyllabusTable" toField="lectureSyllabusRows" regex="true">
                <separator><![CDATA[
                  </TD></TR>
                ]]></separator>
            </split>
            <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>

        <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
            <replace fromField="lectureSyllabusRows" toField="lectureSyllabus"
                     regex="true">
                <fromValue><![CDATA[
                  <TR.*<TD>(.*)
                ]]></fromValue>
                <toValue>$1</toValue>
            </replace>
          <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>        

        <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger"
                logFields="lectureSyllabus" logContent="false" logLevel="INFO" />        
    </preParseHandlers>
</importer>

If you find this too verbose, you can also create your own Tagger in Java that does exactly what you want.

PaboukCZ commented 7 years ago

This works fine! Thanks again for your advice and clear examples!

essiembre commented 7 years ago

Just an FYI: small enhancements were made to ReplaceTagger and can be found in the latest Importer snapshot release. There are two new flags: wholeMatch and replaceAll. The latter allows you to replace all occurrences of a match in fromValue with the value of toValue. With this new option, you can further simplify your last use case (assuming you want all the text as a single value instead of a multi-value field):

        <tagger class="com.norconex.importer.handler.tagger.impl.TextPatternTagger">
            <pattern field="lectureSyllabusTable" group="1"><![CDATA[
              <p>\s+<b>Lecture syllabus:</b><br>\s+<TABLE.*?>(.*?)</TABLE>
            ]]></pattern>
            <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>
        <tagger class="com.norconex.importer.handler.tagger.impl.ReplaceTagger">
            <replace fromField="lectureSyllabusTable" toField="lectureSyllabus"
                     regex="true" replaceAll="true">
                <fromValue><![CDATA[
                  <TR.*?&nbsp;</TD><TD>(.*?)</TD></TR>
                ]]></fromValue>
                <toValue>$1</toValue>
            </replace>
          <restrictTo field="document.contentType">text/html</restrictTo>
        </tagger>

This importer update is included in the latest HTTP Collector snapshot release.