Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is it possible to extract several records from a single page? #253

Closed: liar666 closed this issue 8 years ago

liar666 commented 8 years ago

Hi,

I've used Heritrix for a while, so I understand how to crawl websites. But since I'm not satisfied with Heritrix, I'm currently looking at alternatives.

Norconex's API docs are good and the XML config file format is very self-explanatory. Unfortunately, I still have difficulty understanding the big picture of Norconex: what is possible (or not) with this tool and how the various jars interconnect (collector, importer, etc.). The provided examples are too simple for me to understand how to do more than just save the text-only version of a webpage to disk :{ You could improve your docs with more user-oriented tutorials (as opposed to the currently dev-oriented API docs), for instance starting with a goal like "I want to crawl website XXX, extract data YYY, and store that in ZZZ, so I do: step 1 xxx, step 2 yyy, etc., and that results in XML file kkk".

As far as I'm concerned, I still do not understand if/how I can do the following:

Let's say I have a page like https://ideas.repec.org/s/ags/fama04.html. On this page, you can find a list of publications, each with a title and a list of authors. What I would like to extract is exactly that: a list of publications, with a list of authors attached to each one of them.

I have several "records" on each page; is it possible to extract such info with Norconex?

Let's forget about how/where I want to import the data (we have our own format). Let's say that I want to get the data as a "tagged" file like:

PUBLICATION: title1
AUTHOR: auth1.1
AUTHOR: auth2.1
AUTHOR: auth3.1
PUBLICATION: title2
AUTHOR: auth1.2
AUTHOR: auth2.2
PUBLICATION: title3
AUTHOR: auth1.3
AUTHOR: auth2.3

Is that possible? How?

Now, if I want to iterate the process over several pages of the IDEAS site (I understood how to do that with referenceFilters), can/should I create such a file:

Now let's say I don't want a text file but want to import into a DB with my own format; am I correct in saying that I have to implement an importer?

essiembre commented 8 years ago

Thanks for your interest in our crawler and for the tutorial ideas. We always try to make it easier to grasp whenever we can, so newcomer experiences like yours are valuable and appreciated.

If we try to simplify as much as possible, you can probably picture it mainly as a 3-step process (sketched in configuration form right after this list):

  1. Crawling (pulling docs from their original location)
  2. Importing (parsing/extracting/manipulating)
  3. Committing (saving to a repository of your choice).
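
To make those steps a bit more concrete, here is a rough skeleton of how they map onto an HTTP Collector configuration file. The element names follow the 2.x sample configs, but the ids and URL are placeholders and this outline is not a complete, tested configuration:

<httpcollector id="My Collector">
  <crawlers>
    <crawler id="My Crawler">

      <!-- 1. Crawling: where to start and which pages to fetch/follow. -->
      <startURLs>
        <url>https://ideas.repec.org/s/ags/fama04.html</url>
      </startURLs>

      <!-- 2. Importing: parsing documents and extracting/manipulating fields. -->
      <importer>
        <preParseHandlers>
          <!-- filters, taggers, splitters, ... -->
        </preParseHandlers>
        <postParseHandlers>
          <!-- filters, taggers, splitters, ... -->
        </postParseHandlers>
      </importer>

      <!-- 3. Committing: saving the result to the repository of your choice. -->
      <committer class="(a Committer implementation)"/>

    </crawler>
  </crawlers>
</httpcollector>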

First, the HTTP Collector configuration documentation mostly covers the crawling options.

Once a document is crawled, what gets extracted into what field is done by the Importer module (which is also used by the Filesystem Collector). By default it will try to create fields for every piece of metadata naturally found in documents (document properties, HTML meta tags, etc.), but you will normally want to create your own and filter out those you do not want to keep.

Finally, once you are done extracting/manipulating what you want from a document, you have to save it somewhere, and that is the job of a Committer. There are a bunch of Committers already available, but you are right that you may have to create your own if your target repository is not readily supported. If you know your Java, you should be just fine creating one yourself. Otherwise, Norconex can provide professional assistance and create a custom one for you (or you can file a feature request if time is not of the essence).
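
As a point of reference, a Committer is just another section of the crawler configuration. For example, the FileSystemCommitter that ships with Committer Core simply writes each document and its fields to a directory on disk; the snippet below follows the 2.x docs but is untested, and the path is a placeholder:

<!-- Save committed documents and their fields as files on disk. -->
<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
  <directory>/path/to/committed-files</directory>
</committer>

A custom Committer written in Java would be plugged in the same way, by putting its fully qualified class name in the class attribute.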

Now about your specific use case... It should be relatively easy to do, depending on how complex/simple you want to make it.

I see that your sample URL holds a listing of publications and that each publication has its own page. On each publication page, I see authors, title, publication date, and a lot of good info already structured in HTML meta tags. Those will naturally get picked up by the crawler and new fields will be created for you. By crawling those pages you will automatically get the equivalent of one "record" per page. That seems like the best option to me (just crawl those pages). So given the start URL you mention in your post, all articles should be crawled the way you want. If you do not want the listing pages "committed", simply filter them out in the Importer with a pre-parse handler. You can do something like this to filter out the listing pages (not tested):

<importer>
    <preParseHandlers>
      <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
              onMatch="include" field="document.reference" >
         https://ideas.repec.org/p/.*
      </filter>
    </preParseHandlers>
</importer>

That will make sure only the publication pages (/p/.*) will be committed.

Once an HTML page reaches the Importer module, you can assume the URLs it contains have already been extracted and will be followed. This means it can safely be discarded at this stage.

To only keep the fields you want, you can use the KeepOnlyTagger. To rename them, you can use the RenameTagger. You can also use the DebugTagger while you test things out to print the extracted fields on screen. It may make your job easier.
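
For instance, something along these lines as post-parse handlers; the field names are placeholders and the exact attribute/element names should be checked against the Importer documentation for your version (untested):

<postParseHandlers>

  <!-- Keep only the fields you care about. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>title, author</fields>
  </tagger>

  <!-- Rename a field to match your own schema. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
    <rename fromField="title" toField="publication" overwrite="true"/>
  </tagger>

  <!-- Log the extracted fields while you test things out. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO"/>

</postParseHandlers>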

If you would rather create individual records from the listing page you provided without crawling the individual publication pages, you can also do so by using a document "splitter" like DOMSplitter. But I honestly think you would make your life easier by simply crawling the individual pages and extracting what you want from each instead.

Does that answer most of your questions? Let me know if you are still unsure about certain things.

liar666 commented 8 years ago

Hi,

Thanks for your perfect and quick response! It is exactly what I needed!!! You should definitely create a page with this info on your site (the part that is not specific to my problem) :) It is very clear and simple (with/as a picture, it would be even better)! My only remark concerns the "Importer" name, which I find not clearly related to the data-manipulation/transformation task it actually fulfills, but with the clear explanation above, the name is not important; we can get used to it... Also, renaming a whole package would not be advisable...

For now, I'll follow your advice and crawl the individual "publication pages" directly. However, I have other sites where I need to parse "list pages" (e.g., directories of people, like http://www.anu.edu.au/_anu/staffdir/search.php?q=plasma&submit=search), and I would love to see an example with DOMSplitter somewhere on your website.

Big thanks again. It was really useful.

essiembre commented 8 years ago

About your suggestion of a picture and a listing of the 3 steps: there is a video with an example at the end that explains the basic concepts, and the three steps are listed next to it at the top of the Collectors home page. Maybe you landed directly on the HTTP Collector website. Regardless, we'll keep in mind adding a nice tutorial and/or pictures explaining the "big picture" of how things work in simple terms. Thanks for your input.

We try to write articles with samples when new features are released. For DOM-related operations, here is a copy-paste from an article on Norconex website:

<!-- Exclude documents containing GIF images. -->
<filter class="com.norconex.importer.handler.filter.impl.DOMContentFilter"
      selector="img[src$=.gif]" onMatch="exclude" />

<!-- Store H1 tags in a title field. -->
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <dom selector="h1" toField="title" overwrite="false" />
</tagger>

<!-- Create a new contact document for each occurrence of the "contact" tag. -->
<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
    selector="contact" />

It is good to know that, once the DOM splitter does its magic, each individual document created will be sent back to the Importer module for processing (as if it were a standalone document). You can of course write your own splitters too.

liar666 commented 8 years ago

Thanks again for your very informative and detailed answers.

Indeed, this morning, I just found the page where the 3 points are explained. You are right, I missed them the first time, since I was referred directly to the collector homepage!