Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Question about DOMTagger #372

Closed akshaybijawe closed 7 years ago

akshaybijawe commented 7 years ago

Attachments: meta.txt, config.txt

Hi Pascal, I have a question about the DOMTagger implementation. There is a page that I want to crawl. This is a snippet from its HTML:

<div class="info-title">Sponsor:</div>
<div class="info-text" id="sponsor"> Rockefeller University </div>
<div class="info-title">Information provided by (Responsible Party):</div>
<div class="info-text">Dana Orange, Rockefeller University</div>

This is my config:

  <importer>
    <preParseHandlers>
        <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
            <dom selector="div.info-title" toField="field1" />
            <dom selector="div.info-text"  toField="Value1" />
        </tagger>
    </preParseHandlers>
  </importer>

In the .meta file, I get the following output (snippet):

Value1=National Center for Research Resources (NCRR)^|~National Center for Research Resources (NCRR)
field1=Sponsor\:^|~Information provided by\:^|~ClinicalTrials.gov Identifier\:

Now my question is: how can I extract these fields separately? Also, how would I implement this in Java? I have tried this:

DOMTagger dom = new DOMTagger();
DOMExtractDetails domE = new DOMExtractDetails("from", "to", false);
dom.addDOMExtractDetails(domE);

Does this DOMTagger need to be configured in HTTPCollectorConfig or HTTPCrawlerConfig? Thank you.

essiembre commented 7 years ago

The separator you see is part of the internal storage format. It shows that all values were extracted properly into multi-value fields.
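To make the multi-value point concrete, here is a minimal sketch that reads one line like the .meta snippet above and splits it back into individual values. It assumes the .meta file can be read as a Java properties file (the escaped `\:` suggests this) and that the separator is literally `^|~` as shown. Both are assumptions about an internal format that may change. `MetaValues` and `values` are hypothetical names, not part of Norconex.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

class MetaValues {

    // Assumption: separator copied from the .meta snippet in this thread.
    static final String SEPARATOR = "^|~";

    /** Returns the individual values stored under one field name. */
    static List<String> values(String metaText, String field) throws IOException {
        Properties props = new Properties();
        // Properties.load also unescapes "\:" back to ":".
        props.load(new StringReader(metaText));
        String raw = props.getProperty(field, "");
        if (raw.isEmpty()) {
            return Collections.emptyList();
        }
        // Pattern.quote is needed because ^ and | are regex metacharacters.
        return Arrays.asList(raw.split(Pattern.quote(SEPARATOR)));
    }

    public static void main(String[] args) throws IOException {
        String meta = "field1=Sponsor\\:^|~Information provided by\\:"
                + "^|~ClinicalTrials.gov Identifier\\:";
        for (String value : values(meta, "field1")) {
            System.out.println(value);
        }
    }
}
```

Inside a Committer or tagger the values already arrive as separate items, so splitting like this should only be needed when reading the stored files directly.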

Is your goal to create fields with their name matching the value of info-title and their value matching the next info-text (a different field for each pair)? I am afraid the DOMTagger can't do that right now.

What you can do is use or create a specific Committer suited to what you want to do with the data. The values would come as arrays and you could rely on the position of each item to match fields and values.
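The positional matching described above can be sketched in plain Java. This only illustrates the pairing logic a custom Committer or tagger could apply once it has both value lists. `PairByPosition` and `pair` are hypothetical names, and the sample values come from the HTML snippet in the question.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairByPosition {

    /**
     * Pairs the i-th title with the i-th text. Stops at the shorter list,
     * so an unmatched trailing title or text is silently dropped.
     */
    static Map<String, String> pair(List<String> titles, List<String> texts) {
        Map<String, String> fields = new LinkedHashMap<>();
        int count = Math.min(titles.size(), texts.size());
        for (int i = 0; i < count; i++) {
            // Trim whitespace and drop the trailing colon to get a field name.
            String name = titles.get(i).trim().replaceAll(":$", "");
            fields.put(name, texts.get(i).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
                "Sponsor:",
                "Information provided by (Responsible Party):");
        List<String> texts = Arrays.asList(
                " Rockefeller University ",
                "Dana Orange, Rockefeller University");
        pair(titles, texts).forEach(
                (name, value) -> System.out.println(name + " = " + value));
    }
}
```

Pairing by position is only reliable when both selectors match the same number of elements in the same order. The .meta snippet above shows three titles but only two values, so a real implementation would need to detect and handle that mismatch.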

Since you are already using the HTTP Collector with Java it may be easier to write your own ICommitter or your own IDocumentTagger to extract values and add fields exactly how you want them.

If you have to do it through configuration, you can look at using the ScriptTagger for more flexibility.

We can also turn this ticket into a feature request if you want to have the DOMTagger (or new tagger) handle cases like yours.

Alternatively, if the info-title values are always the same on each page, you can use something like TextPatternTagger with multiple patterns, one for each type of pair you want, hardcoding the target field names.
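As a rough illustration of the "one pattern per known title" idea, the sketch below uses plain java.util.regex rather than TextPatternTagger itself, so it shows what such patterns could look like, not how the tagger is configured. The target field names `sponsor` and `responsibleParty` and the class `FixedTitlePatterns` are made up for the example. The patterns assume markup exactly like the snippet in the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedTitlePatterns {

    // Hardcoded target field name -> pattern capturing the text after a known title.
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("sponsor", after("Sponsor:"));
        PATTERNS.put("responsibleParty",
                after("Information provided by (Responsible Party):"));
    }

    /** Builds a pattern whose group 1 is the info-text right after the given title. */
    static Pattern after(String title) {
        return Pattern.compile(
                "<div class=\"info-title\">" + Pattern.quote(title) + "</div>\\s*"
                + "<div class=\"info-text\"[^>]*>\\s*(.*?)\\s*</div>",
                Pattern.DOTALL);
    }

    /** Returns one entry per pattern that matched; first match only. */
    static Map<String, String> extract(String html) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(html);
            if (matcher.find()) {
                fields.put(entry.getKey(), matcher.group(1));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String html =
                "<div class=\"info-title\">Sponsor:</div>\n"
                + "<div class=\"info-text\" id=\"sponsor\"> Rockefeller University </div>\n"
                + "<div class=\"info-title\">Information provided by (Responsible Party):</div>\n"
                + "<div class=\"info-text\">Dana Orange, Rockefeller University</div>\n";
        extract(html).forEach(
                (name, value) -> System.out.println(name + " = " + value));
    }
}
```

Regular expressions over HTML are brittle. This only holds up while the page layout stays as regular as the snippet, which is the same condition stated above for this approach.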

Make sense?

akshaybijawe commented 7 years ago

Hi Pascal, thank you for the detailed explanation. Yes, I would like to create fields with their name matching the value of info-title and their value matching info-text. Apart from these, there are other fields on that page (some in table tags, some in divs, etc.) which I would also like to extract. Also, I see that the .cntnt file for each crawled page includes all of the content from the page, including the content of those tags. I don't know whether it would make more sense to just parse that .cntnt file or to write my own ICommitter or IDocumentTagger as you mentioned above. It would be great if this could be turned into a feature. In the meantime, I will explore the options that you suggested. Thank you again.

essiembre commented 7 years ago

Marking this as a feature request to be able to extract both field names and values from DOM and/or patterns.

akshaybijawe commented 7 years ago

Thank you, Pascal.

essiembre commented 7 years ago

Since this feature request belongs to the Importer module, I am closing this in favor of one I created there: https://github.com/Norconex/importer/issues/52