Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

Import only certain text from HTML file #87

Closed HappyCustomers closed 5 years ago

HappyCustomers commented 5 years ago

Hi,

I want to import only certain data from the web page I am crawling. This data sits inside the body tag of the HTML page:

   <span class="name"> The ACME Business </span>
   <span class="city">Bangalore</span>
   <span class="url address">www.acme.com</span>

I want to extract the above values into their respective fields in the database.

I tried the following configuration in preParseHandlers, but it is not working:

    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector=""url address" toField="url" />
      <dom selector="span.name"  toField="name" overwrite="false" matchBlanks="false" />
    </tagger>                             

Which one should I use, DOMTagger or TextPatternTagger?
Can you please provide a configuration example for the above?

Thank You

essiembre commented 5 years ago

You have an XML syntax error in the sample you posted (an extra double-quote on the first selector). This one works for me:

<dom selector="span.name"  toField="name"/>

For the URL one, you can use only one of the classes if you like:

<dom selector="span.url"  toField="url"/>

If you have to match the two classes, this is a syntax that works for me:

<dom selector="span[class='url address']"  toField="url"/>

Refer to the JSoup documentation for the selector syntax options.
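
Putting the pieces above together, a complete tagger could look roughly like the sketch below. It assumes the handler stays under preParseHandlers as in your original post. The `city` entry is an assumption made by analogy with the other two selectors, since it was not part of your configuration.

```xml
<preParseHandlers>
  <!-- DOMTagger works on the raw HTML, so it is declared before parsing. -->
  <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
    <dom selector="span.name" toField="name" />
    <!-- Assumed by analogy with the other selectors; not in the original config. -->
    <dom selector="span.city" toField="city" />
    <!-- Matches only when the class attribute is exactly "url address". -->
    <dom selector="span[class='url address']" toField="url" />
  </tagger>
</preParseHandlers>
```

Note that the attribute form is an exact string match on the class attribute. If the order of the classes may vary, the standard CSS form `span.url.address` should match both classes in any order.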

HappyCustomers commented 5 years ago

Thanks for the quick response. Actually, I had missed adding the fields in postParseHandlers.

I have a few more issues where certain data is not getting extracted. I will try to resolve them on my own; otherwise I will send you an email with the config document. Thanks once again.

HappyCustomers commented 5 years ago

Dear Mr. Pascal,

I have sent you the config file by email for your review. One of the fields is not getting extracted. Can you please help?

Thank you

essiembre commented 5 years ago

From the DOM selector and the two URLs you provided by email, I can tell the field you want is not extracted simply because it is not on the page. If you view the source for the page, you will not find it.

It seems that the field is dynamically generated using JavaScript. The HTTP Collector does not have a built-in JavaScript rendering engine. To crawl JavaScript-generated content, you can use an external installation of PhantomJS. Have a look at PhantomJSDocumentFetcher.

HappyCustomers commented 5 years ago

Thanks for the solution. Is there any sample config to extract dynamic content using PhantomJS?

essiembre commented 5 years ago

There is one in the PhantomJSDocumentFetcher documentation linked above. With it, you will get the rendered content. Then you can use the rest of the Collector/Importer features like you normally would.

HappyCustomers commented 5 years ago

I am closing this, as I am able to extract static content from web pages using DOMTagger. For dynamic content, I am trying PhantomJS.