Closed HappyCustomers closed 5 years ago
You have an XML syntax error in the sample you posted (an extra double-quote on the first selector). This one works for me:
<dom selector="span.name" toField="name"/>
For the URL one, you can match on just one of the classes if you like:
<dom selector="span.url" toField="url"/>
If you have to match the two classes, this is a syntax that works for me:
<dom selector="span[class='url address']" toField="url"/>
Refer to the JSoup documentation for the full selector syntax.
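To show where the selectors above sit in a full configuration, here is a minimal sketch. It assumes the Norconex Importer 2.x layout, where DOMTagger is declared as a `tagger` under `preParseHandlers`; the wrapper elements and the fully qualified class name are given from memory of that version and may differ in yours. The `span.url.address` selector is an alternative not used earlier in this thread: it matches a span carrying both classes in any order, whereas `span[class='url address']` only matches when the class attribute is exactly that string.

```xml
<!-- Sketch only: wrapper elements and class name assume Norconex Importer 2.x. -->
<importer>
  <preParseHandlers>
    <!-- DOMTagger needs the raw HTML, so it is declared before parsing. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
      <dom selector="span.name" toField="name"/>
      <!-- Matches a span that has both the "url" and "address" classes, in any order. -->
      <dom selector="span.url.address" toField="url"/>
    </tagger>
  </preParseHandlers>
</importer>
```

If a field still comes back empty, checking the selector against the raw page source (not the browser's rendered DOM) is usually the quickest way to tell a selector problem from missing markup.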
Thanks for the quick response. Actually, I had missed adding the fields in postParseHandlers.
I have a few more issues where certain data is not getting extracted. I will try to resolve them on my own; otherwise I will send you an email with the config document. Thanks once again.
Dear Mr. Pascal,
I have sent you the config file by email for your review. One of the fields is not getting extracted. Can you please help?
Thank you
From the DOM selector and the two URLs you provided by email, I can tell the field you want is not extracted simply because it is not on the page. If you view the source of the page, you will not find it.
It seems the field is generated dynamically with JavaScript. The HTTP Collector does not have a built-in JavaScript rendering engine. To crawl JavaScript-generated content, you can use an external installation of PhantomJS. Have a look at PhantomJSDocumentFetcher.
Thanks for the solution. Is there a sample config for extracting dynamic content using PhantomJS?
There is one in the provided link to PhantomJSDocumentFetcher documentation. With it, you will get the rendered content. Then you can use the rest of the Collector/Importer features like you normally would.
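For orientation, a rough sketch of what such a fetcher declaration might look like in the crawler configuration follows. It assumes HTTP Collector 2.x; the `exePath` and `renderWaitTime` element names are given from memory of the PhantomJSDocumentFetcher documentation and should be verified there, and the executable path and wait time are hypothetical placeholders.

```xml
<!-- Sketch only: element names assume HTTP Collector 2.x; verify against the class documentation. -->
<documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
  <!-- Hypothetical path to an externally installed PhantomJS executable. -->
  <exePath>/path/to/phantomjs</exePath>
  <!-- Assumed value: milliseconds to let page scripts run before the content is captured. -->
  <renderWaitTime>5000</renderWaitTime>
</documentFetcher>
```

The DOMTagger entries themselves should not need to change: they would simply run against the rendered HTML instead of the raw page source.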
I am closing this as I am able to extract static content from web pages using DOMTagger. For dynamic content, I am trying PhantomJS.
Hi,
I want to import only certain data from the web page I am crawling. This data sits inside the body tag of the HTML page.
I want to extract the above values into their respective fields in the database.
I tried the following configuration in preParseHandlers and it is not working.
Which one should I use, DOMTagger or TextPatternTagger?
Could you please provide a configuration example for the above?
Thank You