svanschalkwyk closed this issue 4 years ago
Other than suggesting you try parser="xml" for both taggers, I cannot tell just by looking at this snippet. Can you share your full config with a sample page to reproduce the issue?
Does this help? https://gist.github.com/svanschalkwyk/5fe70144e486ea5f0bdaaf4599863dc2
These don't work:
<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html" >
<dom toField="p_serving_size" selector="[class^=NutritionFacts__ssRowTwo___] h3" extract="html"/>
<dom toField="p_servings_per" selector="[class^=NutritionFacts__ssRowOne___]" extract="html"/>
<dom toField="p_calories" selector="span.NutritionFacts__calories___vQPc7" extract="html"/>
<dom toField="p_nutrition_table" selector="[class^=NutritionFacts__nutrientTable___]" extract="html"/>
<dom toField="p_price_per" selector=".ProductPage__pricePerUnitOG___7fJIU" extract="html" />
</tagger>
Those fields are not appearing in the html - I need to investigate further. Thanks.
There is something I don't understand: Chrome's Inspect tool shows different HTML from the HTML files the connector downloads. The classes in the downloaded HTML are vastly different. What am I missing?
Could it be that the site you are crawling is JavaScript-driven, i.e., that much of the content is pulled in via Ajax calls? If so, you may need a different crawling strategy: either crawl the API that the JavaScript is calling, or use a browser-based approach. Version 2.x of the HTTP Collector relies on PhantomJS as its JavaScript interpreter; see PhantomJSDocumentFetcher. Unfortunately, PhantomJS is no longer maintained and has trouble with the latest JavaScript specs, so your mileage may vary. The upcoming version 3 supports popular browsers, but it is not out yet.
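One quick way to confirm this is to check whether the class prefixes used in the DOMTagger selectors appear at all in the HTML the crawler saved, as opposed to what the browser shows after scripts have run. The following is only a rough sketch using the Python standard library; it is not part of Norconex, and the function name, class name, and sample HTML strings are made up for illustration. It mirrors the `[class^=...]` selector, which tests the start of the whole `class` attribute value rather than each individual class name.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect the raw value of every class attribute in an HTML document."""

    def __init__(self):
        super().__init__()
        self.class_values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.class_values.append(value)


def find_class_prefixes(html, prefixes):
    """Map each prefix to True if some class attribute value starts with it.

    This mirrors the CSS selector [class^=prefix], which looks at the start
    of the whole attribute value, not at each space-separated class name.
    """
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return {
        prefix: any(value.startswith(prefix) for value in collector.class_values)
        for prefix in prefixes
    }


if __name__ == "__main__":
    # Hypothetical stand-ins: a page as saved by the crawler (no JavaScript
    # executed) and the same page as copied from the browser's inspector.
    raw_html = '<div id="root"></div><script src="app.js"></script>'
    rendered_html = (
        '<div class="NutritionFacts__ssRowTwo___abc12"><h3>1 cup</h3></div>'
    )
    prefixes = ["NutritionFacts__ssRowTwo___", "NutritionFacts__ssRowOne___"]
    print("downloaded:", find_class_prefixes(raw_html, prefixes))
    print("rendered:  ", find_class_prefixes(rendered_html, prefixes))
```

If a prefix is missing from the downloaded file but present in the HTML copied from the browser, the element is most likely created by JavaScript after page load, and no selector will match it in the raw download.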
Please create a new ticket for new questions not related to this one (DOMTagger).
Hi Pascal, I have been using PhantomJS. I didn't realise it already had issues. Thanks!
Hi Pascal. For some reason I cannot get XML parsing to work (see below), and only some of my html tags are working. What am I doing wrong?