Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Questions about DOMTagger - XPath & html #683

Closed: svanschalkwyk closed this issue 4 years ago

svanschalkwyk commented 4 years ago

Hi Pascal. For some reason I cannot get XML parsing to work (see below), and only some of my html tags are working. What am I doing wrong?

    <preParseHandlers>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
        <dom selector="/html/body/div[1]/div[1]/div[2]/div/div/section/section/div/div[3]/span/div/div[1]/ul/li[2]/div/div[2]/div[4]" toField="p_test" />
    </tagger>
    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html" >
        <dom selector=".productTile__itemTitle___3B03g div" toField="p_name" />
        <dom selector="[data-automation-id=ingredients] div" toField="p_ingredients" />
        <dom selector=".productTile__imageLinkOld___2SLBT" toField="p_similar_products" />
        <dom selector="img.prod-HeroImageCarousel-image" toField="p_images" />
        <dom selector="NutritionFacts__nutritionFacts___18C6B" toField="p_nutrition_table" />
    </tagger>
    </preParseHandlers>

essiembre commented 4 years ago

Other than suggesting you try parser="xml" for both taggers, I cannot tell just by looking at this snippet. Can you share your full config along with a sample page to reproduce the issue?
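
For illustration, a minimal sketch of that suggestion: the second tagger from the question with only the parser attribute switched to "xml". The selectors are a subset of those already posted; whether they match anything depends on the markup the crawler actually downloads.

    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="xml">
        <!-- Selectors copied from the question; only parser changed from "html" to "xml" -->
        <dom selector=".productTile__itemTitle___3B03g div" toField="p_name" />
        <dom selector="[data-automation-id=ingredients] div" toField="p_ingredients" />
        <dom selector="img.prod-HeroImageCarousel-image" toField="p_images" />
    </tagger>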

svanschalkwyk commented 4 years ago

Does this help? https://gist.github.com/svanschalkwyk/5fe70144e486ea5f0bdaaf4599863dc2

These don't work:

    <tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger"  parser="html" >
        <dom toField="p_serving_size" selector="[class^=NutritionFacts__ssRowTwo___] h3" extract="html"/>
        <dom toField="p_servings_per" selector="[class^=NutritionFacts__ssRowOne___]" extract="html"/>
        <dom toField="p_calories" selector="span.NutritionFacts__calories___vQPc7" extract="html"/>
        <dom toField="p_nutrition_table" selector="[class^=NutritionFacts__nutrientTable___]" extract="html"/>
        <dom toField="p_price_per" selector=".ProductPage__pricePerUnitOG___7fJIU" extract="html" />
    </tagger>

svanschalkwyk commented 4 years ago

Those fields are not appearing in the html - I need to investigate further. Thanks.

svanschalkwyk commented 4 years ago

There is something I don't understand: Chrome's Inspect tool shows different HTML from the HTML files the connector downloaded. The class names in the downloaded HTML are vastly different. What am I missing?

essiembre commented 4 years ago

Could it be that the site you crawl is JavaScript-driven, i.e., much of the content is pulled in via Ajax calls? If so, you may need a different crawling strategy: either crawl the API that the JavaScript is calling, or use a browser-based approach. Version 2.x of the HTTP Collector relies on PhantomJS as its JavaScript interpreter; see PhantomJSDocumentFetcher. Unfortunately, PhantomJS is no longer maintained and has trouble with the latest JavaScript specs, so your mileage may vary. The upcoming version 3 supports popular browsers, but it is not out yet.
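
As a rough sketch of the browser-based option on 2.x, the fetcher is declared in the crawler configuration. The executable path and wait time below are placeholder assumptions, and the element names should be checked against the PhantomJSDocumentFetcher documentation for your 2.x release before relying on them.

    <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
        <!-- Placeholder: path to your local PhantomJS executable -->
        <exePath>/path/to/phantomjs</exePath>
        <!-- Assumed value: milliseconds to wait for JavaScript to finish rendering -->
        <renderWaitTime>5000</renderWaitTime>
    </documentFetcher>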

Please create a new ticket for new questions not related to this one (DOMTagger).

svanschalkwyk commented 4 years ago

Hi Pascal, I have been using PhantomJS. I didn't realise it already had issues. Thanks!