Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

DOMTagger not selecting CSS as per Chrome Dev Tools #684

Closed svanschalkwyk closed 4 years ago

svanschalkwyk commented 4 years ago

This configuration should ideally return "Fresh Raspberries...", but I cannot get it to extract anything.

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger" parser="html" >
    <dom toField="p_name" selector="div.productTile__details___3lfva > a > div" extract="attr(aria-label)" defaultValue="NAME"/>
</tagger>

I've also tried the selector ".productTile__details___3lfva a div" with extract="html", among other variations.

What am I doing wrong? Some selectors seem to work, while others do not.

<div class="productTile__details___3lfva" data-automation-id="productTileDetailsOld">
<a aria-hidden="false" class="productTile__itemTitle___3B03g" data-automation-id="link" href="/ip/Fresh-Raspberries-12-oz/44390957">
<div aria-label="Fresh Raspberries, 12 oz" data-automation-id="name" name="Fresh Raspberries, 12 oz">Fresh Raspberries, 12 oz</div></a>

essiembre commented 4 years ago

Isn't this a duplicate of #683?

It works when I try. I get:

p_name = Fresh Raspberries, 12 oz

Please share your full config and a URL to reproduce.
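For readers who want to confirm the selector logic outside the crawler, here is a minimal Python sketch, using only the standard library, that emulates the child chain `div.productTile__details___3lfva > a > div` against the snippet posted above and reads the `aria-label` attribute. The names `ChildChainMatcher` and `extract_labels` are hypothetical helpers written for this illustration; this is not how DOMTagger is implemented, and it only handles simple markup without void elements.

```python
from html.parser import HTMLParser

# HTML snippet as posted in the issue (the outer <div> is left unclosed there too).
SNIPPET = '''<div class="productTile__details___3lfva" data-automation-id="productTileDetailsOld">
<a aria-hidden="false" class="productTile__itemTitle___3B03g" data-automation-id="link" href="/ip/Fresh-Raspberries-12-oz/44390957">
<div aria-label="Fresh Raspberries, 12 oz" data-automation-id="name" name="Fresh Raspberries, 12 oz">Fresh Raspberries, 12 oz</div></a>'''


class ChildChainMatcher(HTMLParser):
    """Hypothetical helper: matches div.productTile__details___3lfva > a > div."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements as (tag, attrs) pairs
        self.labels = []  # aria-label values of matching <div> elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and len(self.stack) >= 2:
            parent, grandparent = self.stack[-1], self.stack[-2]
            gp_classes = (grandparent[1].get("class") or "").split()
            if (parent[0] == "a" and grandparent[0] == "div"
                    and "productTile__details___3lfva" in gp_classes):
                self.labels.append(attrs.get("aria-label"))
        self.stack.append((tag, attrs))

    def handle_endtag(self, tag):
        # Pop only when the closing tag matches the innermost open element.
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()


def extract_labels(html):
    matcher = ChildChainMatcher()
    matcher.feed(html)
    matcher.close()
    return matcher.labels


print(extract_labels(SNIPPET))  # ['Fresh Raspberries, 12 oz']
```

If a check like this succeeds on the snippet but the crawler still returns the default value, the markup the crawler actually downloaded probably differs from what the browser shows.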

svanschalkwyk commented 4 years ago

My HTTP crawler is exhibiting strange behaviour: tags that were extracted a short while ago fail to extract when new extractions are added below them. Could you please take a look at what I have? https://gist.github.com/svanschalkwyk/6ae9dfe8db317495078e2b573efaa78a

svanschalkwyk commented 4 years ago

Pascal, how can I extract the $1.98 price from the snippet below? My selector is not working.

<dom toField="p_price" selector=".Price__salePriceItemDetail___3zgGh" extract="outerHtml" />

out of "class="Price__salePrice___3YEJa Price__salePriceItemDetail___3zgGh" data-automation-id="salePrice">$1.98</span>"

I've tried span.Price__salePrice___3YEJa.Price__salePriceItemDetail___3zgGh and .Price__salePrice___3YEJa.Price__salePriceItemDetail___3zgGh and .Price__salePriceItemDetail___3zgGh and [class=... and [class^=Price__salePriceItemDetail___3zg ...

from

<div class="ProductPage__priceContainer___3WJNU">
<div class="ProductPage__groceryPriceOGContainer___iqCCL">
<span class="Price__containerPrice___qhIUd">
<span class="screenReaderOnlyText">1 dollar and 98 cents</span>
<span aria-hidden="true" class="Price__salePrice___3YEJa Price__salePriceItemDetail___3zgGh" data-automation-id="salePrice">$1.98</span>
</span>
<sub class="displayConditionPrice"></sub>
</div>
<div class="ProductPage__priceRightInfoNoOldPrice___3frai" data-automation-id="weighted-display-condition">each</div>
<div class="ProductPage__priceRightInfoNoOldPrice___3frai">
<span class="screenReaderOnlyText"></span><span class="screenReaderOnlyText">unit price is 1 dollar and 98 cents per pound</span>
<span aria-hidden="true" data-automation-id="price-per-unit" class="ProductPage__pricePerUnitOG___7fJIU">$1.98/LB</span><span aria-hidden="true" class="ProductPage__hideOnDOM___2P0GJ"></span></div></div>

essiembre commented 4 years ago

Could this be related to https://github.com/Norconex/collector-http/issues/683#issuecomment-617951754?

Judging from the Gist you attached, the downloaded HTML page does not seem to contain that content, so it looks like it is being loaded dynamically via JavaScript.
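One way to test this is to look for the class names seen in Chrome Dev Tools in the raw HTML that the crawler downloaded. The Python sketch below is a rough, standard-library-only illustration; `ClassCollector` and `missing_classes` are hypothetical names, and the two sample strings are made up for the example rather than taken from the real page.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Hypothetical helper: gathers every class token found in a document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_classes(raw_html, wanted):
    """Return the class names from `wanted` that never occur in `raw_html`."""
    collector = ClassCollector()
    collector.feed(raw_html)
    collector.close()
    return [name for name in wanted if name not in collector.classes]


# Made-up samples: markup as rendered in the browser vs. a bare page shell
# whose content would only be filled in later by JavaScript.
rendered = '<span class="Price__salePrice___3YEJa Price__salePriceItemDetail___3zgGh">$1.98</span>'
downloaded = '<div id="root"></div><script src="app.js"></script>'

wanted = ["Price__salePriceItemDetail___3zgGh"]
print(missing_classes(rendered, wanted))    # []
print(missing_classes(downloaded, wanted))  # ['Price__salePriceItemDetail___3zgGh']
```

A non-empty result for the downloaded file suggests the element is not present in the fetched source, so no selector can match it there.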

svanschalkwyk commented 4 years ago

It may be related to PhantomJS. I changed the selectors to use the class names found in the downloaded files, and now most of them work. Some content that appears in the Chrome debugger does not appear in the downloaded file.
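As a side note on the attribute selectors tried earlier in this thread: in CSS, `[class^=value]` tests the start of the whole `class` attribute string, not the start of each class name, while `[class*=value]` tests for a substring anywhere in it. The small Python sketch below mimics those two rules on the price element's class attribute; `starts_with` and `contains` are hypothetical helper names used only for this illustration, and whether a substring selector would have helped here still depends on the element being present in the downloaded HTML.

```python
def starts_with(attr_value, needle):
    """Mimics [attr^=needle]: the whole attribute value must begin with needle."""
    return attr_value.startswith(needle)


def contains(attr_value, needle):
    """Mimics [attr*=needle]: needle may occur anywhere in the attribute value."""
    return needle in attr_value


# Class attribute of the price <span> quoted earlier in this thread.
class_value = "Price__salePrice___3YEJa Price__salePriceItemDetail___3zgGh"

print(starts_with(class_value, "Price__salePriceItemDetail"))  # False
print(contains(class_value, "Price__salePriceItemDetail"))     # True
```

Because the attribute value begins with the other class name, the `^=` form cannot match this element, whereas the `*=` form can.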