AndraIonescu closed this issue 8 years ago
Hi, would you please send your extractors.xml file? Also, did you check your log files? There might be some config problems.
This is my extractors.xml. It is very strange that it doesn't work on that website.
<config xmlns="http://bayan.ir" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd">
    <fields>
        <field name="image"/>
    </fields>
    <documents>
        <document url="." engine="css">
            <extract-to field="image">
                <concat delimiter="">
                    <constant value="http://www.bbraun.de"/>
                    <attribute name="src">
                        <expr value="html body#product div#wrapper div#container div#main div#main-content div#main-content-text div#main-content-intro p" />
                    </attribute>
                </concat>
            </extract-to>
        </document>
    </documents>
</config>
On a second look, the extractor does work, because I can see the constant added in Solr, but the CSS selector doesn't work and I do not understand why. Can you take a look, please?
I can see that the website doesn't have a robots.txt. How come Nutch's default plugins work and this one doesn't? I really need it to work because I do not know another way to parse and index the image URL.
Thank you very much, Andra.
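One thing that may be worth checking in the configuration above: the selector ends at a `p` element, while a `src` attribute normally lives on an `img` element, so the `attribute` step may have nothing to read. The fragment below is only a sketch of that idea. It reuses the same elements as the posted config and assumes, without having verified the page markup, that the image is an `img` inside `div#main-content-intro`; the shortened selector is likewise an assumption.

```xml
<!-- Sketch only: assumes the target image is an <img> inside div#main-content-intro -->
<extract-to field="image">
    <concat delimiter="">
        <constant value="http://www.bbraun.de"/>
        <attribute name="src">
            <!-- Select the img itself, so that the src attribute exists on the matched element -->
            <expr value="div#main-content-intro img" />
        </attribute>
    </concat>
</extract-to>
```

If the field still contains only the constant, the image may be referenced in some other way (for example a CSS background), in which case a selector on `img` would not match either.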
Hello,
I have this website: http://www.medtronic.com/for-healthcare-professionals/products-therapies/cardiovascular/coronary-stents/integrity-coronary-stent/index.htm and I want to index the image URL, but the extraction doesn't seem to work.
Can you take a look, please? I really need that url.
Thank you, Andra.