BayanGroup / nutch-custom-search


Plugin doesn't work #30

Closed AndraIonescu closed 8 years ago

AndraIonescu commented 9 years ago

Hello,

I have this website: http://www.medtronic.com/for-healthcare-professionals/products-therapies/cardiovascular/coronary-stents/integrity-coronary-stent/index.htm and I want to index the image URL, but it doesn't seem to work.

Can you take a look, please? I really need that url.

Thank you, Andra.

tahagh commented 9 years ago

Hi, could you please send your extractor.xml file? Also, did you check your log files? There might be a configuration problem.

AndraIonescu commented 8 years ago

This is my extractors.xml. It is very strange that it doesn't work on that website.

<config xmlns="http://bayan.ir"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://bayan.ir http://raw.github.com/BayanGroup/nutch-custom-search/master/zal.extractor/src/main/resources/extractors.xsd">
  <fields>
    <field name="image"/>
  </fields>
  <documents>
    <document url="." engine="css">
      <extract-to field="image">
        <concat delimiter="">
          <constant value="http://www.bbraun.de"/>
          <attribute name="src">
            <expr value="html body#product div#wrapper div#container div#main div#main-content div#main-content-text div#main-content-intro p"/>
          </attribute>
        </concat>
      </extract-to>
    </document>
  </documents>
</config>

AndraIonescu commented 8 years ago

On a second look, the extractor does work, because I can see the constant added in Solr. The CSS selector, however, doesn't match anything, and I don't understand why. Can you take a look, please?

I can see that the website doesn't have a robots.txt. How come Nutch's default plugins work and this one doesn't? I really need it to work, because I don't know another way to parse and index the image URL.

Thank you very much, Andra.
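
One thing that may be worth checking, offered as a sketch rather than a confirmed diagnosis: the selector in the extractors.xml above ends at `p`, so the `src` attribute is read from a paragraph element, which normally has no `src`. Assuming the image is an `img` element inside that paragraph (this has not been verified against the page), and assuming the `id`s in the selector actually exist on the page being crawled, the `attribute` element could target the image directly. Only elements already used in the config above appear here; the shortened selector is hypothetical.

```xml
<!-- Hypothetical variant of the extract-to rule from the config above.
     Assumption: the image is an <img> inside the intro paragraph. -->
<extract-to field="image">
  <concat delimiter="">
    <constant value="http://www.bbraun.de"/>
    <attribute name="src">
      <!-- Selector ends at img, so src is read from the image element -->
      <expr value="div#main-content-intro p img"/>
    </attribute>
  </concat>
</extract-to>
```

If only the constant still reaches Solr with this variant, the selector is probably not matching the page's markup at all, and comparing it against the page source would be the next step.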