Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

DOMSplitter #591

Closed niozasg closed 4 years ago

niozasg commented 5 years ago

Using the DOMSplitter, I extract <img> tags as new documents:

<splitter class="com.norconex.importer.handler.splitter.impl.DOMSplitter"
          selector="img"
          sourceCharset="UTF-8"/>

and then I capture the attributes:

<tagger class="com.norconex.importer.handler.tagger.impl.DOMTagger">
  <restrictTo field="document.contentType">text/x-php</restrictTo>
  <restrictTo field="document.contentType">text/plain</restrictTo>
  <dom selector="img" toField="alt" extract="attr(alt)"/>
  <dom selector="img" toField="src" extract="attr(src)"/>
  <dom selector="img" toField="src" extract="attr(srcset)"/>
  <dom selector="img" toField="class" extract="attr(class)"/>
</tagger>

So documents are created successfully from the img tags, with alt, src, and other fields. For some reason, when committing to Elasticsearch, Norconex also creates documents with IDs like these, which are totally useless:

http://junior.reporter.pl/?s=2!html > body > table > tbody > tr > td > table:nth-child(3) > tbody > tr > td:nth-child(2) > table:nth-child(5) > tbody > tr > td > div > div > img

http://junior.reporter.pl/?s=2!html > body > table > tbody > tr > td > table:nth-child(3) > tbody > tr > td:nth-child(2) > table:nth-child(6) > tbody > tr > td > div > div > img

This happens when crawling the http://junior.reporter.pl domain, for example. How can I solve this problem? Thank you.

essiembre commented 5 years ago

When using the DOMSplitter, each new document needs a unique ID so the crawler can tell whether it has changed or needs to be deleted on a subsequent run. Because there can be many elements with the same name (e.g. "img"), the only way to uniquely identify each one is to store its full, unique DOM path.
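To make the ID scheme concrete, here is a minimal Python sketch (not Norconex code). It assumes, based only on the IDs quoted in this thread, that a split document's reference is the parent reference, a `!` separator, and then the DOM path. The function name `split_reference` is hypothetical.

```python
def split_reference(reference: str) -> tuple:
    """Split 'parent!dom-path' into (parent_reference, dom_path).

    Assumption: the first '!' separates the parent URL from the DOM path,
    as in the IDs shown in this thread. If there is no '!', the DOM path
    is returned as an empty string.
    """
    parent, _, dom_path = reference.partition("!")
    return parent, dom_path


# Reference taken from this thread.
ref = "www.page.com/page/page.htm!#banner_content > img:nth-child(1)"
parent, dom_path = split_reference(ref)
print(parent)
print(dom_path)
```

Under this assumption, two pages embedding the same image still yield two different references, because both the parent URL and the DOM path are part of the ID.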

What is it you want to achieve? Maybe there are other ways.

niozasg commented 5 years ago

Actually, Norconex extracts the src value (image.png) from the tag. But if the same image is found on another page of the domain I am crawling, it creates the IDs above. Until now I have created image documents by parsing the tags and indexing them by their URL, e.g. www.reporter.com/images/image.png.

Elasticsearch replaces a document when the same ID is committed again, but I think Norconex creates new IDs like the ones above. For example, from this HTML:

<div id="banner_content">
  <img src="/Templates/images/9-10_images/logo.png" width="91" height="97" alt="TVA logo">
  <img src="/Templates/images/9-10_images/tvakids.png" width="250" height="97" alt="tvakids">
  <img src="/Templates/images/9-10_images/cartoon_kids.png" width="326" height="97" alt="cartoon kids">
</div>

/Templates/images/9-10_images/logo.png has already been crawled and indexed from a different page of the website, but it is then indexed again as:

www.page.com/page/page.htm!#banner_content > img:nth-child(1)

How can I solve that?

essiembre commented 5 years ago

Ah, I see. If you are willing to risk deletions not being picked up because you change the unique reference created by the DOM splitter, you can change the sourceReferenceField in your Elasticsearch Committer configuration to choose a field to use as the Elasticsearch ID. That will prevent duplicates. Could that work for you?

niozasg commented 5 years ago

I already use sourceReferenceField for other tasks, but I do not think it is applicable in this case. If a garbage document like that is created, it still has no field that would be useful to commit it under, as the original document is already in the index.

essiembre commented 5 years ago

Is it the image or the HTML snippet you want to make unique? If the image, you can extract the "src" value from the HTML snippet using something like ReplaceTagger and make it your sourceReferenceField. But what if another page links to the same image, but with different attribute values? Which one will you keep?
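The idea of keying on the image rather than the snippet can be sketched in plain Python (this is an illustration of the logic, not Norconex or Elasticsearch code). The sketch assumes the ID is the src resolved against the page URL, and uses a dict to stand in for an index that overwrites on equal IDs. The helper `image_id`, the second page URL, and its alt text are hypothetical.

```python
from urllib.parse import urljoin


def image_id(page_url: str, src: str) -> str:
    """Resolve an <img> src against its page URL to get one stable ID."""
    return urljoin(page_url, src)


# Stand-in for the search index: ID -> document. Equal IDs overwrite.
index = {}

# (page URL, src, alt). The second entry is a made-up page reusing the image.
found = [
    ("http://www.page.com/page/page.htm",
     "/Templates/images/9-10_images/logo.png", "TVA logo"),
    ("http://www.page.com/other.htm",
     "/Templates/images/9-10_images/logo.png", "Logo"),
]

for page_url, src, alt in found:
    index[image_id(page_url, src)] = {"src": src, "alt": alt, "page": page_url}

print(len(index))
```

Both entries collapse into a single document, and the attributes of whichever page was processed last are the ones kept. That is the trade-off raised by the question about differing attribute values.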

If you already capture the images just fine and only want to get rid of the parent document with the weird ID, I suggest you filter it out after the split by applying a regular expression to a value that identifies those documents. For instance, you could filter out documents with a document.reference that has > img in it.
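As a rough illustration of that regular-expression test, the following Python sketch (not Norconex configuration) drops any reference containing `> img`. The pattern and the `keep` function are assumptions made for the example; the sample references come from this thread.

```python
import re

# Assumed pattern: '>' followed by optional whitespace and the word 'img'.
SPLIT_IMG = re.compile(r">\s*img\b")


def keep(reference: str) -> bool:
    """Return True when the reference does NOT contain '> img'."""
    return SPLIT_IMG.search(reference) is None


references = [
    "http://junior.reporter.pl/?s=2",
    "www.page.com/page/page.htm!#banner_content > img:nth-child(1)",
]

kept = [r for r in references if keep(r)]
print(kept)
```

The same expression would need to be adapted to whatever filter the importer configuration provides, and checked against real references before relying on it.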

Does that give you a valid option?