Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

sitemap.xml metadata #509

Closed cherlo closed 3 years ago

cherlo commented 6 years ago

Does the collector add metadata (title, keywords, etc.) that it finds in the sitemap.xml itself, or only the metadata it finds inside each document?

essiembre commented 6 years ago

Not sure I grasp your question. sitemap.xml files normally do not have title, keyword, and related fields. For every link it follows, it will index the target page with its metadata.

It will also add these fields to each document that comes from the sitemap:

collector.sitemap-lastmod
collector.sitemap-changefreq
collector.sitemap-priority

Does that answer?

cherlo commented 6 years ago

I think so. I'm assuming that it won't parse SEO elements in sitemap.xml like:

Xero Xero online accounting software.

If not, then we will have an issue with indexing web pages with embedded video links. The MP4 files carry their own metadata, so those links will have the title and description data. But the parent page is just a generic template that is passed a video id. So let's say the page hosting the MP4 is http://mysite.com/playVideo.html and it contains a link to http://video.3rdparty.com/36136564246.mp4. And the video has a title metadata element of "Xero". I want playVideo.html to come up in my search results, not the actual video. So, is there a way to tell Norconex to add the metadata that came from a video link to the page that links to it?

essiembre commented 6 years ago

If the info is in your sitemap.xml, one way to do it is to treat the sitemap as a regular page and extract its links using a custom ILinkExtractor. Your custom extractor can attach a title and description to each link it extracts, and those values will be associated with the target URLs.
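
To make the parsing half of that idea concrete, here is a rough sketch that uses only the JDK's XML parser. It assumes the sitemap uses Google's video extension namespace (`video:title`, `video:description`), which this thread does not confirm. `SitemapVideoLinks` and `ExtractedLink` are hypothetical names standing in for the collector's own link type; this is not the Norconex API. A real ILinkExtractor implementation would wrap similar logic and copy the values onto its links.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class SitemapVideoLinks {

    // Assumed namespaces: the standard sitemap one and Google's video extension.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    static final String VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1";

    /** Hypothetical stand-in for a link that carries a title and text. */
    static final class ExtractedLink {
        final String url;
        final String title; // null when the <url> entry has no video title
        final String text;  // null when the <url> entry has no video description

        ExtractedLink(String url, String title, String text) {
            this.url = url;
            this.title = title;
            this.text = text;
        }
    }

    /** Returns one link per <url> entry, keyed on the page URL in <loc>. */
    static List<ExtractedLink> extract(String sitemapXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Reject DOCTYPE declarations, since sitemaps come from remote servers.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

        List<ExtractedLink> links = new ArrayList<>();
        NodeList urls = doc.getElementsByTagNameNS(SITEMAP_NS, "url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            String loc = firstText(url, SITEMAP_NS, "loc");
            if (loc == null) {
                continue; // skip entries without a page URL
            }
            links.add(new ExtractedLink(loc,
                    firstText(url, VIDEO_NS, "title"),
                    firstText(url, VIDEO_NS, "description")));
        }
        return links;
    }

    // Trimmed text of the first matching descendant element, or null if absent.
    private static String firstText(Element parent, String ns, String name) {
        NodeList nodes = parent.getElementsByTagNameNS(ns, name);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample entry reusing the page URL and title from this thread.
        String xml = "<urlset xmlns=\"" + SITEMAP_NS + "\" xmlns:video=\"" + VIDEO_NS + "\">"
                + "<url><loc>http://mysite.com/playVideo.html</loc>"
                + "<video:video><video:title>Xero</video:title>"
                + "<video:description>Xero online accounting software.</video:description>"
                + "</video:video></url></urlset>";
        for (ExtractedLink link : extract(xml)) {
            System.out.println(link.url + " | " + link.title + " | " + link.text);
        }
    }
}
```

The sketch keys each link on the page URL from `<loc>` rather than on the video file, so the title and description end up attached to the hosting page, which is the page that should appear in search results.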

cherlo commented 6 years ago

Thanks. We will look at extending the code.

cherlo commented 6 years ago

If I implement an ILinkExtractor, would I use Link.setTitle() and setText() to inject the info into the link? Or do these properties get overwritten by the collector?

essiembre commented 6 years ago

The answer is... none of the above! :-)

They are added as new fields to the target documents that are processed. They are:

collector.referrer-reference
collector.referrer-link-tag
collector.referrer-link-text
collector.referrer-link-title

If you want those to take over some other fields you have, I suggest you use RenameTagger or CopyTagger in your Importer section as a post-parse handler.
