Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

I want to get the content Type (MIME type) for each crawled entry #1038

Closed HailunTan closed 1 month ago

HailunTan commented 3 months ago

As the title suggests, I checked the Norconex crawler. Each crawled entry seems to have only three data fields:

  1. URL reference
  2. meta-data
  3. content with the HTML stripped.

Could you please advise how I can get the content type of the crawled entry in Norconex crawler? Thanks.

essiembre commented 3 months ago

Have you specified a "Committer"? This is what's responsible for sending all crawled data to a location of your choice. While you familiarize yourself, you may want to start by storing crawled content on the local filesystem as a test, for example with the XMLFileCommitter. Find a full list of available committers here: https://opensource.norconex.com/committers/

You'll find that the generated output files contain the three items you listed (URL, metadata, and text-only content).

The content type may be present in a few different forms depending on the metadata available, but one that should always be there (unless you filtered it out intentionally) is document.contentType. You may also like document.contentFamily, which offers a simple categorization of documents based on their type.
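
To illustrate the lookup, here is a minimal sketch that assumes the metadata of one committed document has already been loaded into a plain Java map of field names to value lists. It does not use the Norconex API: the class name ContentTypeLookup, the helper firstValue, and the sample values are hypothetical. Only the two field names come from the reply above.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Norconex: reads the content type
// from metadata that was already loaded into a multi-valued map.
final class ContentTypeLookup {

    // Field names mentioned in the reply above.
    static final String CONTENT_TYPE = "document.contentType";
    static final String CONTENT_FAMILY = "document.contentFamily";

    // Returns the first value stored under the given field, if any.
    static Optional<String> firstValue(
            Map<String, List<String>> metadata, String field) {
        List<String> values = metadata.get(field);
        if (values == null || values.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(values.get(0));
    }

    public static void main(String[] args) {
        // Assumed sample metadata; the values are illustrative only.
        Map<String, List<String>> metadata = Map.of(
                CONTENT_TYPE, List.of("text/html"),
                CONTENT_FAMILY, List.of("html"));

        System.out.println(firstValue(metadata, CONTENT_TYPE).orElse("unknown"));
        System.out.println(firstValue(metadata, CONTENT_FAMILY).orElse("unknown"));
    }
}
```

Returning an Optional keeps the case where the field was filtered out explicit, so the caller decides what to fall back to.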

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.