Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0
183 stars, 67 forks

Is it possible to access 'document' in ICrawlerEventListener? #689

Closed: LeMoussel closed this issue 4 years ago

LeMoussel commented 4 years ago

Hi,

I wrote an ICrawlerEventListener, something like this:

@Override
public void crawlerEvent(ICrawler crawler, CrawlerEvent event) {
   if(CrawlerEvent.REJECTED_BAD_STATUS.equals(event.getEventType())){
      // Do some stuff to access `HttpDocument`
      // Is it possible ?
   }
}

Is it possible to access HttpDocument?

essiembre commented 4 years ago

The HttpDocument is not available to this listener for that event. What property are you trying to access? Maybe there is a way to get what you are after.

For a REJECTED_BAD_STATUS, the event.getCrawlData() method will return an instance of HttpCrawlData, which identifies the URL at fault and more. Also, the event.getSubject() method, in this case, will return an instance of HttpFetchResponse which has the failing HTTP status code. Could one of these methods get you what you are after?
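To illustrate the two accessors mentioned above, here is a minimal sketch of reading the URL and the HTTP status from a REJECTED_BAD_STATUS event. It does not depend on the Norconex jars: StubCrawlerEvent, StubCrawlData and StubFetchResponse are hypothetical stand-ins for CrawlerEvent, HttpCrawlData and HttpFetchResponse, and the getReference() and getStatusCode() accessors are assumptions that should be checked against the Javadoc of the version in use.

```java
public class BadStatusSketch {

    // Hypothetical stand-in for HttpCrawlData; getReference() is an assumed accessor.
    static final class StubCrawlData {
        private final String reference;
        StubCrawlData(String reference) { this.reference = reference; }
        String getReference() { return reference; }
    }

    // Hypothetical stand-in for HttpFetchResponse; getStatusCode() is an assumed accessor.
    static final class StubFetchResponse {
        private final int statusCode;
        StubFetchResponse(int statusCode) { this.statusCode = statusCode; }
        int getStatusCode() { return statusCode; }
    }

    // Hypothetical stand-in for CrawlerEvent with the three accessors named in this thread.
    static final class StubCrawlerEvent {
        static final String REJECTED_BAD_STATUS = "REJECTED_BAD_STATUS";
        private final String eventType;
        private final StubCrawlData crawlData;
        private final Object subject;
        StubCrawlerEvent(String eventType, StubCrawlData crawlData, Object subject) {
            this.eventType = eventType;
            this.crawlData = crawlData;
            this.subject = subject;
        }
        String getEventType() { return eventType; }
        StubCrawlData getCrawlData() { return crawlData; }
        Object getSubject() { return subject; }
    }

    // Returns a one-line summary for bad-status rejections, or null for any other event.
    static String describe(StubCrawlerEvent event) {
        if (!StubCrawlerEvent.REJECTED_BAD_STATUS.equals(event.getEventType())) {
            return null;
        }
        String url = event.getCrawlData().getReference();
        Object subject = event.getSubject();
        // The subject is loosely typed, so check its type before casting.
        if (subject instanceof StubFetchResponse) {
            int status = ((StubFetchResponse) subject).getStatusCode();
            return url + " rejected with HTTP status " + status;
        }
        return url + " rejected (no fetch response attached)";
    }

    public static void main(String[] args) {
        StubCrawlerEvent event = new StubCrawlerEvent(
                StubCrawlerEvent.REJECTED_BAD_STATUS,
                new StubCrawlData("https://example.com/missing"),
                new StubFetchResponse(404));
        System.out.println(describe(event));
    }
}
```

In a real listener the body of describe(...) would sit inside crawlerEvent(ICrawler, CrawlerEvent), with the casts going to the actual Norconex classes. The instanceof check is kept because the subject type depends on the event.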

LeMoussel commented 4 years ago

I want to access the metadata of the HttpDocument for all REJECTED_ events. With HttpCrawlData and HttpFetchResponse I can't access it.

essiembre commented 4 years ago

I am afraid that for the majority of rejected documents this is all you will get. Most of the document metadata is obtained when the document is parsed by the importer, and for that it has to be downloaded first. Most rejections happen because the document could not be downloaded, because parsing it failed, or because filters prevented it from being downloaded or processed. Without a document being downloaded (and properly parsed), there is no file to extract metadata from.
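Given that limitation, a listener covering every REJECTED_ event can only work with what the event itself carries. The sketch below tallies rejections by event type. It assumes that event types are plain strings sharing a REJECTED_ prefix, as the naming in this thread suggests. The names REJECTED_FILTER and DOCUMENT_FETCHED are used for illustration only, and RejectionTallySketch is a hypothetical helper, not part of Norconex.

```java
import java.util.Map;
import java.util.TreeMap;

public class RejectionTallySketch {

    // Sorted map so the printed order is stable.
    private final Map<String, Integer> countsByType = new TreeMap<>();

    // Intended to be called from crawlerEvent(...) with event.getEventType().
    // Assumption: rejection event types all start with "REJECTED_".
    void onEvent(String eventType) {
        if (eventType != null && eventType.startsWith("REJECTED_")) {
            countsByType.merge(eventType, 1, Integer::sum);
        }
    }

    Map<String, Integer> counts() {
        return countsByType;
    }

    public static void main(String[] args) {
        RejectionTallySketch tally = new RejectionTallySketch();
        tally.onEvent("REJECTED_BAD_STATUS");
        tally.onEvent("REJECTED_FILTER");      // illustrative event type
        tally.onEvent("REJECTED_BAD_STATUS");
        tally.onEvent("DOCUMENT_FETCHED");     // illustrative, ignored by the tally
        System.out.println(tally.counts());
        // prints {REJECTED_BAD_STATUS=2, REJECTED_FILTER=1}
    }
}
```

Any per-document detail beyond the type, the URL and the event subject would have to come from a later stage that runs only for documents that were actually downloaded and parsed.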