LeMoussel closed 4 years ago
The HttpDocument is not available to this listener for that event. What property are you trying to access? Maybe there is a way to get what you are after.
For a REJECTED_BAD_STATUS, the event.getCrawlData() method will return an instance of HttpCrawlData, which identifies the URL at fault and more. Also, the event.getSubject() method, in this case, will return an instance of HttpFetchResponse, which has the failing HTTP status code. Could one of these methods get you what you are after?
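To make the suggestion above concrete, here is a minimal, self-contained sketch of how a listener could pull the URL and status code out of a rejection event. The nested classes are hypothetical stand-ins that only borrow names from this thread so the example runs without the collector on the classpath; they are not the real Norconex classes. The getEventType(), getReference() and getStatusCode() accessors are assumptions made for illustration. Only getCrawlData() and getSubject() are named in the reply above.

```java
public class RejectedEventSketch {

    // Hypothetical stand-in: carries the URL of the rejected document.
    static class HttpCrawlData {
        private final String reference;
        HttpCrawlData(String reference) { this.reference = reference; }
        String getReference() { return reference; } // assumed accessor name
    }

    // Hypothetical stand-in: carries the failing HTTP status code.
    static class HttpFetchResponse {
        private final int statusCode;
        HttpFetchResponse(int statusCode) { this.statusCode = statusCode; }
        int getStatusCode() { return statusCode; } // assumed accessor name
    }

    // Hypothetical stand-in for the event handed to the listener.
    static class CrawlerEvent {
        private final String eventType;
        private final Object crawlData;
        private final Object subject;
        CrawlerEvent(String eventType, Object crawlData, Object subject) {
            this.eventType = eventType;
            this.crawlData = crawlData;
            this.subject = subject;
        }
        String getEventType() { return eventType; } // assumed accessor name
        Object getCrawlData() { return crawlData; }
        Object getSubject() { return subject; }
    }

    // Summarizes a REJECTED_ event; returns null for any other event type.
    static String describe(CrawlerEvent event) {
        if (!event.getEventType().startsWith("REJECTED_")) {
            return null;
        }
        StringBuilder summary = new StringBuilder(event.getEventType());
        // The crawl data identifies the URL at fault.
        if (event.getCrawlData() instanceof HttpCrawlData) {
            HttpCrawlData data = (HttpCrawlData) event.getCrawlData();
            summary.append(" url=").append(data.getReference());
        }
        // The subject is only a fetch response for some rejection types,
        // so check its type before casting.
        if (event.getSubject() instanceof HttpFetchResponse) {
            HttpFetchResponse response = (HttpFetchResponse) event.getSubject();
            summary.append(" status=").append(response.getStatusCode());
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        CrawlerEvent event = new CrawlerEvent(
                "REJECTED_BAD_STATUS",
                new HttpCrawlData("http://example.com/missing"),
                new HttpFetchResponse(404));
        System.out.println(describe(event));
    }
}
```

The instanceof checks matter because the subject type depends on the event, so a listener handling every REJECTED_ event should not cast blindly.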
I want to access the metadata of the HttpDocument for all REJECTED_ events. With HttpCrawlData and HttpFetchResponse I can't access it.
I am afraid that for the majority of rejected documents this is all you will get. Most of the document metadata is obtained when the document is parsed by the importer, and for that it has to be downloaded first. Most rejections happen because the document could not be downloaded, because parsing it failed, or because filters prevented it from being downloaded or processed. Without a document being downloaded (and properly parsed), there is no file to extract metadata from.
Hi,
I wrote an ICrawlerEventListener, something like this:
Is it possible to access HttpDocument?