Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Where do I find documentation for tagger fields? #424

Closed jerrywithaz closed 6 years ago

jerrywithaz commented 6 years ago

Where do I find information on the fields that are available for the tagger, i.e., the fields that would go here:

<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
     <fields>id,title,keywords,description,content,document.reference, document.contentType</fields>
</tagger>

I'm specifically looking for the status code of the webpage (200, 404, etc.) and the origin of the URL (so if the URL was http://www.example.com/page/welcome, I'd want http://www.example.com).

I can't find any information online, and I've been searching for hours.

essiembre commented 6 years ago

The fields obtained vary for each document/site. There are a few ways to find out what was extracted. You can use the FilesystemCommitter and have a look at the generated files. A better approach is probably to use the DebugTagger. You can add that tagger multiple times (before and after parsing, other taggers, transformers, etc.). It will print out what has been gathered so far.
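
If you go the FilesystemCommitter route, a minimal committer section looks roughly like this (the class name and <directory> option are as I recall them from Committer Core 2.x, and the directory value is just an example, so verify against your version):

<committer class="com.norconex.committer.core.impl.FileSystemCommitter">
    <!-- Directory where add/delete requests are written for inspection -->
    <directory>./committed-files</directory>
</committer>

You can then open the files it writes to disk to see what was gathered for each document.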

For the DebugTagger approach, you can add the following just before your KeepOnlyTagger to list all fields gathered so far at the INFO log level:

  <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="INFO"/>

Does that answer your question?

jerrywithaz commented 6 years ago

Hi essiembre, that doesn't exactly answer my question. I already have the DebugTagger on, and I am not seeing any information related to the status of the URL.

I am using AWS CloudSearch as my committer. What I am trying to do is add a "status" field that will contain the status of the page so that I can search only for pages that have a status of 200. I am doing this because AWS CloudSearch doesn't allow you to delete entries, and I don't want pages showing up in my search engine that aren't being used any more. So, if you have any other ideas on how I can achieve that, I am open to them!

This is the only information I see from the DebugTagger (I've omitted some information for privacy):

INFO  [DebugTagger] Content-Length=16191
INFO  [DebugTagger] collector.referrer-link-tag=a.href
INFO  [DebugTagger] collector.referrer-reference=...
INFO  [DebugTagger] Server=Microsoft-IIS/8.5
INFO  [DebugTagger] X-Powered-By=ASP.NET
INFO  [DebugTagger] Cache-Control=private
INFO  [DebugTagger] collector.referenced-urls=...
INFO  [DebugTagger] X-AspNet-Version=4.0.30319
INFO  [DebugTagger] collector.is-crawl-new=true
INFO  [DebugTagger] collector.referrer-link-text=Our Process
INFO  [DebugTagger] Date=Mon, 13 Nov 2017 16:29:17 GMT
INFO  [DebugTagger] Vary=Accept-Encoding
INFO  [DebugTagger] document.reference=...
INFO  [DebugTagger] collector.content-type=text/html
INFO  [DebugTagger] Content-Encoding=gzip
INFO  [DebugTagger] document.contentEncoding=utf-8
INFO  [DebugTagger] document.contentFamily=html
INFO  [DebugTagger] collector.content-encoding=utf-8
INFO  [DebugTagger] Content-Type=text/html; charset=utf-8
INFO  [DebugTagger] document.contentType=text/html
INFO  [DebugTagger] collector.depth=1
INFO  [DebugTagger] title=...
INFO  [DebugTagger] collector.referrer-link-text=Our Process
INFO  [DebugTagger] description=...
INFO  [DebugTagger] keywords=...
INFO  [DebugTagger] document.reference=...

Do you have any other ideas on how I can get that information to be added to the document data?

Here is my config in case that information would help:

<?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE xml>
./results/progressdir ./results/logsDir url 1 url 2 ./results/workDir 100 10000 200 DELETE title,keywords,description,content,document.reference, document.contentType, collector.referrer-link-text png,gif,jpg,jpeg,js,css,json,svg, xlsx, xls, pptx, ppt .*/runtime/.* .*/design/.* .*/templates/.* .*/error/.* true url goes here

essiembre commented 6 years ago

If it is to track deleted/bad documents, that is handled automatically. As long as you do not delete the crawl cache (i.e., your "workdir"), a document that can no longer be reached will trigger a deletion request for that document on CloudSearch. So you will only get "200" ones indexed. Do you have evidence to the contrary?

If you want to track bad links so you can fix them on your site, or for similar reasons, you can create a report on each run by using URLStatusCrawlerEventListener.
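
For reference, a listener entry along these lines should generate that report (class path and options as best I recall from the HTTP Collector 2.x documentation, and the status range and output directory are just examples, so double-check against your version):

<crawlerListeners>
    <listener class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
        <!-- Report only non-200 statuses (ranges are comma-separated) -->
        <statusCodes>100-199,300-599</statusCodes>
        <!-- Where the status report files get written -->
        <outputDir>./results/statuses</outputDir>
    </listener>
</crawlerListeners>

Each crawl run then produces a file listing crawled URLs with their HTTP status codes.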

jerrywithaz commented 6 years ago

Hmm... okay, I have been deleting the workdir before each crawl, so maybe that's the issue. What if a page is a 302 (redirect) and not a 400 error? Will those still be deleted? The client that I'm working with handles their 404s by redirecting to an error page, so broken pages actually return a 302. I can't seem to get those to be deleted, but maybe that's because I've been deleting the working directory.

I added 302 to my notFoundStatusCodes like so:


            <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher">
                <validStatusCodes>200</validStatusCodes>
                <notFoundStatusCodes>404,302</notFoundStatusCodes>
            </metadataFetcher>

            <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
                <validStatusCodes>200</validStatusCodes>
                <notFoundStatusCodes>404,302</notFoundStatusCodes>
            </documentFetcher>

I'm also using the spoiledReferenceStrategizer with the following settings:

            <spoiledReferenceStrategizer class="com.norconex.collector.core.spoil.impl.GenericSpoiledReferenceStrategizer">
                <mapping state="NOT_FOUND" strategy="DELETE"/>
                <mapping state="BAD_STATUS" strategy="DELETE"/>
                <mapping state="ERROR" strategy="DELETE"/>
            </spoiledReferenceStrategizer>

However, when I run the crawler, I see that the page gets rejected if it is a 302, but it doesn't get deleted from the search domain or get processed. Am I doing something wrong? Or is what I'm trying to achieve not currently possible?

The log looks like this: INFO [CrawlerEventManager] REJECTED_NOTFOUND: https://www.example.com/page

essiembre commented 6 years ago

Deletions are tracked on incremental crawls only. That is, the crawler will only send deletion requests for documents that were previously OK and turned bad on a subsequent crawl. It uses an internal cache (the crawl store) to figure that out. When you clean the workdir, you delete that cache, so everything is "fresh" and you will not get deletion requests on a fresh run.

In addition to handling deletions, incremental crawls will send only the valid documents that have changed to your Committer.

jerrywithaz commented 6 years ago

Thanks, essiembre, that's what I realized. Everything is actually working fine now; I just can't delete my old documents because, like you said, I deleted the workdir. But luckily, the CloudSearch domain I'm using is just for testing purposes. Thanks for your help!

essiembre commented 6 years ago

No problem! FYI, ticket #211 has a feature request to send deletion requests to the Committer for documents with a bad status. Once implemented, that new feature will still not work for documents that are no longer discoverable through crawling, so most 404s will not be caught. One workaround, for occasions when you need to start crawling from scratch, would be to create a seed file with all URLs extracted from your search engine. This would have all of them crawled for potential deletions and would clean up your search index. Can you see this being useful/practical?
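
To illustrate that workaround, HTTP Collector 2.x lets you point your start URLs at a plain text file of seeds, one URL per line; something like the following, where the file path is just a made-up example:

<startURLs>
    <!-- Plain text file containing one seed URL per line -->
    <urlsFile>./seeds/all-indexed-urls.txt</urlsFile>
</startURLs>

You would export every URL currently in your search index into that file before the fresh crawl, so the ones that are now bad can be picked up for deletion.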

jerrywithaz commented 6 years ago

Hey, sorry! I just saw your reply. Yes, I can see that being a very useful feature. Thanks for putting in the request!