Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Is there a way to get the parent url of fetched url ? #275

Closed doaa-khaled closed 8 years ago

doaa-khaled commented 8 years ago

I wonder if there is a way to get the URL of the page that contains the fetched URL.

essiembre commented 8 years ago

Sort of, yes, if you think of it in reverse. :-)

Since there can be many pages linking to the same page, there is no single "parent" page stored with each crawled page. This could become a feature request, but I am not sure how reliable it would be, given that the "parent" that led to a child may change from one crawl to another.

On the other hand, there is a way to find ALL parent pages of a single page, but it may not be as straightforward, depending on how you store the crawled information.

When a page is crawled, all URLs extracted from it are stored with it in a field called collector.referenced-urls. So if you store that multi-value field along with the rest of your document data, you can search for a URL in that field and you will find all pages that have that URL in them.
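To illustrate the "in reverse" idea, here is a minimal Python sketch that inverts the page-to-children relationship into a child-to-parents lookup. It assumes you have already loaded your stored documents as plain dictionaries; the `build_parent_index` helper and the example.com URLs are hypothetical and not part of the collector.

```python
from collections import defaultdict


def build_parent_index(docs):
    """Map each referenced (child) URL to the set of pages that link to it."""
    parents = defaultdict(set)
    for doc in docs:
        page = doc["document.reference"]
        # Pages without extracted links simply contribute nothing.
        for child in doc.get("collector.referenced-urls", []):
            parents[child].add(page)
    return parents


# Hypothetical stored documents: two pages that both link to the same PDF.
docs = [
    {
        "document.reference": "http://example.com/a.html",
        "collector.referenced-urls": [
            "http://example.com/c.pdf",
            "http://example.com/b.html",
        ],
    },
    {
        "document.reference": "http://example.com/b.html",
        "collector.referenced-urls": ["http://example.com/c.pdf"],
    },
]

parents = build_parent_index(docs)
# All pages that contain a link to c.pdf:
print(sorted(parents["http://example.com/c.pdf"]))
```

If the documents live in a search engine instead, the same lookup is just a query for the child URL against the stored collector.referenced-urls field.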

Does this work for you?

doaa-khaled commented 8 years ago

No, I tried it, but the links in the result do not refer to the fetched link, and I noticed that all of them are "not found". Here is my configuration:

<postParseHandlers>
      <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
           <fields>
title,keywords,description,document.reference,document.contentType,collector.referenced-urls
           </fields>
      </tagger>
</postParseHandlers>

And this is a sample of the result I got:

document.reference=http\://***/help/../help/../help/../help/../help/catalog/ndtm.pdf
description=NIC Ship
collector.referenced-urls=http\://***/products/seriesAZ.asp^|~http\://***/help/faq.asp^|~http\://***/products/images/email.png^|~http\://***/privacy/index.asp^|~http\://***^|~http\://***/includes/twitter2.jpg^|~http\://***/livehelp/include/status.php^|~http\://***/help/links.asp^|~http\://***/includes/NIC_LOGO.jpg^|~http\://***/help/../help/../help/../help/../help/catalog/ndtm.pdf^|~http\://***/nic_forms/contactus.asp^|~http\://***/products/images/favorites.png^|~http\://***/images/spacer.gif^|~http\://***/nic_forms/newsletter.asp^|~http\://***/images/e-news1.gif^|~http\://***/products/images/print.png^|~http\://***/sales/contacts.asp
title=*** - Page Not Found

essiembre commented 8 years ago

From what you pasted, the URLs in collector.referenced-urls are children of document.reference so you have the relationship there.

In this specific case, it seems your PDF URL was not found and a "Page Not Found" HTML page was returned instead. That error page probably had the links you see in collector.referenced-urls. Normally, a page that is not found returns an HTTP response code of 404. In your case, if it was crawled like any other document, it is likely because the response code was not 404 but a "valid" status code instead. You may want to report this to the site owner, or you may want to explicitly filter "not found" documents some other way.
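As one possible "some other way", here is a rough Python sketch that post-processes records shaped like the sample above and flags error pages by their title. It assumes that `^|~` is the multi-value separator and that `\:` is an escaped colon, as the pasted output suggests; the `parse_record` and `is_soft_404` helpers and the example.com URLs are made up for illustration.

```python
MULTI_VALUE_FIELD = "collector.referenced-urls"
SEPARATOR = "^|~"  # assumed multi-value separator, taken from the sample output


def parse_record(lines):
    """Parse 'key=value' lines into a dict, splitting the multi-value field."""
    record = {}
    for line in lines:
        key, _, value = line.partition("=")
        value = value.replace("\\:", ":")  # undo the escaped colons
        if key == MULTI_VALUE_FIELD:
            record[key] = value.split(SEPARATOR)
        else:
            record[key] = value
    return record


def is_soft_404(record, marker="Page Not Found"):
    """True when the title looks like an error page served with a valid status."""
    return marker.lower() in record.get("title", "").lower()


# Hypothetical record in the same shape as the sample above.
raw = [
    r"document.reference=http\://example.com/help/catalog/ndtm.pdf",
    r"collector.referenced-urls=http\://example.com/help/faq.asp^|~http\://example.com/sales/contacts.asp",
    "title=Example - Page Not Found",
]

record = parse_record(raw)
if is_soft_404(record):
    print("Skipping error page:", record["document.reference"])
```

Matching on the title is fragile, since it depends on the wording the site uses for its error page, so treat it as a workaround until the site returns a proper 404.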

doaa-khaled commented 8 years ago

But is there a more straightforward way to get this relationship? Also, I only get a value for "collector.referenced-urls" when the page is not found; valid pages don't have a value for it!

essiembre commented 8 years ago

Yes there is, I was focused on finding all parents of a page for some reason, given there could always be more than one.

If what you are after is to keep the "referrer" URL only, this can be done by telling the GenericLinkExtractor you want to keep referrer data, like this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" keepReferrerData="true"/>

Then you will end up with a bunch of new "parent" metadata fields:

Refer to the above link for what each of them are.

Does this work for you?

doaa-khaled commented 8 years ago

yes it works for me now .. thank you :)