Closed doaa-khaled closed 8 years ago
Sort of, yes, if you think of it in reverse. :-)
Since there can be many pages linking to the same page, there is no single "parent" page stored with each crawled page. This could become a feature request, but I am not sure how reliable it would be, given that the "parent" that led to a child may change from one crawl to another.
On the other hand, there is a way to find ALL parent pages for a single page, but it may not be straightforward, depending on how you store the crawled information.
When a page is crawled, all extracted URLs are stored with it in a field called collector.referenced-urls. So if you store that multi-value field along with the rest of your document data, you can search for a URL in that field and you will find all pages that have that URL in them.
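As a rough illustration of that reverse lookup, here is a minimal Python sketch. It assumes the crawled documents have already been loaded into plain dictionaries keyed by field name; the `docs` list, the `build_parent_index` helper, and the example.com URLs are all hypothetical and not part of the collector.

```python
from collections import defaultdict


def build_parent_index(docs):
    """Map each referenced (child) URL to the set of pages that link to it."""
    parents = defaultdict(set)
    for doc in docs:
        page = doc["document.reference"]
        # The field is multi-valued; pages without links may not have it at all.
        for child in doc.get("collector.referenced-urls", []):
            parents[child].add(page)
    return parents


# Hypothetical stored documents, for illustration only.
docs = [
    {"document.reference": "http://example.com/a.html",
     "collector.referenced-urls": ["http://example.com/c.html"]},
    {"document.reference": "http://example.com/b.html",
     "collector.referenced-urls": ["http://example.com/c.html",
                                   "http://example.com/a.html"]},
    {"document.reference": "http://example.com/c.html"},
]

index = build_parent_index(docs)
print(sorted(index["http://example.com/c.html"]))
```

If the documents live in a search engine instead, the same idea would be a query for the URL against the collector.referenced-urls field.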
Does this work for you?
No, I tried it, and the links I got in the result don't refer to the fetched link. I also noticed that all of them are "not found" pages. Here is my configuration:
<postParseHandlers>
  <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
    <fields>
      title,keywords,description,document.reference,document.contentType,collector.referenced-urls
    </fields>
  </tagger>
</postParseHandlers>
And this is a sample of the result I got:
document.reference=http\://***/help/../help/../help/../help/../help/catalog/ndtm.pdf
description=NIC Ship
collector.referenced-urls=http\://***/products/seriesAZ.asp^|~http\://***/help/faq.asp^|~http\://***/products/images/email.png^|~http\://***/privacy/index.asp^|~http\://***^|~http\://***/includes/twitter2.jpg^|~http\://***/livehelp/include/status.php^|~http\://***/help/links.asp^|~http\://***/includes/NIC_LOGO.jpg^|~http\://***/help/../help/../help/../help/../help/catalog/ndtm.pdf^|~http\://***/nic_forms/contactus.asp^|~http\://***/products/images/favorites.png^|~http\://***/images/spacer.gif^|~http\://***/nic_forms/newsletter.asp^|~http\://***/images/e-news1.gif^|~http\://***/products/images/print.png^|~http\://***/sales/contacts.asp
title=*** - Page Not Found
From what you pasted, the URLs in collector.referenced-urls are children of document.reference, so you have the relationship there.
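To work with that relationship outside the crawler, the pasted output line can be split back into individual child URLs. The sketch below is a minimal Python example that assumes the `^|~` separator and the `\:` escaping are exactly as they appear in the sample above; `parse_property_line` and the example.com URLs are hypothetical, and other escapes (such as `\=`) are not handled.

```python
# Separator between values, as it appears in the pasted sample (an assumption).
SEPARATOR = "^|~"


def parse_property_line(line):
    """Split a 'key=value' line into the key and its list of unescaped values."""
    key, _, raw = line.partition("=")  # split on the first '=' only
    values = [value.replace("\\:", ":") for value in raw.split(SEPARATOR)]
    return key, values


# Hypothetical line in the same shape as the sample output.
sample = (r"collector.referenced-urls=http\://example.com/help/faq.asp"
          r"^|~http\://example.com/privacy/index.asp")

key, urls = parse_property_line(sample)
for url in urls:
    print(key, "->", url)
```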
In this specific case, it seems your PDF URL was not found and a "Page Not Found" HTML page was returned instead. That error page probably had the links you see in collector.referenced-urls. Normally, a page that is not found returns an HTTP response code of 404. In your case, if it was crawled like any other document, it is likely because the response code was not 404 but a "valid" status code instead. You may want to report this to the site owner, or you may want to explicitly filter "not found" documents some other way.
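One possible way to filter such pages after the crawl is to check the title, since the sample above shows a title ending in "Page Not Found". This is only a heuristic sketch in Python, applied to already stored documents; the `looks_like_soft_404` helper and the regular expression are assumptions, and a site with different error-page titles would need a different pattern.

```python
import re

# Heuristic pattern for error-page titles; adjust it to the site being crawled.
NOT_FOUND_TITLE = re.compile(r"\b(page\s+)?not\s+found\b", re.IGNORECASE)


def looks_like_soft_404(doc):
    """Return True when the document title suggests a 'not found' page
    that was served with a non-404 status code."""
    title = doc.get("title", "")
    return bool(NOT_FOUND_TITLE.search(title))


# Hypothetical stored documents, for illustration only.
stored = [
    {"document.reference": "http://example.com/missing.pdf",
     "title": "Example - Page Not Found"},
    {"document.reference": "http://example.com/help/faq.asp",
     "title": "Frequently Asked Questions"},
]

kept = [doc for doc in stored if not looks_like_soft_404(doc)]
print([doc["document.reference"] for doc in kept])
```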
But is there a more straightforward way to get this relationship? Also, I only got a value for "collector.referenced-urls" when the page was not found; valid pages don't have a value for it!
Yes, there is. For some reason I was focused on finding all parents of a page, given there could always be more than one.
If what you are after is to keep the "referrer" URL only, this can be done by telling the GenericLinkExtractor you want to keep referrer data, like this:
<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" keepReferrerData="true"/>
Then you will end up with a bunch of new "parent" metadata fields:
Refer to the above link for what each of them is.
Does this work for you?
Yes, it works for me now. Thank you :)
I wonder if there is a way to get the URL of the page that contains the fetched URL.