Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem and storing it in various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

Convert anchor relative urls to full #154

Closed: yvesnyc closed this issue 8 years ago

yvesnyc commented 8 years ago

Hi,

When Norconex finds a relative anchor URL, such as the following snippet in http://www.mpfr.org/:

<a href="mpfr-current/#download">download</a>

it saves the source as: mpfr-current/#download _download

Is there a way to configure Norconex to reconstruct the complete URL as

http://www.mpfr.org/mpfr-current/#download or even http://www.mpfr.org/mpfr-current/

I know these URLs are stored in the metadata for the page, but I would like to find the complete URL at its actual location in the page text.

Furthermore, in the case of an embedded frame (<frame src=...>), can it do the same and produce a complete URL for an anchor, prefixed with the frame's src? I am assuming the parent page containing the embedded frame is captured first, and then the frame is captured individually.

Thanks,

Yves

essiembre commented 8 years ago

To keep the fragment with each URL, you can use the keepFragment flag on the GenericLinkExtractor, this way:

<linkExtractors>
  <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor" keepFragment="true"/>
</linkExtractors>

Is this what you were looking for or are you looking for a way to keep the URLs with the content extracted from HTML pages?

yvesnyc commented 8 years ago

Hi Pascal,

It is not the fragment that is the issue. This is a UrlNormalizer type of question. After extraction, an anchor with a relative URL looks like a bare path, e.g., /download. Normally there would have been a protocol, host, and path: http://norconex.com/collectors/collector-http/download. The protocol and host are clear signatures of a URL; a bare path is not. Look at the source in your browser for the collector-http/download page to see some relative URLs.

I could simply take the web page's Content-Location as the prefix, but I am not sure what would happen in the case of a frame with a src from another host. A relative URL inside the frame is resolved against the frame's src, not the web page's Content-Location.

I could hack a solution and use the ReplaceTransformer approach you gave me for the Norconex Importer issue "[importer] Html elements import" (#15): replace the anchor pattern with URL_START/URL_END bookends to signal where a (relative) URL exists, then match the URL using .endsWith() against all of the captured URLs in the metadata. However, this would not solve the frame problem.

The best solution would be a new UrlNormalizer setting to replace relative URLs with the full (protocol, host, etc.) prefix. This is preferred since Norconex always has that information handy.

Thanks,

Yves

essiembre commented 8 years ago

But the HTTP Collector already does that. When it extracts URLs, it makes them absolute. Do you have an example where it is not the case? A config sample maybe?

That's why I am wondering if you mean to convert URLs in the document content itself? Like this, assuming we are on http://example.com/path/page1.html:

Original document body:

<div>This is my document content with a <a href="page2.html#whatever">link</a> in it.</div>

Extracted content (currently):

This is my document content with a link in it.

Are you saying you want the URL to appear in the extracted content? Like this (without being clickable):

This is my document content with a http://example.com/path/page2.html#whatever link in it.

Otherwise, if that's not what you are after, then relative URLs should already be handled properly, and I would need a copy of your config with an example of what does not work in order to understand.

yvesnyc commented 8 years ago

Yes, to convert URLs in the document content itself.

Yes. I want the URL to appear in the extracted content. You got it!

Furthermore, how does Norconex extract content that contains frames with src URLs?

Yves

essiembre commented 8 years ago

Every URL is considered a distinct document, so the src URL of an IFrame will be crawled on its own, as a separate document.

As for keeping URLs in the extracted text, there is no simple flag to enable this right now. Either you would have to write your own parser, or try a workaround (or make this a feature request).

You could, for instance, use a ReplaceTransformer as an importer pre-parse handler to remove the <a href=" that surrounds the URL, so the URL appears as plain text.

This will not make relative URLs absolute, though. Another approach is to use the new ScriptTransformer: use JavaScript to extract the URL root and perform your search and replace on the raw HTML, converting <a ...> to just a plain-text URL.
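For example, here is a rough, untested sketch of that approach, assuming the variables ScriptTransformer exposes to scripts (reference, content) and that the value of the last evaluated expression becomes the new content; the regexes and base-URL handling are illustrative only:

<preParseHandlers>
  <transformer class="com.norconex.importer.handler.transformer.impl.ScriptTransformer">
    <script><![CDATA[
        // Resolve each <a href="..."> against the document reference
        // and leave the absolute URL behind as plain text.
        var ref  = String(reference);
        var text = String(content);
        var root = ref.replace(/^(https?:\/\/[^\/]+).*$/, '$1');
        var dir  = /^https?:\/\/[^\/]+\//.test(ref)
                ? ref.replace(/[^\/]*$/, '') : root + '/';
        text.replace(/<a\s[^>]*href=['"]([^'"]*)['"][^>]*>/gi,
            function(tag, url) {
                if (/^https?:\/\//i.test(url)) return ' ' + url + ' ';
                if (url.charAt(0) === '/')     return ' ' + root + url + ' ';
                return ' ' + dir + url + ' ';
            });
    ]]></script>
  </transformer>
</preParseHandlers>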

yvesnyc commented 8 years ago

I recognize that my request may not be common and therefore wouldn't be a high-priority feature request, and I need a solution now. So I worked on the ScriptTransformer approach to processing HTML. I created the script as a function in Node.js and emulated the reference and parsed variables using a closure.

The problem is that when I upgraded to "norconex-collector-http-2.3.0-SNAPSHOT", my original program started failing: the web page content produced by crawling had changed.

Here is what the committed data looked like for one URL's web page content:

"Content-Location" : [ "http://redis.io/download" ], "content" : [ "\n \n \n \n \t\n _ / _Home\n \n\t\n _ /commands _Commands\n \n\t\n _ /clients _Clients\n \n\t\n _ /documentation _Documentation\n \n\t\n _ /community _Community\n \n\t\n _ /download _Download\n \n\t\n _ /support _Support\n \n\t\n _ /topics/license _License\n \n\n\n \n \n\n \n \n \n \n \n \n \n _ / _\n \n \n \n\n \n _ / _\n \n \n _ /commands _Commands\n _ /clients _Clients\n _ /documentation _Documentation\n _ /community _Community\n _ /download _Download\n _ /support _Support\n _ /topics/license _License\n \n\n \n \n \n \n Join us in London October 19th for the _ https://www.eventbrite.com/e/2nd-annual-redis-unconference-tickets-18652995612 _Redis Unconference London.\n \n\n \n \n \n _ #download _*Download\n\n \n Redis uses a standard practice for its versioning:\n major.minor.patchlevel.\n An even\n minor\n marks a\n stable\n release, like 1.2, 2.0, 2.2, 2.4, 2.6,\n 2.8. Odd minors are used for\n unstable\n releases, for example 2.9.x releases\n See all ... _ /topics/sponsors _credits.\n \n\n \n Sponsored by\n _ https://redislabs.com/ _\n \n \n \n\n \n\n \n \n\n " ]

Here is what the same page content looks like after upgrading to 2.3.0-SNAPSHOT:

"Content-Location" : [ "http://redis.io/download" ], "content" : [ "\n \n \n \n \t\n _ https://redislabs.com/ _\n \n \n \n\n\n\n \n \n \n " ]

That's all of it :(

Many other pages had this same truncated output.

I used the same Extension and Regex filters.

val extensionFilt = new ExtensionReferenceFilter("tar,TAR,zip,ZIP,rpm,RPM,gz,GZ,tgz,TGZ,ppt,PPT,mpg,MPG,jpg,JPG,gif,GIF,png,PNG,ico,ICO,css,CSS,js,JS,sit,SIT,eps,EPS,wmf,WMF,xls,XLS,mov,MOV,exe,EXE,jpeg,JPEG,bmp,BMP",com.norconex.importer.handler.filter.OnMatch.EXCLUDE)
val regexFilt = new RegexReferenceFilter("http://([a-z0-9]+\\.)*redis\\.io(/.*)?",com.norconex.importer.handler.filter.OnMatch.INCLUDE)

I switched back and forth between releases to verify this. The only changes needed after upgrading were the API signatures for setCrawlerListeners, setStartURLs, and setReferenceFilters.

The only other specialization used in both versions is in the config.xml:

<preParseHandlers>
    <!-- The tagger below is normally commented out. -->
    <!--
    <tagger class="com.norconex.importer.handler.tagger.impl.DebugTagger" logLevel="WARN"></tagger>
    -->
    <transformer class="com.norconex.importer.handler.transformer.impl.ReplaceTransformer">
        <replace>
            <fromValue><![CDATA[<a .*href=['"]([^"']*)['"][^>]*>]]></fromValue>
            <toValue>_ $1 _</toValue>
        </replace>
    </transformer>
</preParseHandlers>

Any ideas?

essiembre commented 8 years ago

Nothing jumps out at me right now. Can you attach your full config so I can reproduce?

yvesnyc commented 8 years ago

Found the bug in my code. It was a missing '?' in the regex, which made an intended lazy quantifier greedy. It was so greedy it ate everything up until the last href. You can see it in my previous comment, in the ReplaceTransformer XML. Somehow the previous version of Norconex tolerated it.
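The corrected pattern just adds the '?':

<replace>
    <fromValue><![CDATA[<a .*?href=['"]([^"']*)['"][^>]*>]]></fromValue>
    <toValue>_ $1 _</toValue>
</replace>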

I also got ScriptTransformer to work! Good stuff. It removes the need for ReplaceTransformer in my case.

Question: why do you need the parsed variable? If the ScriptTransformer is in the <preParseHandlers> section, parsed is always false, and the ScriptTransformer does not work outside pre/postParseHandlers.

Feature request: a useful feature would be to allow configuration of pre-loaded scripts, such as jQuery.js or other JavaScript libraries. I think it is just a ScriptEngine "eval" call. The programmer will be happy not having to drop down to basic JavaScript to do the work.
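To illustrate what I mean with plain javax.script (the library path and user script here are hypothetical):

import java.io.FileReader;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

public class PreloadExample {
    public static void main(String[] args) throws Exception {
        ScriptEngine engine =
                new ScriptEngineManager().getEngineByName("JavaScript");
        // Pre-load a library once, before any user script runs...
        engine.eval(new FileReader("lib/underscore.js"));
        // ...so the user script can rely on the library's globals.
        engine.eval("print(_.VERSION);");
    }
}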

Thanks.

essiembre commented 8 years ago

Great! I am glad to hear you are making good use of this feature (script).

About the parsed variable: it is not useful in most cases, but it is passed anyway, just in case a handler needs it. For instance, someone could implement a handler (or a script) that supports being used both as a pre-parse handler and a post-parse handler but wants to adopt different behavior for each. In such a case it would need to know which phase it is running in.
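For example, a single contrived script could branch on it (illustrative only):

<script><![CDATA[
    // 'parsed' is false for a pre-parse handler (raw HTML) and true
    // for a post-parse handler (extracted text).
    var text = String(content);
    parsed ? text : text.replace(/<[^>]+>/g, ' ');
]]></script>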

For the pre-loading of scripts, I created new feature request #160.

essiembre commented 8 years ago

Forgot to ask... now that this is working for you, can we close this one?

yvesnyc commented 8 years ago

One more related question: is there any way to listen for crawler events such as DOCUMENT_PREIMPORTED and get a string version of the web page? This string could then be transformed, updating the crawler's own representation before it is written to disk, etc.

If we can get pre- or even post-import web content, transform the string, and update the crawler's copy, then we could write Java code, specifically JSoup, to parse the HTML and manipulate it.

essiembre commented 8 years ago

Events are not the way to manipulate the documents. If you want to do so, the best approach usually is to implement your own <transformer> for the Import module (look at existing implementations for examples).

If for some reason you need to do it outside the Import module, you can look at <preImportProcessors> and <postImportProcessors> for the HTTP Collector.

Those options are found in the documentation pages for the Importer configuration and HTTP Collector configuration respectively.
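Since you mentioned JSoup, for illustration only, a custom transformer could look roughly like this (a sketch assuming the Importer's IDocumentTransformer interface and Jsoup's abs:href resolution; the class name is hypothetical):

import java.io.InputStream;
import java.io.OutputStream;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import com.norconex.importer.doc.ImporterMetadata;
import com.norconex.importer.handler.ImporterHandlerException;
import com.norconex.importer.handler.transformer.IDocumentTransformer;

public class LinkFlattenTransformer implements IDocumentTransformer {
    @Override
    public void transformDocument(String reference, InputStream input,
            OutputStream output, ImporterMetadata metadata, boolean parsed)
            throws ImporterHandlerException {
        try {
            // Parse with the document reference as base URI so that
            // abs:href resolves relative links to absolute URLs.
            Document doc = Jsoup.parse(input, "UTF-8", reference);
            for (Element a : doc.select("a[href]")) {
                // Replace each anchor with its absolute URL plus the
                // original link text, as plain text.
                a.replaceWith(new TextNode(
                        " " + a.attr("abs:href") + " " + a.text() + " ", ""));
            }
            output.write(doc.outerHtml().getBytes("UTF-8"));
        } catch (Exception e) {
            throw new ImporterHandlerException(e);
        }
    }
}

It would then be declared in the config like any other handler:

<preParseHandlers>
    <transformer class="com.example.LinkFlattenTransformer"/>
</preParseHandlers>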

yvesnyc commented 8 years ago

I understand.

I think, for my use, the <transformer> is the best way.

Thanks again.
