Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

URL interpret issue #540

Closed dtcyad1 closed 5 years ago

dtcyad1 commented 5 years ago

Hi,

I have several URLs with this format. Is the hyphen causing the issue? How do I get around it? I cannot change the URL format.

website_test: 2018-12-02 18:56:08 INFO -          DOCUMENT_FETCHED: https://test.com/information-technology/it-blog/-/blogs/flying-phish
website_test: 2018-12-02 18:56:08 INFO -       CREATED_ROBOTS_META: https://test.com/information-technology/it-blog/-/blogs/flying-phish
website_test: 2018-12-02 18:56:08 INFO -            REJECTED_ERROR: https://test.com/information-technology/it-blog/-/blogs/flying-phish (com.norconex.commons.lang.url.URLException: Could not interpret URL: javascript:wnxf_showForm('wnxf_postReplyForm1', '_33_wnxf_postReplyBody1'); wnxf_hideForm('wnxf_editForm1', '_33_wnxf_editReplyBody1', 'Good\x20to\x20know\x2c\x20thank\x20you\x2c\x20Elizabeth\x21');)
website_test: 2018-12-02 18:56:08 INFO - website_test: Could not process document: https://test.com/information-technology/it-blog/-/blogs/flying-phish (Could not interpret URL: javascript:wnxf_showForm('wnxf_postReplyForm1', '_33_wnxf_postReplyBody1'); wnxf_hideForm('wnxf_editForm1', '_33_wnxf_editReplyBody1', 'Good\x20to\x20know\x2c\x20thank\x20you\x2c\x20Elizabeth\x21');)

Thanks

dtcyad1 commented 5 years ago

Hi Pascal,

To add to that: the crawler does fetch the document, as the logs above show. Inside the document there are href attributes that start with "javascript:". I tried adding this:

<extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
  <schemes>http,https,javascript</schemes>
</extractor>

but that did not work either. I don't want the javascript hrefs to be processed, but I do want the original page to be processed, with any href starting with "javascript" ignored. Is there something I have missed, or do I have to write a custom LinkExtractor (assuming that is where the links are processed in this case)? Once this error happens, the current page is skipped entirely.

com.norconex.commons.lang.url.URLException: Could not interpret URL: javascript:mgdy_showForm('mgdy_postReplyForm0', '_33_mgdy_postReplyBody0');
    at com.norconex.commons.lang.url.HttpURL.<init>(HttpURL.java:113)
    at com.norconex.commons.lang.url.HttpURL.<init>(HttpURL.java:80)
    at com.norconex.collector.http.crawler.URLCrawlScopeStrategy.isInScope(URLCrawlScopeStrategy.java:123)
    at com.norconex.collector.http.pipeline.importer.LinkExtractorStage.executeStage(LinkExtractorStage.java:89)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:31)
    at com.norconex.collector.http.pipeline.importer.AbstractImporterStage.execute(AbstractImporterStage.java:24)
    at com.norconex.commons.lang.pipeline.Pipeline.execute(Pipeline.java:91)
    at com.norconex.collector.http.crawler.HttpCrawler.executeImporterPipeline(HttpCrawler.java:361)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextQueuedCrawlData(AbstractCrawler.java:538)
    at com.norconex.collector.core.crawler.AbstractCrawler.processNextReference(AbstractCrawler.java:419)
    at com.norconex.collector.core.crawler.AbstractCrawler$ProcessReferencesRunnable.run(AbstractCrawler.java:820)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.MalformedURLException: unknown protocol: javascript
    at java.net.URL.<init>(URL.java:600)
    at java.net.URL.<init>(URL.java:490)
    at java.net.URL.<init>(URL.java:439)
    at com.norconex.commons.lang.url.HttpURL.<init>(HttpURL.java:111)

Thanks -dtcyad1

essiembre commented 5 years ago

Can you share your config with the URL of the rejected page to reproduce? It looks like the page is rejected if there are javascript URLs it cannot interpret, but these URLs should not be captured if you take out the "javascript" scheme.

dtcyad1 commented 5 years ago

Hi Pascal, that's what I thought too. GenericLinkExtractor is used by default even if I don't add it to the config file, correct? Its default schemes do not include javascript, so it should ignore those links out of the box without processing them.

Here is my config file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xml>
<!-- 
   Copyright 2010-2017 Norconex Inc.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
-->
<httpcollector id="website_test">

    #set($workdir = "workdir")

  <!-- Decide where to store generated files. -->
  <progressDir>$workdir/progress</progressDir>
  <logsDir>$workdir/logs</logsDir>

  <crawlers>
    <crawler id="website_test">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
          <url>https://test.com/information/blog</url>
      </startURLs>

      <documentFetcher>
        <validStatusCodes>200,301,302</validStatusCodes>
      </documentFetcher> 

      <linkExtractors>
       <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
          <schemes>http,https</schemes>
        </extractor>
      </linkExtractors>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
                onMatch="include">
          https://test.com/information/blog.*
        </filter>
      </referenceFilters>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>$workdir</workDir>

      <maxDepth>-1</maxDepth>
      <maxDocuments>-1</maxDocuments>
      <canonicalLinkDetector ignore="false" />
      <sitemapResolverFactory ignore="true" />
      <orphansStrategy>DELETE</orphansStrategy>
      <delay default="0" />
      <numThreads>1</numThreads>

      <referenceFilters>
        <filter class="com.norconex.collector.core.filter.impl.ExtensionReferenceFilter"
                onMatch="exclude">jpg,gif,png,ico,css,js,svg</filter>
      </referenceFilters>

      <!-- DEBUG ONLY -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>$workdir/crawledFiles</directory>
      </committer>
      <importer>
        <postParseHandlers>
          <tagger class="com.norconex.importer.handler.tagger.impl.RenameTagger">
            <rename fromField="og:site_name" toField="website" overwrite="true" />
            <rename fromField="og:type" toField="sourcetype" overwrite="true" />
            <rename fromField="og:url" toField="targeturl" overwrite="true" />
          </tagger>
        </postParseHandlers>
      </importer>

    </crawler>
  </crawlers>

</httpcollector>

Thanks

essiembre commented 5 years ago

I cannot reproduce. I am afraid you will have to share the real URL to https://test.com/information-technology/it-blog/-/blogs/flying-phish. If sensitive, you can send it to me by email, making sure to reference this ticket.

dtcyad1 commented 5 years ago

Hi Pascal, unfortunately that site is behind a firewall. The only way I could resolve this was by creating my own LinkExtractor, which is basically a copy of the generic one with these lines added:

if (!link.getUrl().startsWith("javascript")) {
    links.add(link);
}

in the extractLinks method. Probably not the cleanest approach, but it gets the job done!

This prevents the javascript link from being added for further processing.
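In case it helps anyone else, the guard boils down to something like this standalone sketch. The class and method names here are illustrative only, not part of the Norconex API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the guard added inside the copied extractLinks method:
// any extracted link whose URL starts with "javascript" is dropped
// before it reaches the crawler pipeline.
public class JavascriptLinkFilter {

    // Keep only URLs that do not use the javascript: pseudo-scheme.
    public static List<String> keepCrawlable(List<String> urls) {
        return urls.stream()
                .filter(u -> !u.trim().toLowerCase().startsWith("javascript"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> extracted = Arrays.asList(
                "https://test.com/information/blog",
                "javascript:wnxf_showForm('wnxf_postReplyForm1');");
        System.out.println(keepCrawlable(extracted));
        // prints [https://test.com/information/blog]
    }
}
```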

You can close this issue for now.

Thanks

essiembre commented 5 years ago

Before I close, I would like to reproduce. Can you please share your HTTP Collector version? Can you also give an HTML snippet where such a link is found? For example, I tried with this HTML:

<a href="javascript:some_function('some_arg', 'another_arg');">Must not be extracted</a>

GenericLinkExtractor does not attempt to add that one.

I suspect you may be using an older version or your <a ...> tag has something else to it.

dtcyad1 commented 5 years ago

Hi Pascal,

here is the HTML output containing the JavaScript link:

<span class="test-link">
  <a href="javascript&#x3a;abcd_Comments&#x28;true&#x29;&#x3b;" class=" tag-msg" id="_33_a__col1__1" >
    <img id="a__col1__1" src="https://test.com/spacer.png"  alt="" style="background-image: url('https://test.com/_test.png'); background-position: 50% -1558px; background-repeat: no-repeat; height: 16px; width: 16px;" />
    <span class="taglib-text ">Test Comments</span>
  </a>
</span>
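Note that the href above is HTML-entity-encoded (&#x3a; is the colon), so the string "javascript:" only appears after entity decoding. This small sketch (a hypothetical hex-entity decoder written for illustration, not Norconex code) shows the decoded value:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal decoder for hexadecimal numeric character references (&#xNN;),
// just enough to reveal the scheme hidden in the entity-encoded href.
public class HexEntityDecoder {

    private static final Pattern HEX_ENTITY =
            Pattern.compile("&#x([0-9a-fA-F]+);");

    public static String decode(String s) {
        Matcher m = HEX_ENTITY.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("javascript&#x3a;abcd_Comments&#x28;true&#x29;&#x3b;"));
        // prints javascript:abcd_Comments(true);
    }
}
```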

Can you please see if there is anything here that prevents the javascript link from being ignored?

Thanks

essiembre commented 5 years ago

I was finally able to reproduce this and made a fix in the latest snapshot. Such invalid URLs should no longer force a crawler execution to stop. Please give it a try and confirm.

dtcyad1 commented 5 years ago

Hi Pascal,

it works now!!

Thanks