Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

norconex-collector-http #363

Closed shreya-singh-tech closed 4 years ago

shreya-singh-tech commented 7 years ago

Also, I need to make a crawler that starts from google.com, goes through each page for a particular word given as the search criterion, and extracts just the plain text from, let's say, the first 6 search results. This is my code as of now:

<?xml version="1.0" encoding="UTF-8"?>

#set($workdir = "/home/Downloads/norconex-collector-http-2.7.1") ${workdir}/progress ${workdir}/logs https://www.google.co.in/ ${workdir} 1 2 true 404 /report/path/ brokenLinks .*/login/.* .*/login/.* .*apple.*

This needs a bit of editing, as I want only the text part of each page and not the images, GIFs, or tables. Could you suggest some edits?

essiembre commented 7 years ago

The HTTP Collector will extract just the raw text when parsing HTML, which seems to be what you are after. Are you experiencing anything different?

If you do not want images to be crawled, remove your <tag name="img" attribute="src" /> entry from the GenericLinkExtractor.
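
For example, a link extractor section without that entry could look like this (adjust the tag list to your needs):

    <linkExtractors>
      <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
        <tags>
          <tag name="a" attribute="href" />
          <tag name="frame" attribute="src" />
          <tag name="iframe" attribute="src" />
          <tag name="meta" attribute="http-equiv" />
          <tag name="script" attribute="src" />
        </tags>
      </extractor>
    </linkExtractors>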

shreya-singh-tech commented 7 years ago

No, I wanted to make sure that it goes after all those texts which have the word "apple", in the order in which they appear on the Google page. Will this code work for that?

essiembre commented 7 years ago

It looks like it should work, but the way to find out is to give it a try. :-)

The pages should be crawled in the order they are found, but given that different threads are used by default, there is no guarantee. Reducing numThreads to 1 could help there.
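
In the crawler section of your config that would simply be:

    <numThreads>1</numThreads>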

shreya-singh-tech commented 7 years ago

Okay, thank you. I'll get back to you after giving it a shot.

shreya-singh-tech commented 7 years ago

It isn't working! The output is: Invalid configuration file path: /home/iitp/Downloads/norconex-collector-http-2.7.1/document-collector.xml. The path is correct, but it cannot seem to read the file.

shreya-singh-tech commented 7 years ago

This is my XML file (there are some changes from the previous one):

<?xml version="1.0" encoding="UTF-8"?>

./document-collector/output/progress ./document-collector/output/logs https://www.google.co.in/?gws_rd=ssl#q=apple ./document-collector/output 1 2 true 404 /report/path/ brokenLinks .*/login/.* .*/login/.* .*apple.* < title,keywords,description,document.reference ./document-collector/output/docs
shreya-singh-tech commented 7 years ago

Okay, so I renamed the file properly once again and it isn't showing any error; but there is no output/docs folder getting created. All the other log files etc. are being created, except the crawled data; it seems like it is not getting inside each link of Google as I want it to. This is a slightly edited version:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="Collect Documents">
      <progressDir>./document-collector/output/progress</progressDir>
    <logsDir>./document-collector/output/logs</logsDir>
        <crawlers>
    <crawler id="extract documents">
        <startURLs>
            <url>https://www.google.co.in/?gws_rd=ssl#q=apple</url>

        </startURLs>
        <workDir>./document-collector/output</workDir>

       <!-- Increase to match your site. -->
       <maxDepth>1</maxDepth>

       <!-- Hit interval, in milliseconds. -->
       <delay default="1000" />

       <numThreads>2</numThreads>

       <robotsTxt ignore="true" />

        <keepDownloads>true</keepDownloads>

        <crawlerListeners>
            <listener  
            class="com.norconex.collector.http.crawler.event.impl.URLStatusCrawlerEventListener">
            <statusCodes>404</statusCodes>
            <outputDir>/report/path/</outputDir>
            <fileNamePrefix>brokenLinks</fileNamePrefix>
        </listener>
        </crawlerListeners>

        <referenceFilters>
            <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
            .*/login/.*
     </filter> 
        </referenceFilters>

        <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" /> 
        <metadataFilters>
             <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
            .*/login/.*
     </filter>
        </metadataFilters>

        <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="true" detectCharset="true"/>

        <linkExtractors>
            <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
            <tags>
            <tag name="a" attribute="href" />
            <tag name="frame" attribute="src" />
            <tag name="iframe" attribute="src" />
            <tag name="meta" attribute="http-equiv" />
            <tag name="script" attribute="src" />
            </tags>
    </extractor>
        </linkExtractors>

        <importer>
            <!-- refer to Importer documentation -->
            <preParseHandlers>
        <!-- These tags can be mixed, in the desired order of execution. -->
        <tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
        <characterCase fieldName="title" type="lower" applyTo="field" />
         <characterCase fieldName="title" type="string" applyTo="value" />
        </tagger>

         <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
          onMatch="include" >
         <regex>.*apple.*</regex>
    </filter>
    </preParseHandlers>  
    <postParseHandlers><!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
<tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger"><fields>title,keywords,description,document.reference</fields></tagger></postParseHandlers>      
    </importer>
    <committer class="com.norconex.committer.core.impl.FileSystemCommitter"><directory>./document-collector/output/docs</directory></committer>
            </crawler>
     </crawlers>
 </httpcollector>
essiembre commented 7 years ago

The start URL you are using will load a page that relies heavily on JavaScript to generate the links. If you view the source, you will see the links are not there. To process JavaScript-generated content, you can use the PhantomJSDocumentFetcher.
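
For example, a minimal sketch of swapping in that fetcher (the exePath value here is an assumption; point it to wherever PhantomJS is installed):

    <documentFetcher class="com.norconex.collector.http.fetch.impl.PhantomJSDocumentFetcher">
      <exePath>/usr/local/bin/phantomjs</exePath>
    </documentFetcher>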

I would recommend a simpler approach. You can bring up a version of the Google results that does not rely on JavaScript. To do so, try changing your start URL to the following:

https://www.google.co.in/search?q=apple
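
In your startURLs section that would be:

    <startURLs>
      <url>https://www.google.co.in/search?q=apple</url>
    </startURLs>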

Finally, I noticed your config is loading Google's sitemap.xml. Since you probably do not want to crawl the Google site, I recommend you set this:

<sitemapResolverFactory ignore="true" />
shreya-singh-tech commented 7 years ago

What if I do not want Wikipedia sites to be crawled? I know the response would be to use an httpURLFilter and exclude Wikipedia sites, but it isn't working.

essiembre commented 7 years ago

You can add a new RegexReferenceFilter to your referenceFilters. If that is not working, please paste what you tried.
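
For example, something like this alongside your existing filter (the regex is just a sketch; adjust it to the Wikipedia URLs you want to exclude):

    <referenceFilters>
      <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
              onMatch="exclude">
        .*wikipedia\.org.*
      </filter>
    </referenceFilters>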

shreya-singh-tech commented 7 years ago

this is what I tried after your suggestion:

<?xml version="1.0" encoding="UTF-8"?>
<httpcollector id="doc collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./doc/output/progress</progressDir>
  <logsDir>./doc/output/logs</logsDir>

  <crawlers>
    <crawler id="extract documents">
        <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
            <url>https://www.google.co.in/?hl=en#q=india+is+a+big</url>
    </startURLs>
      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./doc/output</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>1</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

    <referenceFilters>
            <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
            .*/login/.*
     </filter>

     <filter class="com.norconex.collector.http.filter.impl.RegexReferenceFilter"
                onMatch="exclude" >
          http://en\.wikipedia\.org/wiki/.*
        </filter>
        </referenceFilters>

      <!-- Document importing -->
      <importer>

    <preParseHandlers>

    <!-- to filter on URL extension: -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.reference">
      .*(pdf|xls|xlsx|doc|docx|ppt|pptx)$
    </filter>

    <!-- to filter on content type (probably best if your URLs do not always have an extension): -->
    <filter class="com.norconex.importer.handler.filter.impl.RegexMetadataFilter"
        onMatch="include" field="document.contentType">
      (application/pdf|anotherOne|yetAnotherOne|etc)
    </filter>

  </preParseHandlers>

        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./doc/output</directory>
      </committer>

     </crawler>
  </crawlers>

</httpcollector>

which generates the following error:

com.norconex.collector.core.CollectorException: Cannot load crawler configurations.
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:90)
    at com.norconex.collector.core.AbstractCollectorConfig.loadFromXML(AbstractCollectorConfig.java:304)
    at com.norconex.collector.core.CollectorConfigLoader.loadCollectorConfig(CollectorConfigLoader.java:78)
    at com.norconex.collector.core.AbstractCollectorLauncher.loadCommandLineConfig(AbstractCollectorLauncher.java:140)
    at com.norconex.collector.core.AbstractCollectorLauncher.launch(AbstractCollectorLauncher.java:92)
    at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:75)
Caused by: com.norconex.commons.lang.config.ConfigurationException: This class could not be instantiated: "com.norconex.collector.http.filter.impl.RegexReferenceFilter".
    at com.norconex.commons.lang.config.XMLConfigurationUtil.newInstance(XMLConfigurationUtil.java:200)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.newInstance(XMLConfigurationUtil.java:277)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.newInstance(XMLConfigurationUtil.java:241)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.newInstance(XMLConfigurationUtil.java:169)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadReferenceFilters(AbstractCrawlerConfig.java:378)
    at com.norconex.collector.core.crawler.AbstractCrawlerConfig.loadFromXML(AbstractCrawlerConfig.java:304)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.loadFromXML(XMLConfigurationUtil.java:456)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfig(CrawlerConfigLoader.java:120)
    at com.norconex.collector.core.crawler.CrawlerConfigLoader.loadCrawlerConfigs(CrawlerConfigLoader.java:80)
    ... 5 more
Caused by: java.lang.ClassNotFoundException: com.norconex.collector.http.filter.impl.RegexReferenceFilter
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at com.norconex.commons.lang.config.XMLConfigurationUtil.newInstance(XMLConfigurationUtil.java:198)
    ... 13 more

What should I do?

essiembre commented 7 years ago

Replace http with core in the class attribute, just like you did for your first filter.
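
That is, the second filter becomes:

    <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
      http://en\.wikipedia\.org/wiki/.*
    </filter>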

shreya-singh-tech commented 7 years ago

Since I was having so many issues with the filters, I decided to discard all of them and just move forward with the basic crawling that was going on initially; but suddenly it is unable to crawl from Google. If I give a specific URL, the crawling works, but if the start URL is Google, it does not. In the terminal the URL is being identified, but right after that the sitemap closes and there are no results. This is my configuration:

<?xml version="1.0" encoding="UTF-8"?>

<httpcollector id="doc collector">

  <!-- Decide where to store generated files. -->
  <progressDir>./doc/output/progress</progressDir>
  <logsDir>./doc/output/logs</logsDir>

  <crawlers>
    <crawler id="extract documents">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.google.co.in/?gws_rd=ssl#q=apple+fruit</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./doc/output</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>2</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./doc/output</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 7 years ago

See my previous answer to this ticket that explains why you get nothing, and how to change your start URL so you get something.

Also, when you make config changes, you may want to start fresh by deleting your "workdir" (./doc/output in your case).

shreya-singh-tech commented 7 years ago

Sir, it is still not working. I tried with some other search engines like Bing, yet it failed.

essiembre commented 7 years ago

With the modified start URLs it is still not working? Do you get errors? What does the log tell you? I tested with your original config and modified start URL and it was working for me.

shreya-singh-tech commented 7 years ago

Sir, it doesn't show any error. The start URL is detected, but then immediately the next line is about closing the sitemap and it is over. All the other files are created in the output directory, but no crawled data is found.

essiembre commented 7 years ago

Can you attach your last config with modified start URL to reproduce?

shreya-singh-tech commented 7 years ago

I don't have it with me right now. I'll attach it tomorrow morning.

shreya-singh-tech commented 7 years ago

Sorry for the delay, sir.

<?xml version="1.0" encoding="UTF-8"?>

  <httpcollector id="doc collector">

   <!-- Decide where to store generated files. -->
  <progressDir>./docs/output/progress</progressDir>
  <logsDir>./docs/output/logs</logsDir>

    <crawlers>
    <crawler id="extract documents">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs stayOnDomain="true" stayOnPort="true" stayOnProtocol="true">
        <url>https://www.google.co.in/search?q=apple</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./docs/output</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>2</maxDepth>

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

      <!-- Document importing -->
      <importer>
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./docs/output</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>
essiembre commented 7 years ago

Without trying it, I see a problem with this: stayOnDomain="true" stayOnPort="true" stayOnProtocol="true".

This will force the crawler to only crawl Google pages and may be part of your problem.

shreya-singh-tech commented 7 years ago

I removed it, but the same problem remains.

essiembre commented 7 years ago

I gave it a try and could see in the logs links are rejected by Google robots.txt rules. To bypass that, you can set somewhere in your crawler section:

   <robotsTxt ignore="true" />

To have more information in the logs, you can change the log level of crawler events by setting DEBUG on as many entries as you want in the log4j.properties file. Look for lines starting with log4j.logger.CrawlerEvent.
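
For example (the event name here is just an illustration; use whichever entries are already listed in that file):

    log4j.logger.CrawlerEvent.REJECTED_ROBOTS_TXT=DEBUG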

shreya-singh-tech commented 7 years ago

It seems that was the problem. Pages are getting extracted now, but the problem is that only Google pages are being generated. This is one page among all the ones that have been collected: 1500017608331000000-add.txt

My configuration is this:

  <?xml version="1.0" encoding="UTF-8"?>
  <httpcollector id="doc collector">
    <!-- Decide where to store generated files. -->
    <progressDir>./docs-col/output/progress</progressDir>
    <logsDir>./docs-col/output/logs</logsDir>
    <crawlers>
    <crawler id="extract documents">

      <!-- Requires at least one start URL (or urlsFile). 
           Optionally limit crawling to same protocol/domain/port as 
           start URLs. -->
      <startURLs>
        <url>https://www.google.co.in/search?q=apple</url>
      </startURLs>

      <!-- === Recommendations: ============================================ -->

      <!-- Specify a crawler default directory where to generate files. -->
      <workDir>./docs-col/output</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>2</maxDepth>
      <numThreads>2</numThreads>
       <robotsTxt ignore="true" />

      <!-- We know we don't want to crawl the entire site, so ignore sitemap. -->
      <sitemapResolverFactory ignore="true" />

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="5000" />

    <referenceFilters>
            <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
            .*/login/.*
     </filter>

        </referenceFilters>

    <metadataFetcher class="com.norconex.collector.http.fetch.impl.GenericMetadataFetcher" /> 

    <metadataFilters>
             <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
            onMatch="exclude">
            .*/login/.*
     </filter>
        </metadataFilters>

        <documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher"
    detectContentType="true" detectCharset="true"/>

    <linkExtractors>
        <extractor class="com.norconex.collector.http.url.impl.GenericLinkExtractor">
        <tags>
        <tag name="a" attribute="href" />
        <tag name="frame" attribute="src" />
        <tag name="iframe" attribute="src" />
        <tag name="meta" attribute="http-equiv" />
        <tag name="script" attribute="src" />
        </tags>
</extractor>
    </linkExtractors>

    <!-- Document importing -->
      <importer>
    <preParseHandlers>
    <!-- These tags can be mixed, in the desired order of execution. -->
    <tagger class="com.norconex.importer.handler.tagger.impl.CharacterCaseTagger">
    <characterCase fieldName="title" type="lower" applyTo="field" />
     <characterCase fieldName="title" type="string" applyTo="value" />
    </tagger>

     <filter class="com.norconex.importer.handler.filter.impl.RegexContentFilter"
      onMatch="include" >
     <regex>.*apple.*</regex>
    </filter>
    </preParseHandlers> 
        <postParseHandlers>
          <!-- If your target repository does not support arbitrary fields,
               make sure you only keep the fields you need. -->
          <tagger class="com.norconex.importer.handler.tagger.impl.KeepOnlyTagger">
            <fields>title,keywords,description,document.reference</fields>
          </tagger>
        </postParseHandlers>
      </importer> 

      <!-- Decide what to do with your files by specifying a Committer. -->
      <committer class="com.norconex.committer.core.impl.FileSystemCommitter">
        <directory>./docs-col/output</directory>
      </committer>

    </crawler>
  </crawlers>

</httpcollector>

Why aren't the pages which come up in the search results for apple being crawled?

essiembre commented 7 years ago

Even if you let it run for some time? Because when I try it I eventually get other pages.

You can add reference filters matching pages you do not want to follow. Just be careful not to exclude your start URL.

Also, if you want to exclude pages only after they have been downloaded and their links have been extracted, you can rely on document filters instead. Like this:

<documentFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="exclude">
      .*google.*
  </filter> 
</documentFilters>

This flow diagram may help you understand better what happens when.