Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.
https://opensource.norconex.com/crawlers
Apache License 2.0

FileNotFoundException on mapdb.t #21

Closed Nycander closed 10 years ago

Nycander commented 10 years ago

I'm running on

This exception (see below) occurs quite frequently in the logs. It seems to happen fairly randomly, and I'm not sure how to reproduce it.

    java.io.IOError: java.io.FileNotFoundException: ..\crawler-output\XYZ\crawldb\XYZ\mapdb.t (Access is denied)
            at org.mapdb.Volume$FileChannelVol.<init>(Volume.java:671)
            at org.mapdb.Volume.volumeForFile(Volume.java:183)
            at org.mapdb.Volume$1.createTransLogVolume(Volume.java:218)
            at org.mapdb.StoreWAL.openLogIfNeeded(StoreWAL.java:108)
            at org.mapdb.StoreWAL.put(StoreWAL.java:215)
            at org.mapdb.Caches$WeakSoftRef.put(Caches.java:429)
            at org.mapdb.Queues$Queue.add(Queues.java:373)
            at com.norconex.collector.http.db.impl.mapdb.MappedQueue.add(MappedQueue.java:157)
            at com.norconex.collector.http.db.impl.mapdb.MappedQueue.add(MappedQueue.java:1)
            at com.norconex.collector.http.db.impl.mapdb.MapDBCrawlURLDatabase.queue(MapDBCrawlURLDatabase.java:146)
            at com.norconex.collector.http.crawler.HttpCrawler.deleteCacheOrphans(HttpCrawler.java:255)
            at com.norconex.collector.http.crawler.HttpCrawler.handleOrphans(HttpCrawler.java:221)
            at com.norconex.collector.http.crawler.HttpCrawler.execute(HttpCrawler.java:173)
            at com.norconex.collector.http.crawler.HttpCrawler.startExecution(HttpCrawler.java:147)
            at com.norconex.jef.AbstractResumableJob.execute(AbstractResumableJob.java:52)
            at com.norconex.jef.JobRunner.runJob(JobRunner.java:193)
            at com.norconex.jef.JobRunner.runSuite(JobRunner.java:94)
            at com.norconex.collector.http.HttpCollector.crawl(HttpCollector.java:198)
            at com.norconex.collector.http.HttpCollector.main(HttpCollector.java:165)
    Caused by: java.io.FileNotFoundException: ..\crawler-output\XYZ\crawldb\XYZ\mapdb.t  (Access is denied)
            at java.io.RandomAccessFile.open(Native Method)
            at java.io.RandomAccessFile.<init>(Unknown Source)
            at org.mapdb.Volume$FileChannelVol.<init>(Volume.java:668)
            ... 18 more     

Here's my main configuration:

<httpcollector id="${name}">

  <!-- Decide where to store generated files. -->
  <progressDir>../crawler-output/${name}/progress</progressDir>
  <logsDir>../crawler-output/${name}/logs</logsDir>

  <crawlers>
    <crawler id="${name}">
      <startURLs>
        <url>${sitemap}</url>
      </startURLs>

      <sitemap ignore="false" class="com.norconex.collector.http.sitemap.impl.DefaultSitemapResolver">
        <location>${sitemap}</location>
      </sitemap>

      <!-- Where the crawler default directory to generate files is. -->
      <workDir>../crawler-output/${name}/</workDir>

      <!-- Put a maximum depth to avoid infinite crawling (e.g. calendars). -->
      <maxDepth>0</maxDepth>

      <!-- Remove pages that are no longer linked to by the sitemap -->
      <deleteOrphans>true</deleteOrphans>

      <!-- Be as nice as you can to sites you crawl. -->
      <delay default="100" />

      <committer class="com.norconex.committer.solr.SolrCommitter">
        <solrURL>http://localhost:8984/solr/sites</solrURL>
        <batchSize>100</batchSize>
        <solrBatchSize>100</solrBatchSize>
        <contentTargetField>content_${language}</contentTargetField>
      </committer>
    </crawler>
  </crawlers>
</httpcollector>

Nycander commented 10 years ago

As a workaround, I'm using the Derby link database implementation. It seems to be working fine.

essiembre commented 10 years ago

Thanks for reporting this. I am glad the Derby implementation works fine for you. The downside is performance on large sites: Derby does not scale as well as MapDB. But if your site is small enough, it may not make a big difference.

It turns out the exception is caused by a bug in the MapDB library used by the HTTP Collector (described here: https://github.com/jankotek/MapDB/issues/274).

My recommendation is to replace the existing lib/mapdb-0.9.8.jar with the latest version here: http://search.maven.org/remotecontent?filepath=org/mapdb/mapdb/0.9.9/mapdb-0.9.9.jar

Let me know if the latest version of that library fixes the problem.
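After swapping the jar, it can help to confirm which file the JVM actually loads MapDB from. The following is a minimal sketch, not part of Norconex or MapDB: the `WhichJar` class and its `locate` method are hypothetical names, and `org.mapdb.StoreWAL` is used only because it appears in the stack trace above. It needs to run with the collector's `lib` directory on the classpath to report anything useful.

```java
import java.security.CodeSource;

public class WhichJar {

    // Hypothetical helper: reports where a class was loaded from.
    static String locate(String className) {
        try {
            // Load without initializing the class; we only need its origin.
            Class<?> c = Class.forName(
                    className, false, WhichJar.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            if (src == null || src.getLocation() == null) {
                // JDK core classes typically have no code source.
                return "(bootstrap or unknown location)";
            }
            return src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in this ticket.
        System.out.println(locate("org.mapdb.StoreWAL"));
    }
}
```

If the printed location still ends in `mapdb-0.9.8.jar`, the old jar is still being picked up.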

essiembre commented 10 years ago

Another suggestion is to try increasing your delay. From that same MapDB ticket above:

windows might file locks a few miliseconds after file was closed, 
so we need an loop which would retry to open file for lets say 500ms. 

Since file locking has been worked on in MapDB 0.9.9, I hope it will be sufficient for you to simply upgrade that lib.
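The retry idea quoted from the MapDB ticket can be sketched roughly as follows. This is an illustration only, not MapDB's actual fix: `RetryOpen`, `openWithRetry`, and the pause value are assumptions made for the example, and the 500 ms budget simply mirrors the figure mentioned in the quote.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;

public class RetryOpen {

    // Hypothetical helper for illustration; not part of MapDB or Norconex.
    // Keeps trying to open the file until it succeeds or the timeout elapses.
    static RandomAccessFile openWithRetry(
            File file, String mode, long timeoutMillis, long pauseMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            try {
                return new RandomAccessFile(file, mode);
            } catch (FileNotFoundException e) {
                // As in the stack trace above, "Access is denied" on Windows
                // surfaces as a FileNotFoundException.
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // Out of time: propagate the last failure.
                }
                Thread.sleep(pauseMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("mapdb", ".t");
        f.deleteOnExit();
        // Retry for up to 500 ms, pausing 20 ms between attempts.
        try (RandomAccessFile raf = openWithRetry(f, "rw", 500, 20)) {
            raf.writeInt(42);
            System.out.println("opened, length = " + raf.length());
        }
    }
}
```

A bounded retry like this only papers over a lock that is released shortly after close; a file that stays locked still fails once the timeout elapses.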

jankotek commented 10 years ago

Just to let you know, MapDB 0.9.10 was released. Some users reported unreleased file locks on Windows. That should be solved now.

essiembre commented 10 years ago

Awesome! Thanks for the update. Our next release will include 0.9.10. Nycander, can you give it a try and report if it solves your issue?

Nycander commented 10 years ago

Sorry for the delay, but I had to prioritise other things in the project I'm working on.

But now I've had some time to test this out. I've dropped in MapDB 0.9.10 and it seems to be working :)

My hope is that MapDB will have better disk performance than Derby.

essiembre commented 10 years ago

Great, thanks for the feedback. MapDB is much faster than Derby, and the more documents you attempt to crawl, the more of a difference you should see. I'll close this ticket when the next release of Norconex HTTP Collector is out (should be this week).

Nycander commented 10 years ago

I only crawl about 1000 documents, but the hardware is a really bad SAN.

Initial findings seem to indicate that disk performance is much better using MapDB :+1:

jankotek commented 10 years ago

Hi,

There will be a new MapDB release, 0.9.11, which changes file handling a lot.

essiembre commented 10 years ago

Thanks for the info @jankotek

Norconex HTTP Collector 1.3 is now out with MapDB 0.9.10.

@Nycander I am closing this ticket, but if this issue arises again please open a new one and we shall release a patched version that includes MapDB 0.9.11 (or newer).