Closed: jmrichardson closed this issue 6 years ago.
I tried to replicate the starting/stopping of the crawler on a dev server with a smaller set of files to index and was not able to reproduce the issue. I tried "ctrl-c" as well as just shutting down the session. Each time, the crawler correctly labeled each file as already indexed. I am not sure what the root cause could be.
Hi, the good news is that this time, on another crawl attempt, all the files were recognized as already indexed. I am not sure what caused the issue above. The bad news is that the scan only went from 1 day 14 hours down to 1 day 8 hours. I was thinking that perhaps this was a database contention issue (50 threads running), but I believe H2 has multi-threaded support?
I did a bit of limited research, and it appears that H2 may have some performance issues with multi-threading and small retrieval sizes. From an article comparing H2 and Derby performance:
A huge performance gap has been detected when using multithreading with small retrieval size. Comparing the normalized speed of Hibernate with H2 database server (1.1) to the normalized speed of Hibernate with Derby database server (36.2) reveals that in that case, Hibernate with Derby server is 32.9 times faster than Hibernate with H2 server.
With this in mind, I tried changing the database to Derby using this documentation, but it looks like it is deprecated, and I ran into an error trying to run it.
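For reference, the configuration I was attempting looked roughly like the snippet below. The <database> element is my reading of the BasicJDBCCrawlDataStoreFactory documentation, so treat it as an assumption rather than a verified example:
<crawlDataStoreFactory
    class="com.norconex.collector.core.data.store.impl.jdbc.BasicJDBCCrawlDataStoreFactory">
  <database>derby</database>
</crawlDataStoreFactory>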
Any ideas to improve performance? Thank you.
Another update (BTW, I think this may belong under the filesystem crawler instead of the importer, but I am not sure how to move it):
On a smaller directory previously crawled with standard H2 (no crawlDataStoreFactory specified), I installed MongoDB to see if that would help. The same folder took 9:52 to index (fresh index), while the re-index took 1:00. I am not sure why the original index took over 9 minutes, but the re-index was much better. I am kicking off a fresh crawl of all 4.7M documents, which should take a bit less than a couple of days. I am hoping the initial index is not significantly slower than with H2, but I am mostly interested in seeing the performance gain on re-index.
I will post back with the results.
I am assuming you are talking about the "crawl store"? If so, what about using the default (MVStore) implementation? Did you get performance issues there?
Using an embedded SQL database like H2 is fine for relatively small sites, but on large ones, you will start to see performance degradation for sure.
What kind of server do you have? Do you monitor the server resources when you run it with over 50 threads? How does it behave? 50 threads seems like a lot. How many CPUs do you have? If you specify more threads than CPUs and all or many of them are busy, are you maxing out your CPUs? If so, performance will definitely be impacted. Also, do you have lots of RAM? Parsing files can be CPU- and RAM-intensive, so if too many threads end up taking too much memory, it may cause disk swapping as well. Speaking of disks, if you have 50 threads writing heavily to disk, plus your H2 database... do you have an ultra-fast disk?
I would recommend lowering your number of threads to see if you get better performance.
Also, if you are maxing your server capacity and speed is essential, I would recommend you split your crawling into a few Collector instances running on different servers.
As for files not being picked up properly when you rerun it after a crash/stop, it is likely because you relaunched it with "start". When doing so, it starts from the beginning, but the "cache" from the previous run was incomplete, which can cause issues. Whenever a crawl stops before completion, you are best to relaunch it with "resume" so it picks up exactly where it left off.
Hi, thanks for the reply. My testing prior to using Mongo was with the default MVStore. That is, I did not have anything like the following configured:
<crawlDataStoreFactory
class="com.norconex.collector.core.data.store.impl.jdbc.BasicJDBCCrawlDataStoreFactory"/>
I am currently still running a full index with MongoDB as the backend. It is definitely taking significantly more time to index. Looking at the Mongo log, which records slow queries (anything over 100ms), it eventually did start showing slow queries (I assume because the collection began to grow). Here is a snip from toward the beginning of the log:
2017-12-22T20:03:57.856-0500 I COMMAND [conn213] command norconex.references command: count { count: "references", query: { stage: "QUEUED" } } planSummary: IXSCAN { stage: 1 } keyUpdates:0 writeConflicts:0 numYields:302 reslen:62 locks:{ Global: { acquireCount: { r: 606 } }, Database: { acquireCount: { r: 303 } }, Collection: { acquireCount: { r: 303 } } } protocol:op_query 103ms
2017-12-22T20:03:57.857-0500 I COMMAND [conn195] command norconex.references command: count { count: "references", query: { stage: "QUEUED" } } planSummary: IXSCAN { stage: 1 } keyUpdates:0 writeConflicts:0 numYields:302 reslen:62 locks:{ Global: { acquireCount: { r: 606 } }, Database: { acquireCount: { r: 303 } }, Collection: { acquireCount: { r: 303 } } } protocol:op_query 106ms
Now, here is a snip at the end of the file (still growing):
2017-12-26T15:01:11.682-0500 I COMMAND [conn270] command clients.references command: count { count: "references", query: { stage: "QUEUED" } } planSummary: IXSCAN { stage: 1 } keyUpdates:0 writeConflicts:0 numYields:10830 reslen:62 locks:{ Global: { acquireCount: { r: 21662 } }, Database: { acquireCount: { r: 10831 } }, Collection: { acquireCount: { r: 10831 } } } protocol:op_query 7055ms
2017-12-26T15:01:11.807-0500 I COMMAND [conn236] command clients.references command: count { count: "references", query: { stage: "QUEUED" } } planSummary: IXSCAN { stage: 1 } keyUpdates:0 writeConflicts:0 numYields:10830 reslen:62 locks:{ Global: { acquireCount: { r: 21662 } }, Database: { acquireCount: { r: 10831 } }, Collection: { acquireCount: { r: 10831 } } } protocol:op_query 7357ms
As you can see, the query time has gradually grown from around 100ms to around 7 to 8 seconds per query. I have verified that the collection does have an index on stage, so I don't think there is much else we can do to improve the performance of the query itself. Almost all of the 1.8M entries in the log are from this count query. Is there any way to avoid it?
The server has 2 Intel processors, each with 14 cores (28 threads), 128GB of RAM, and 1.2TB of storage made up of RAID 5 300GB drives, so memory and disk I/O are as fast as possible. I was hoping to maximize ingest by assigning as many threads as possible to it, since it is I/O intensive. The server remains responsive to commands and other operations while ingest is running, and the OS shows threads available even with 50 threads in Norconex, but it is definitely worthwhile to lower the thread count and see if that helps or resolves any contention (context switching). At the moment everything is running on one server for this POC, but we will be adding servers to the cluster later to run multiple ingest jobs.
I am wondering whether it is worthwhile to continue down the Mongo path. My hope was that even if the initial ingest takes considerably longer (or even if we can make it faster later, perhaps by changing or removing the count query), the re-crawl would be significantly faster with Mongo. That is, if multi-threading/blocking is an issue for MVStore or H2, maybe Mongo can handle it better. It has been running for several days... Would it be helpful to share the results of the Mongo re-crawl for a baseline comparison once it finishes?
Thank you for the clarification on the start/resume operation. That makes sense now :)
MVStore should not have multi-threading issues, but it resides on the same JVM/machine, so it has its limitations. If Mongo is used in a distributed way, it may improve performance, but otherwise you may simply be adding a network layer.
My recommendation would be to try to split your collector into smaller instances, each crawling a subset of your data. It is also easier to troubleshoot that way (e.g. you can fix/restart a subset without affecting the rest).
Updates are generally faster, but several factors can influence this. For instance, if you are using a network file system and the network is your bottleneck, you may not perceive much improvement, since the crawler still has to go over the network for each document to find out whether the file has changed (modified date, length, etc.). It should definitely mean less traffic to your target repository, though (fewer calls to your committer).
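To illustrate the metadata check (this is only a rough sketch, not the collector's actual code), an incremental re-crawl boils down to a comparison like the one below, and on a network file system each of those metadata calls is still a remote round trip:
import java.io.File;

public class ChangeCheckSketch {
    // Hypothetical helper: decide whether a file can be skipped using
    // cached metadata only, without re-reading its content.
    static boolean looksUnmodified(File file, long cachedLastModified, long cachedLength) {
        // On a network file system, lastModified() and length() each trigger
        // a remote stat call, so the network can dominate re-crawl time.
        return file.lastModified() == cachedLastModified
                && file.length() == cachedLength;
    }
}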
I had a chance to take another look at the performance issues we were having on ingest. I was not able to use Mongo as-is because the initial ingest of 5M files took 6+ days. However, after profiling Mongo, it turned out the majority of the time was spent counting references in the "QUEUED" stage (as mentioned above).
I modified the getReferencesCount function in the Mongo crawl data store to:
protected int getReferencesCount(IMongoSerializer.Stage stage) {
    if (stage == IMongoSerializer.Stage.QUEUED) {
        // We only need to know whether at least one QUEUED reference
        // exists, so cap the count at 1 instead of counting them all.
        return (int) collRefs.count(
                eq(IMongoSerializer.FIELD_STAGE, stage.name()),
                new CountOptions().limit(1));
    } else {
        return (int) collRefs.count(
                eq(IMongoSerializer.FIELD_STAGE, stage.name()));
    }
}
Or, the other option would just be to modify lines 233-236 from:
@Override
public boolean isQueueEmpty() {
    return getQueueSize() == 0;
}
to something like:
@Override
public boolean isQueueEmpty() {
    int count = (int) collRefs.count(
            eq(IMongoSerializer.FIELD_STAGE, Stage.QUEUED.name()),
            new CountOptions().limit(1));
    return count == 0;
}
This avoids having to count all records and limits the count to just one. When ingesting 5M files, the full count could take over 8 seconds each time once the number of records grew large, while with the code update the count took less than 100ms. After recompiling the Collector Core jar, the initial ingest went from 6+ days to around 21 hours.
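If anyone wants to reproduce the difference outside the collector, here is a minimal sketch using the MongoDB 3.x Java driver. It assumes a local MongoDB holding the norconex database, the references collection, and the index on stage shown in the logs above:
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.CountOptions;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class CountTimingSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> refs =
                client.getDatabase("norconex").getCollection("references");

        long t0 = System.nanoTime();
        // Full count: walks every matching index entry, slow on a large queue.
        long full = refs.count(eq("stage", "QUEUED"));
        long t1 = System.nanoTime();
        // Limited count: stops after the first match, enough to test emptiness.
        long limited = refs.count(eq("stage", "QUEUED"), new CountOptions().limit(1));
        long t2 = System.nanoTime();

        System.out.printf("full count=%d in %d ms%n", full, (t1 - t0) / 1_000_000);
        System.out.printf("limited count=%d in %d ms%n", limited, (t2 - t1) / 1_000_000);
        client.close();
    }
}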
Mongo has always been much faster than the default MVStore at re-ingest (just checking metadata) with multiple threads. The best time I got re-ingesting with the default MVStore, after testing with both single and multiple threads, was around 36 hours. With Mongo, re-ingest took 3.5 hours.
Wanted to pass the code change along for your consideration as it may help improve performance for others.
Huge improvements! Thanks for sharing. I will definitely add your suggested changes in an upcoming release.
Your second option has been applied to the latest Collector Core snapshot. Please confirm.
Works great! Thank you :)
Hi, I have ~4.7M files already indexed and re-ran the crawler to see how long a second crawl would take. The first (initial) crawl took 1 day 10 hours. I started the second attempt last night and it is still running. When I started the crawler with "-a start", it detected a previous run and, after a bit of time, started showing files as already indexed with the "REJECTED_UNMODIFIED" tag in the log file. However, after about 8% of the latest log file (line 1070927 of 11906185), files are now being committed with "DOCUMENT_COMMITTED_ADD" even though they were indexed in the first run. None of the files since then have been marked "REJECTED_UNMODIFIED". The only thing that may have been an issue is that the second crawl was stopped prematurely due to a server relocation during its run (and restarted with the "start" parameter).
After this crawl completes (sometime tomorrow), I will re-run it and make sure it does not get stopped prematurely. However, I was expecting the previously indexed files to remain in the database. Is there any way to inspect the database or do more troubleshooting to understand why this happened?
Thanks for your help,
John