episerver / content-search-lucene

Apache License 2.0

Indexing performance? #29

Open ellinge opened 4 months ago

ellinge commented 4 months ago

I'm wondering if there are any known performance issues with the dockerized Web API solution compared to the old .NET Framework WCF solution. I've tried to index a fairly large site, which takes about an hour with CMS 11, but I have not yet managed to finish an indexing run with Docker. I've also forked the solution and built an internal NuGet version that runs inside the site, like the old packages, but that is just as slow. I've noticed that the virus scanner / Windows Defender is kept quite busy, since the indexing causes a lot of IO traffic. But is that more than with the old solution? I've also noticed that the Lucene packages are still in beta, so perhaps it's an underlying issue. I'm not sure how to benchmark the two to show where any bottlenecks have been introduced.

ellinge commented 1 month ago

Found a difference now: previously the scheduled job put the items on a queue instead of indexing them directly. There's still an in-memory queue now, but the indexing is then performed directly in the scheduled job, so it takes a very long time to finish (if ever).

I will try to see how long the actual indexing takes in the old env vs the new to see if there seems to be some bottleneck in the new approach.

ellinge commented 3 days ago

We seem to have found some optimization paths (our indexing now takes about 11% of the time compared to the unoptimized code). We've forked the repo to a local Azure DevOps server so we can tweak some aspects to be closer to the CMS 11 experience. We need the Lucene index since we have a lot of items that must be searchable in edit mode but are not part of our Search & Navigation index. Our solutions are currently hosted on premise.

These changes made it more robust, since a stopped site/scheduled job doesn't lose the indexing queue. But processing of the queue/indexing was still a lot slower than in CMS 11.

Here's what we did to solve it, and even improve on CMS 11. When profiling the indexing I noticed that a lot of the time was spent disposing IndexWriter instances. According to the documentation ("Class IndexWriter | Apache Lucene.NET 4.8.0-beta00009 Documentation"), an IndexWriter should be reusable (and thread-safe), and disposing it can be costly. Several IndexWriters are created for each item in the batch, so one batch disposes a lot of them. Since the writer puts a lock on the folder, one cannot keep it around forever, but during the batch processing of items to index (default 50 items) one can make sure that the same instance (per directory) is reused.

We created an IndexingScope that reuses instances of IndexWriter and IndexSearcher. It implements IDisposable and makes sure to dispose the instances after use. When reusing, one needs to commit after some operations. We also changed the remove/add steps of an Update to just call UpdateDocument in Lucene instead, and we don't call Commit in that case.

This made my local indexing (starting from a cleared indexing folder) go from 4.5 hours to about 30 minutes (and about 30 minutes for queueing / the job, so indexing is basically as fast as the items are queued). In CMS 11 it took about 3 hours for a similar database backup.

    indexingScope.UseIndexWriter(namedIndex.Directory, writer =>
    {
        if (isUpdate)
        {
            writer.UpdateDocument(new Term(IndexingServiceSettings.IdFieldName, itemId), doc);
        }
        else
        {
            writer.AddDocument(doc);
        }
    }, doCommitAfterUseOnReuse: !isUpdate);
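For readers who want to see the shape of such a scope, here is a minimal sketch of the reuse idea. It is not the code from our fork: `WriterScope<TWriter>`, `UseWriter` and the `open`/`commit` delegates are hypothetical names, and the type is generic so that it compiles without a Lucene.NET reference. With Lucene.NET, `TWriter` would be `IndexWriter`, `open` would create the writer for a directory and `commit` would call `writer.Commit()`. Committing once more on dispose is an assumption of this sketch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: keeps one writer per directory key alive for the
// lifetime of the scope instead of opening and disposing one per operation.
public sealed class WriterScope<TWriter> : IDisposable where TWriter : IDisposable
{
    private readonly Dictionary<string, TWriter> _writers = new Dictionary<string, TWriter>();
    private readonly Func<string, TWriter> _open;   // opens a writer for a directory key
    private readonly Action<TWriter> _commit;       // flushes pending changes
    private readonly bool _reuse;

    public WriterScope(Func<string, TWriter> open, Action<TWriter> commit, bool reuseWriterInstances)
    {
        _open = open;
        _commit = commit;
        _reuse = reuseWriterInstances;
    }

    public void UseWriter(string directoryKey, Action<TWriter> action, bool doCommitAfterUseOnReuse = false)
    {
        if (!_reuse)
        {
            // Unoptimized behaviour: open, use, commit and dispose on every call.
            using (var writer = _open(directoryKey))
            {
                action(writer);
                _commit(writer);
            }
            return;
        }

        // Reuse: open the writer for this directory only on first use.
        if (!_writers.TryGetValue(directoryKey, out var cached))
        {
            cached = _open(directoryKey);
            _writers[directoryKey] = cached;
        }

        action(cached);
        if (doCommitAfterUseOnReuse)
        {
            _commit(cached);
        }
    }

    public void Dispose()
    {
        // Assumption: commit once at the end, then release each writer
        // (and with it the lock on the index folder).
        foreach (var writer in _writers.Values)
        {
            try { _commit(writer); }
            finally { writer.Dispose(); }
        }
        _writers.Clear();
    }
}
```

The scope should live only as long as one batch, since every cached writer keeps its folder locked until the scope is disposed.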

We have checked with Luke (Lucene Index Toolbox) and queried in CMS, and the index seems fine when comparing a before/after index folder.

Before optimization, folder in Luke (cleared before indexing): [image: lukebeforeopti]

After optimization, folder in Luke (cleared before indexing): [image: lukeafteropti]

It's not optimally implemented though. ILuceneHelper is registered as a singleton now, so I simply passed the IndexingScope down to each method used there to be able to reuse it. It would be better to have the IndexingScope created per call/transient to the service instead, perhaps by having a factory or a static Current property of some kind which creates a new scope per HTTP request.
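One way the static Current property idea could look is sketched below. This is only an illustration under assumptions, not something that exists in the repo or in our fork: `AmbientScope`, `Begin` and `Current` are hypothetical names, and the sketch relies on `System.Threading.AsyncLocal<T>` so that each request's async flow sees its own scope.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: an ambient scope that a singleton helper can read
// via AmbientScope.Current instead of receiving it as a method parameter.
public sealed class AmbientScope : IDisposable
{
    // AsyncLocal keeps the value per async flow, so parallel requests
    // do not see each other's scope.
    private static readonly AsyncLocal<AmbientScope> _current = new AsyncLocal<AmbientScope>();

    private readonly AmbientScope _previous;

    private AmbientScope()
    {
        _previous = _current.Value;   // remember any outer scope
        _current.Value = this;
    }

    // Null when no scope is active.
    public static AmbientScope Current => _current.Value;

    public static AmbientScope Begin() => new AmbientScope();

    public void Dispose()
    {
        // Restore the outer scope (or null) when this one ends.
        _current.Value = _previous;
    }
}
```

With something like this, UpdateIndex would wrap the batch in `using (AmbientScope.Begin())` and the helper would look up `AmbientScope.Current`. The version we actually run passes the scope explicitly: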

        /// <summary>
        /// Updates the Lucene index from the passed syndication feed
        /// </summary>
        /// <param name="feed">The feed to process</param>
        public void UpdateIndex(FeedModel feed)
        {
            _logger.LogDebug(string.Format("Start processing feed '{0}'", feed.Id));

            using var indexingScope = new IndexingScope(reuseWriterInstances: true);

            foreach (var item in feed.Items)
...
                        // If no callback data uri is defined, we handle the item in the current request thread
                        switch (indexAction)
                        {
                            case "add":
                                _luceneHelper.Add(item, namedIndex, indexingScope);
                                break;

                            case "update":
                                _luceneHelper.Update(item, namedIndex, indexingScope);
                                break;

                            case "remove":
                                _luceneHelper.Remove(item, namedIndex, indexingScope);
                                break;
                        }

                        // If this item is a reference item we need to update the parent document to
                        // reflect changes in the reference index. e.g. comments.
                        if (!string.IsNullOrEmpty(referenceId))
                        {
                            _luceneHelper.UpdateReference(
                                referenceId,
                                item.Id,
                                new NamedIndex(namedIndexName),
                                indexingScope); // Always main index

                            _logger.LogDebug(string.Format("Updated reference with reference id '{0}' ", referenceId));
                        }