devbridge / BetterCMS

A publishing focused and developer friendly .NET Open Source CMS.
GNU Lesser General Public License v3.0

Lucene search module: needs a way to prevent indexing of pages (like 404, 500, etc.) #1261

Closed ghost closed 8 years ago

JuliusSenkus commented 9 years ago

Finished. Added another configuration field, "LuceneExcludedPages", which works like "LuceneExcludedClasses", "LuceneExcludedIds", or "LuceneExcludedNodes". Values can be absolute URLs such as "http://bettercms.sandbox.mvc4.local/,http://bettercms.sandbox.mvc4.local/500/", relative paths such as "/,/500/", or a mix of both.
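As a rough illustration only: assuming the new key is set in the same `<add key="…" value="…" />` list as the other Lucene settings quoted later in this thread, an entry using the relative-path example from the comment above might look like the sketch below. The key name and the value come from that comment; the placement alongside the other keys is an assumption, not something the thread confirms.

```xml
<!-- Sketch, not a confirmed configuration: excludes the home page ("/") and the
     500 error page ("/500/"). Assumes a comma-separated value, as in the
     examples given in the comment above. -->
<add key="LuceneExcludedPages" value="/,/500/" />
```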

JuliusSenkus commented 9 years ago

The fix is available in 1.10.4-beta7 and later.

daivabrazukaite commented 8 years ago

Currently, indexing of newly added pages is not working at all. Info from the logs after restart:

2015-10-07 12:32:34.4595 Starting Lucene Content Indexing Robot.

2015-10-07 12:32:34.4595 Starting Lucene Index Source Watcher.

2015-10-07 12:32:38.7876 Lucene Index Source Watcher finished looking for new sources.

2015-10-07 12:32:48.0384 Failed to delete write lock file 'write.lock' in directory 'D:\home\site\wwwroot../../Lucene.BetterCms'. The process cannot access the file 'D:\home\Lucene.BetterCms\write.lock' because it is being used by another process.
System.IO.IOException: The process cannot access the file 'D:\home\Lucene.BetterCms\write.lock' because it is being used by another process.
   at System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath)
   at System.IO.File.InternalDelete(String path, Boolean checkHost)
   at System.IO.File.Delete(String path)
   at BetterCMS.Module.LuceneSearch.Services.IndexerService.DefaultIndexerService.CleanLock()

2015-10-07 12:32:48.1165 Starting Lucene Index Source Watcher.

2015-10-07 12:32:48.1946 Starting Lucene Content Indexing Robot.

2015-10-07 12:32:51.5472 Failed to open Lucene index writer. Write lock file is locked.

2015-10-07 12:32:51.5472 Lucene Content Indexing Robot cannot continue. Failed to open writer.

2015-10-07 12:32:54.3441 Lucene Index Source Watcher finished looking for new sources.

2015-10-07 12:32:56.9118 Lucene web crawler: Failed to authenticate user. The remote server returned an error: (403) Forbidden.
System.Net.WebException: The remote server returned an error: (403) Forbidden.
   at System.Net.HttpWebRequest.GetResponse()
   at BetterCMS.Module.LuceneSearch.Services.WebCrawlerService.DefaultWebCrawlerService.TryAuthenticate()

2015-10-07 12:32:56.9274 Lucene web crawler: Failed to fetch page by url /test-page-1007-2/. The remote server returned an error: (403) Forbidden.
System.Net.WebException: The remote server returned an error: (403) Forbidden.
   at System.Net.HttpWebRequest.GetResponse()
   at BetterCMS.Module.LuceneSearch.Services.WebCrawlerService.DefaultWebCrawlerService.FetchPage(String url)

2015-10-07 12:32:57.0993 Lucene Content Indexing Robot finished indexing.

Config content:

<add key="LuceneWebSiteUrl" value="http://bettercmsdemo.devbstaging.com/" />
<add key="LuceneFileSystemDirectory" value="../../Lucene.BetterCms" />
<add key="LuceneIndexerFrequency" value="00:05:00" />
<add key="LucenePagesWatcherFrequency" value="00:05:00" />
<add key="LuceneMaxPagesPerQuery" value="10000" />
<add key="LucenePageExpireTimeout" value="00:01:00" />
<add key="LuceneDisableStopWords" value="true" />
<add key="LuceneSearchForPartOfWords" value="true" />
<add key="LuceneIndexPrivatePages" value="true" />
<add key="LuceneAuthorizationMode" value="Forms" />
<add key="LuceneAuthorizationUrl" value="http://bettercmsdemo.devbstaging.com/login" />
<add key="LuceneAuthorizationForm.UserName" value="admin" />
<add key="LuceneAuthorizationForm.Password" value="admin" />
<add key="LuceneAuthorizationForm.RememberMe" value="true" />
<add key="LuceneIndexerDeleteLockFileOnStart" value="true" />

Audrunas commented 8 years ago

Fixed. The problem was the IP address restrictions on staging. I've added both load-balanced IPs, and search indexing works again.

daivabrazukaite commented 8 years ago

It is still the same. Please check once more.

ghost commented 8 years ago

IP issue fixed.

daivabrazukaite commented 8 years ago

A page with the URL /test-page-1125-4/ was created, and then the config file was updated to contain the row:

After the search watcher ran, a row for this page was created in bcms_lucene.IndexSources, so the page was indexed even though it should not have been. (P.S.: the record in bcms_lucene.IndexSources was created after the config update, not before.) The page was not included in the search results when searching for it. If this is the intended behavior, please add a note, because from the task description it looks like the indexing itself should not be done.

ghost commented 8 years ago

Yep, the indexer itself will still index such a page, but the page will not show up in search results. This makes it possible to hide pages that are already indexed.