We did encounter this as well. These two errors occur every time DT starts up. I was not able to fix this. First occurred after upgrading to 4.7.0 a few weeks ago.
2023-02-01 08:40:33,460 ERROR [IndexManager] The index vulnerability seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
@holdebeSICKAG Thanks for your input. Does this happen when restarting DT, or generally when starting (as in, a longer delay between stopping and starting the instance)?
Are you stopping the DT process gracefully (SIGTERM) or forcefully (SIGKILL)?
Thanks for replying @nscuro
It happens when restarting DT using compose. Also, I never had to forcefully shut down DT.
Observing the behavior for the past few days: the index error for vulnerability occurs only at start-up. DT first checks the index and assumes it is corrupt because it is not happy with the lock file; it then re-creates the index, but the same error occurs.
Furthermore, the error below occurs regularly throughout the day, every time a BOM is processed. So usually, I see these two log lines.
2023-02-02 07:38:45,714 INFO [BomUploadProcessingTask] Processed 233 components and 0 services uploaded to project d697145b-938e-4841-9395-46f121b6da34
2023-02-02 09:43:03,352 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-01T12:28:46.586660551Z, (lock=NativeFSLock(path=/data/.dependency-track/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T09:28:57.487377912Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
If I can give you more information, please tell me. If you want me to test or change something, please tell me too.
We see the org.apache.lucene.store.AlreadyClosedException directly after the upload of an SBOM.
2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Identified 0 new components
2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Processing CycloneDX dependency graph for project: 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [BomUploadProcessingTask] Processed 140 components and 0 services uploaded to project 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [RepositoryMetaAnalyzerTask] Performing component repository metadata analysis against 140 components
2023-02-03 07:07:20,819 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Thanks @holdebeSICKAG and @olafz!
So far it seems like the AlreadyClosedException is occurring exclusively on the servicecomponent index, can you confirm this?
Referring to this part of the exception here: path=/data/.dependency-track/index/servicecomponent/write.lock
So far it seems like the AlreadyClosedException is occurring exclusively on the servicecomponent index, can you confirm this?
The vast majority of the exceptions happen on path=/data/.dependency-track/index/servicecomponent/write.lock, however not exclusively. It also happens on path=/data/.dependency-track/index/vulnerability/write.lock, path=/data/.dependency-track/index/component/write.lock and path=/data/.dependency-track/index/cpe/write.lock.
Some examples (with context) below:
2023-02-02 00:29:16,997 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/vulnerability/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
at org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.<init>(Lucene80NormsConsumer.java:45)
at org.apache.lucene.codecs.lucene80.Lucene80NormsFormat.normsConsumer(Lucene80NormsFormat.java:85)
at org.apache.lucene.index.DefaultIndexingChain.writeNorms(DefaultIndexingChain.java:389)
at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:221)
at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:17,948 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/component/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:04.920111164Z))
at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
at org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:48)
at org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:55)
at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:452)
at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:488)
at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:80)
at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:44)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:58)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Just to rule this out:
Is there any service running on the system where the /data volume is mounted that could touch the index files?
Are you using a special file system for the mounted volume (network share, GlusterFS, etc.)?
Separately, DT performs a health check on each index when it starts, e.g.:
2023-02-03 10:29:12,540 INFO [IndexSubsystemInitializer] Building lucene indexes if required
2023-02-03 10:29:12,557 INFO [IndexManager] Checking the health of index PROJECT
2023-02-03 10:29:12,635 INFO [IndexManager] The index PROJECT is healthy
2023-02-03 10:29:12,636 INFO [IndexManager] Checking the health of index COMPONENT
2023-02-03 10:29:12,638 INFO [IndexManager] The index COMPONENT is healthy
2023-02-03 10:29:12,638 INFO [IndexManager] Checking the health of index SERVICECOMPONENT
2023-02-03 10:29:12,639 INFO [IndexManager] The index SERVICECOMPONENT is healthy
2023-02-03 10:29:12,639 INFO [IndexManager] Checking the health of index VULNERABILITY
2023-02-03 10:29:12,641 INFO [IndexManager] The index VULNERABILITY is healthy
2023-02-03 10:29:12,641 INFO [IndexManager] Checking the health of index LICENSE
2023-02-03 10:29:12,778 INFO [IndexManager] The index LICENSE is healthy
2023-02-03 10:29:12,778 INFO [IndexManager] Checking the health of index CPE
2023-02-03 10:29:12,779 INFO [IndexManager] The index CPE is healthy
2023-02-03 10:29:12,779 INFO [IndexManager] Checking the health of index VULNERABLESOFTWARE
2023-02-03 10:29:12,780 INFO [IndexManager] The index VULNERABLESOFTWARE is healthy
Are there any occurrences in your logs where the indexes were not healthy and had to be recreated?
Can you correlate the modification timestamps reported in the AlreadyClosedException to some unusual event in the logs? For the last stack trace you shared @olafz, it says the component index lock file was modified at 2023-02-02T00:29:15.592156452Z.
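For reference, the lock file timestamps can be read with a small standalone snippet like the one below. This is not DT code, just a way to compare a lock file's attributes against the times reported in the exception; the path is taken from the stack traces above and should be adjusted per index.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

public class LockFileTimestamps {
    public static void main(String[] args) throws Exception {
        // Path as reported in the exceptions above; adjust for other indexes.
        Path lock = Path.of("/data/.dependency-track/index/servicecomponent/write.lock");
        BasicFileAttributes attrs = Files.readAttributes(lock, BasicFileAttributes.class);
        System.out.println("creationTime:     " + attrs.creationTime());
        System.out.println("lastModifiedTime: " + attrs.lastModifiedTime());
    }
}
```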
I attached the log file entries from startup until DT is ready. Maybe it helps.
2023-02-01 09:28:33,667 INFO [EmbeddedJettyServer] alpine-executable-war v2.2.0 (050a7a2a-12f7-4d02-b7b0-cd6ec432eac6) built on: 2022-12-14T20:24:42Z
2023-02-01 09:28:36,627 INFO [Config] --------------------------------------------------------------------------------
2023-02-01 09:28:36,628 INFO [Config] OS Name: Linux
2023-02-01 09:28:36,629 INFO [Config] OS Version: 5.4.0-135-generic
2023-02-01 09:28:36,629 INFO [Config] OS Arch: amd64
2023-02-01 09:28:36,630 INFO [Config] CPU Cores: 14
2023-02-01 09:28:36,633 INFO [Config] Max Memory: 19.2 GB (20,617,625,600.0 bytes)
2023-02-01 09:28:36,633 INFO [Config] Java Vendor: Eclipse Adoptium
2023-02-01 09:28:36,634 INFO [Config] Java Version: 17.0.5+8
2023-02-01 09:28:36,634 INFO [Config] Java Home: /opt/java/openjdk
2023-02-01 09:28:36,635 INFO [Config] Java Temp: /tmp
2023-02-01 09:28:36,635 INFO [Config] User: dtrack
2023-02-01 09:28:36,635 INFO [Config] User Home: /data/
2023-02-01 09:28:36,635 INFO [Config] --------------------------------------------------------------------------------
2023-02-01 09:28:36,635 INFO [Config] Initializing Configuration
2023-02-01 09:28:36,636 INFO [Config] System property alpine.application.properties not specified
2023-02-01 09:28:36,636 INFO [Config] Loading application.properties from classpath
2023-02-01 09:28:36,642 INFO [Config] --------------------------------------------------------------------------------
2023-02-01 09:28:36,642 INFO [Config] Application: Dependency-Track
2023-02-01 09:28:36,642 INFO [Config] Version: 4.7.1
2023-02-01 09:28:36,642 INFO [Config] Built-on: 2023-01-31T21:46:01Z
2023-02-01 09:28:36,642 INFO [Config] --------------------------------------------------------------------------------
2023-02-01 09:28:36,643 INFO [Config] Framework: Alpine
2023-02-01 09:28:36,643 INFO [Config] Version : 2.2.0
2023-02-01 09:28:36,643 INFO [Config] Built-on: 2022-12-14T20:24:42Z
2023-02-01 09:28:36,643 INFO [Config] --------------------------------------------------------------------------------
2023-02-01 09:28:36,732 INFO [RequirementsVerifier] Initializing requirements verifier
2023-02-01 09:28:36,732 INFO [UpgradeInitializer] Initializing upgrade framework
2023-02-01 09:28:39,311 INFO [PersistenceManagerFactory] Initializing persistence framework
2023-02-01 09:28:39,312 INFO [PersistenceManagerFactory] Creating transactional connection pool
2023-02-01 09:28:39,318 WARN [HikariConfig] transactional - idleTimeout is close to or more than maxLifetime, disabling it.
2023-02-01 09:28:39,343 INFO [PersistenceManagerFactory] Creating non-transactional connection pool
2023-02-01 09:28:39,344 WARN [HikariConfig] non-transactional - idleTimeout is close to or more than maxLifetime, disabling it.
2023-02-01 09:28:40,463 INFO [DefaultObjectGenerator] Initializing default object generator
2023-02-01 09:28:40,465 INFO [DefaultObjectGenerator] Synchronizing permissions to datastore
2023-02-01 09:28:40,673 INFO [DefaultObjectGenerator] Synchronizing SPDX license definitions to datastore
2023-02-01 09:28:44,079 INFO [DefaultObjectGenerator] Synchronizing default repositories to datastore
2023-02-01 09:28:44,099 INFO [DefaultObjectGenerator] Synchronizing config properties to datastore
2023-02-01 09:28:44,203 INFO [DefaultObjectGenerator] Synchronizing notification publishers to datastore
2023-02-01 09:28:44,482 INFO [CweImporter] Synchronizing CWEs with datastore
2023-02-01 09:28:46,543 INFO [CweImporter] CWE synchronization complete
2023-02-01 09:28:46,546 INFO [EventSubsystemInitializer] Initializing asynchronous event subsystem
2023-02-01 09:28:46,588 INFO [NotificationSubsystemInitializer] Initializing notification service
2023-02-01 09:28:46,591 INFO [IndexSubsystemInitializer] Building lucene indexes if required
2023-02-01 09:28:46,612 INFO [IndexManager] Checking the health of index PROJECT
2023-02-01 09:28:46,876 INFO [IndexManager] The index PROJECT is healthy
2023-02-01 09:28:46,877 INFO [IndexManager] Checking the health of index COMPONENT
2023-02-01 09:28:56,581 INFO [VulnerabilityMetricsUpdateTask] Executing metrics update on vulnerability database
2023-02-01 09:28:56,581 INFO [PortfolioMetricsUpdateTask] Executing portfolio metrics update
2023-02-01 09:28:56,661 INFO [ClearComponentAnalysisCacheTask] Clearing ComponentAnalysisCache
2023-02-01 09:28:56,694 INFO [OsvDownloadTask] Updating datasource with Google OSV advisories for ecosystem Alpine
2023-02-01 09:28:56,705 INFO [ClearComponentAnalysisCacheTask] Complete
2023-02-01 09:28:57,491 INFO [IndexManager] The index COMPONENT is healthy
2023-02-01 09:28:57,492 INFO [IndexManager] Checking the health of index SERVICECOMPONENT
2023-02-01 09:28:57,494 INFO [IndexManager] The index SERVICECOMPONENT is healthy
2023-02-01 09:28:57,495 INFO [IndexManager] Checking the health of index VULNERABILITY
2023-02-01 09:28:57,499 ERROR [IndexManager] The index vulnerability seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
Is there any service running on the system where the /data volume is mounted that could touch the index files?
No
Are you using a special file system for the mounted volume (network share, GlusterFS, etc.)?
No. It's a locally mounted filesystem (ext4).
Are there any occurrences in your logs where the indexes were not healthy and had to be recreated?
Yes:
2023-02-02 00:29:15,594 INFO [IndexManager] Checking the health of index COMPONENT
2023-02-02 00:29:15,595 ERROR [IndexManager] The index component seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/component/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,595 INFO [IndexManager] (Re)Building index component
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index servicecomponent
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the health of index SERVICECOMPONENT
2023-02-02 00:29:15,596 ERROR [IndexManager] The index servicecomponent seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/servicecomponent/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,596 INFO [IndexManager] (Re)Building index servicecomponent
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index vulnerability
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the health of index VULNERABILITY
2023-02-02 00:29:15,596 ERROR [IndexManager] The index vulnerability seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,596 INFO [IndexManager] (Re)Building index vulnerability
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index license
@holdebeSICKAG @olafz Is there a correlation of indexes being reported as unhealthy on startup, and AlreadyClosedExceptions of the same indexes later on?
Looking at the code, it's odd that the indexes are reporting an existing lock file. We do have a check in place that will delete existing lock files before checking the index's health (IndexManager.isIndexHealthy). Line 444 of that method (IndexManager.java:444) is where the exception is thrown:
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/component/write.lock
at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
At that point the file should've been deleted already. I verified that this is actually happening locally.
Potentially some sort of race condition...
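In essence, the check does something along these lines. This is a simplified sketch for illustration only, not the actual IndexManager code: delete a (presumably stale) write.lock, then let Lucene's CheckIndex verify the index. CheckIndex acquires its own write lock while it runs.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class IndexHealthCheckSketch {
    // Simplified sketch of the described behavior, not the actual IndexManager code.
    static boolean isIndexHealthy(Path indexPath) throws Exception {
        // Delete any existing write.lock, assuming it is stale.
        Files.deleteIfExists(indexPath.resolve("write.lock"));

        try (FSDirectory directory = FSDirectory.open(indexPath);
             CheckIndex checkIndex = new CheckIndex(directory)) { // acquires its own write lock
            return checkIndex.checkIndex().clean;
        }
    }
}
```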
@holdebeSICKAG @olafz Is there a correlation of indexes being reported as unhealthy on startup, and AlreadyClosedExceptions of the same indexes later on?
Especially for path=/data/.dependency-track/index/servicecomponent/write.lock: no. That happens after SBOM uploads and there is no regular report about that index being unhealthy.
I've spun up a test instance on a server, added loads of projects, and will let it run for a while. So far I'm not seeing any index issues. Already tried restarting the instance at various times, but not able to trigger this behavior. Same is true locally, not able to reproduce.
@olafz @holdebeSICKAG Are you able to spin up another instance of DT on your servers and see if the errors occur there as well? Thanks a lot for your help so far!
This is the defect that I should have logged myself after talking to @nscuro back in January. That 2023-01-05T13:41:30.299072688Z in the issue description is from my logs!
Anyway, it is still occurring for me with our production 4.7.1: 322 occurrences since the last restart on 1st Feb.
It is also occurring on 4.8.0-SNAPSHOT (Built-on: 2023-02-01T18:41:56Z) but only 4 occurrences.
Both systems are deployed by ArgoCD to K8s and are on Azure. volumeAttributes.fsType is ext4 and we have not specified anything that would change that. As for external services (virus-scanning, etc.), there is a Qualys agent installed.
Both our systems have almost all the same projects and even have matching project UUIDs. BOMs uploaded to production are synced to the test server via webhook and Benthos.
I find it interesting that there is a big disparity between the two systems when it comes to exception count. The test server was set up long after the production server and thus never underwent a v4.0.0 upgrade migration, etc.
Maybe there is a need for a tool that could be run to validate local data? Or, what can be deleted and then be allowed to rebuild dynamically thus (and I say this as an optimistic non-dev) cleaning out bad data?
Thanks @msymons!
Maybe there is a need for a tool that could be run to validate local data?
At least for the indexes, that's already happening as of #2200. It will even rebuild the index when it detects that the index deviates too much from what's inside the DB.
Or, what can be deleted and then be allowed to rebuild dynamically thus (and I say this as an optimistic non-dev) cleaning out bad data?
The entire index directory is fine to delete (after shutting down the DT instance, of course). Indexes will be rebuilt on startup; they don't store anything that needs to be retained, really.
Found the culprit.
Since 4.7.0 there is an index health check on startup, but also a recurring one every 72h (by default; configurable in the UI at Administration -> Search -> Search - Periodic consistency check -> Cadence).
The health check uses Lucene's CheckIndex. CheckIndex does not only "check" the index, it actually tries to repair it if necessary. It is a writing operation that requires a write lock to be acquired. This is why DT currently deletes any existing lock files before starting CheckIndex.
Write locks, however, are also held by the IndexWriters. There is an IndexWriter for every index. Official guidance from Lucene recommends creating an IndexWriter once and then sharing the created instance among threads, for the entire lifetime of the application. So this is what DT does. An IndexWriter creates a write lock file when it is created, and only releases that lock when it is closed. Based on the Lucene guidance, and how it is implemented in DT, IndexWriters are only closed when the application shuts down.
So this is what happens:
1. DT starts up and asynchronous tasks begin executing; IndexWriters may be created at this point.
2. The index health check is performed, deleting any write.lock files that may have been created already, assuming they are stale.
   - Not sure how common this is, most scheduled tasks have an initial delay of at least 10s.
3. By the time the periodic consistency check runs, all IndexWriters and their lock files should have been created, yet the check again deletes and recreates the write.lock files.
4. Upon the next commit, Lucene throws an AlreadyClosedException, since the file has been recreated after the IndexWriter created it. It has to assume that the lock it holds is ineffective.

As a short-term fix, disable the periodic consistency check (Administration -> Search, as mentioned above).
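To illustrate the sequence above, here is a minimal, self-contained sketch of the failure mode using plain Lucene against a throwaway directory on a Linux filesystem. It is not DT's actual code; the index contents and paths are made up for the example.

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class LockClobberRepro {
    public static void main(String[] args) throws Exception {
        Path indexDir = Files.createTempDirectory("lucene-lock-repro");

        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // Simulate the health check: delete and recreate write.lock while the
            // long-lived IndexWriter still believes it holds the lock.
            // (Deleting a locked file works on Linux; Windows may refuse.)
            Path lockFile = indexDir.resolve("write.lock");
            Files.delete(lockFile);
            Files.createFile(lockFile);

            Document doc = new Document();
            doc.add(new TextField("name", "example", Field.Store.YES));
            writer.addDocument(doc);
            // Fails with AlreadyClosedException: "Underlying file changed by an external force"
            writer.commit();
        }
    }
}
```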
As for a proper fix in the code, I see multiple options:
1. Close all IndexWriters before performing the health check, thus releasing all locks. Re-open the writers after the check completed. While the check is running, all accesses to the closed IndexWriters will fail (see the sketch below).
2. Only perform the health check on startup, when we can guarantee that no other write operations are being performed.
3. Not use CheckIndex at all.
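A rough sketch of what option 1 could look like, using plain Lucene APIs. The method and class names are hypothetical and this is not DT's actual code; it only illustrates the close / check / reopen sequence and its main drawback.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

class WriterAwareHealthCheck {
    // Close the long-lived IndexWriter so its write.lock is released, run
    // CheckIndex, then reopen the writer. Any indexing attempted while the
    // writer is closed would fail, which is the drawback of this option.
    static IndexWriter checkAndReopen(FSDirectory directory, IndexWriter writer) throws Exception {
        writer.close(); // releases write.lock
        try (CheckIndex checkIndex = new CheckIndex(directory)) {
            if (!checkIndex.checkIndex().clean) {
                // the index would have to be rebuilt from the database here
            }
        }
        return new IndexWriter(directory, new IndexWriterConfig(new StandardAnalyzer()));
    }
}
```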
Hello @nscuro
Thanks for the spot-on analysis.
Index health check is performed, deleting any write.lock files that may have been created already, assuming they are stale
- Not sure how common this is, most scheduled tasks have an initial delay of at least 10s

I don't think that a race condition between the first task execution and the initial health check is that common. I rather implemented this check to protect against a hard process kill/crash, which would leave a stale lock file behind (too many bad experiences with flaky file-based lock implementations :sweat:).
There's maybe another alternative for a fix:
- The Directory instance used for each indexer instance is a singleton.
- When the Directory instance is null, the stale lock file is deleted and the Directory instance is created.
- Afterwards, the Directory instance will not be null, so it is fed to CheckIndex and reuses the shared lock to ensure proper synchronization with writes occurring at the same time.

@syalioune Are lock files supposed to / OK to be shared by multiple classes that perform write operations? In my naive thinking that would kind of defeat the purpose of the lock.
In my naive thinking
You and me both 😓. I was naively thinking that IndexWriter instances would lock the directory on a strictly need-to-lock basis, so that the lock could be used as a mutex with CheckIndex. After looking at the code, it appears that the lock is obtained at IndexWriter instance creation and held throughout its lifetime.
In that case, I vote for:
Only perform the health check on startup, when we can guarantee that no other write operations are being performed
The periodic task would only compute the delta ratio between DB and Lucene and fire REINDEX events to the corresponding IndexWriter (a rough sketch follows below).
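A rough sketch of what such a delta check could look like. The class name and threshold are invented for illustration and this is not the implementation that ended up in #2477; it only shows the idea of comparing document counts between the database and the Lucene index.

```java
import java.nio.file.Path;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

class ConsistencyCheckSketch {
    static final double DELTA_THRESHOLD = 0.1; // assumed: rebuild at more than 10% deviation

    // Compare the number of documents in the Lucene index with the number of
    // rows in the corresponding database table; the caller would fire a
    // REINDEX event for the index when this returns true.
    static boolean needsReindex(Path indexPath, long rowsInDatabase) throws Exception {
        try (FSDirectory dir = FSDirectory.open(indexPath);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            int docsInIndex = reader.numDocs();
            if (rowsInDatabase == 0) {
                return docsInIndex != 0;
            }
            double delta = Math.abs(docsInIndex - rowsInDatabase) / (double) rowsInDatabase;
            return delta > DELTA_THRESHOLD;
        }
    }
}
```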
Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the IndexWriters when the application shuts down.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Current Behavior
In some deployments, committing Lucene indexes may fail with an AlreadyClosedException, stating that the write.lock file of the respective index was "changed by an external force".
write.lock is empty, and created when an IndexWriter instance is created. DT creates an IndexWriter instance per index on startup. Write access to indexes is serialized using the single-threaded ExecutorService, so there's no way an index could be modified by multiple threads.
Lucene uses timestamps of write.lock to determine whether it has been modified after the index was opened.
I am unable to reproduce locally. It seems like it's happening predominantly on servers, but I am not yet sure what exactly the cause is.
Could be third-party services modifying the file, file systems incapable of tracking file timestamps correctly, or skewed server times.
The exception can be forced by touching (i.e. modifying access timestamps) the write.lock file after DT has started.
Afterwards, upload a BOM and monitor the log for AlreadyClosedExceptions.
s.Expected Behavior
Committing indexes should not fail.
Dependency-Track Version
4.7.0
Dependency-Track Distribution
Container Image, Executable WAR
Database Server
N/A
Database Server Version
No response
Browser
N/A
Checklist