DependencyTrack / dependency-track

Dependency-Track is an intelligent Component Analysis platform that allows organizations to identify and reduce risk in the software supply chain.
https://dependencytrack.org/
Apache License 2.0
2.69k stars 578 forks source link

Index commit occasionally fails with `AlreadyClosedException` in some deployments #2379

Closed nscuro closed 1 year ago

nscuro commented 1 year ago

Current Behavior

In some deployments, committing Lucene indexes may fail with an AlreadyClosedException, stating that the write.lock file of the respective index was "changed by an external force":

ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-01-05T13:41:30.299072688Z, (lock=NativeFSLock(path=/data/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-01-05T10:41:31.278927236Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
    at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
    at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
    at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

write.lock is empty, and created when an IndexWriter instance is created. DT creates an IndexWriter instance per Index on startup. Write access to indexes is serialized using the single threaded ExecutorService, so there's no way an index could be modified by multiple threads.

Lucene uses timestamps of write.lock to determine whether it has been modified after the index was opened.

Steps to Reproduce

I am unable to reproduce locally. It seems like it's happening predominantly on servers, but I am yet unsure what exactly is the cause.

Could be third party services modifying the file, or file systems incapable of tracking file timestamps correctly, or skewed server times.

The exception can be forced by touching (i.e. modifying access timestamps) the write.lock file after DT has started, e.g.:

touch ~/.dependency-track/index/component/write.lock

Afterwards, upload a BOM and monitor the log for AlreadyClosedExceptions.

Expected Behavior

Committing indexes should not fail.

Dependency-Track Version

4.7.0

Dependency-Track Distribution

Container Image, Executable WAR

Database Server

N/A

Database Server Version

No response

Browser

N/A

Checklist

holdebeSICKAG commented 1 year ago

We did encounter this as well. These two errors occur every time DT starts up. I was not able to fix this. First occurred after upgrading to 4.7.0 a few weeks ago. 2023-02-01 08:40:33,460 ERROR [IndexManager] The index vulnerability seems to be corrupted org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139) at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) at org.apache.lucene.index.CheckIndex.(CheckIndex.java:425) at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444) at org.dependencytrack.search.IndexManager.lambda$ensureIndexesExists$0(IndexManager.java:415) at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source) at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source) at org.dependencytrack.search.IndexManager.ensureIndexesExists(IndexManager.java:414) at org.dependencytrack.search.IndexSubsystemInitializer.contextInitialized(IndexSubsystemInitializer.java:48) at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:1048) at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:624) at org.eclipse.jetty.server.handler.ContextHandler.contextInitialized(ContextHandler.java:983) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:740) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1304) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:900) at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:532) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93) at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:171) at org.eclipse.jetty.server.Server.start(Server.java:470) at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:89) at org.eclipse.jetty.server.Server.doStart(Server.java:415) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93) at alpine.embedded.EmbeddedJettyServer.main(EmbeddedJettyServer.java:100) 2023-02-01 08:40:33,460 ERROR [IndexManager] Error committing index java.nio.file.NoSuchFileException: /data/.dependency-track/index/vulnerability/write.lock at java.base/sun.nio.fs.UnixException.translateToIOException(Unknown Source) at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) at java.base/sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(Unknown Source) at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(Unknown Source) at java.base/sun.nio.fs.LinuxFileSystemProvider.readAttributes(Unknown Source) at java.base/java.nio.file.Files.readAttributes(Unknown Source) at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:189) at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37) at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:705) at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:699) at org.apache.lucene.index.IndexFileDeleter.deleteNewFiles(IndexFileDeleter.java:673) at org.apache.lucene.index.IndexWriter.deleteNewFiles(IndexWriter.java:5333) at org.apache.lucene.index.IndexWriter.access$600(IndexWriter.java:219) at org.apache.lucene.index.IndexWriter$1.lambda$deleteUnusedFiles$0(IndexWriter.java:436) at org.apache.lucene.index.IndexWriter$EventQueue.processEventsInternal(IndexWriter.java:354) at org.apache.lucene.index.IndexWriter$EventQueue.processEvents(IndexWriter.java:344) at org.apache.lucene.index.IndexWriter.processEvents(IndexWriter.java:5426) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3376) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249) at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65) at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)

nscuro commented 1 year ago

@holdebeSICKAG Thanks for your input. Does this happen when restarting DT, or generally when starting (as in, a longer delay between stopping and starting the instance)?

Are you stopping the DT process gracefully (SIGTERM) or forcefully (SIGKILL)?

holdebeSICKAG commented 1 year ago

Thanks for replying @nscuro

It happens when restarting DT using compose. Also, I never had to forcefully shut down DT.

Observing the behavior for the past few days. The INDEX errror for vulnerability occurs only at start-up. It first wants to check the index and assumes it is corrupt as DT is not happy with the lock file. Then it re-creates the index but the same error occurs.

Furthermore, the error below occurs regurarly throughout the day, everytime a BOM was processed. So usually, I see these two log lines.

2023-02-02 07:38:45,714 INFO [BomUploadProcessingTask] Processed 233 components and 0 services uploaded to project d697145b-938e-4841-9395-46f121b6da34 2023-02-02 09:43:03,352 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-01T12:28:46.586660551Z, (lock=NativeFSLock(path=/data/.dependency-track/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T09:28:57.487377912Z)) at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191) at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61) at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802) at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084) at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460) at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770) at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728) at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249) at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65) at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source)

If I can give you more information, please tell me. If you want me to test or change something, please tell me too.

olafz commented 1 year ago

We see the org.apache.lucene.store.AlreadyClosedException directly after the upload of a SBOM.

2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Identified 0 new components
2023-02-03 07:07:20,794 INFO [BomUploadProcessingTask] Processing CycloneDX dependency graph for project: 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [BomUploadProcessingTask] Processed 140 components and 0 services uploaded to project 2ecb6ef9-c7a8-4a8d-b347-c1e6c42b43c9
2023-02-03 07:07:20,807 INFO [RepositoryMetaAnalyzerTask] Performing component repository metadata analysis against 140 components
2023-02-03 07:07:20,819 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/servicecomponent/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
    at org.apache.lucene.store.LockValidatingDirectoryWrapper.syncMetaData(LockValidatingDirectoryWrapper.java:61)
    at org.apache.lucene.index.SegmentInfos.prepareCommit(SegmentInfos.java:802)
    at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:5084)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3460)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
    at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
nscuro commented 1 year ago

Thanks @holdebeSICKAG and @olafz!

So far it seems like the AlreadyClosedException is occurring exclusively on the servicecomponent index, can you confirm this?

Referring to this part of the exception here:

path=/data/.dependency-track/index/servicecomponent/write.lock
olafz commented 1 year ago

So far it seems like the AlreadyClosedException is occurring exclusively on the servicecomponent index, can you confirm this?

The vast majority of the exceptions happens on the path=/data/.dependency-track/index/servicecomponent/write.lock, however not exclusively. It also happens on path=/data/.dependency-track/index/vulnerability/write.lock, path=/data/.dependency-track/index/component/write.lock and path=/data/.dependency-track/index/cpe/write.lock.

Some examples (with context) below:

2023-02-02 00:29:16,997 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/vulnerability/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:07.312128577Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
    at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
    at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
    at org.apache.lucene.codecs.lucene80.Lucene80NormsConsumer.<init>(Lucene80NormsConsumer.java:45)
    at org.apache.lucene.codecs.lucene80.Lucene80NormsFormat.normsConsumer(Lucene80NormsFormat.java:85)
    at org.apache.lucene.index.DefaultIndexingChain.writeNorms(DefaultIndexingChain.java:389)
    at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:221)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:476)
    at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:656)
    at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3364)
    at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3770)
    at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3728)
    at org.dependencytrack.search.IndexManager.commit(IndexManager.java:249)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:65)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:17,948 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread
org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-02T00:29:15.592156452Z, (lock=NativeFSLock(path=/data/.dependency-track/index/component/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T19:00:04.920111164Z))
    at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191)
    at org.apache.lucene.store.LockValidatingDirectoryWrapper.createOutput(LockValidatingDirectoryWrapper.java:43)
    at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:43)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsWriter.<init>(CompressingStoredFieldsWriter.java:121)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsWriter(CompressingStoredFieldsFormat.java:130)
    at org.apache.lucene.codecs.lucene87.Lucene87StoredFieldsFormat.fieldsWriter(Lucene87StoredFieldsFormat.java:141)
    at org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter(StoredFieldsConsumer.java:48)
    at org.apache.lucene.index.StoredFieldsConsumer.startDocument(StoredFieldsConsumer.java:55)
    at org.apache.lucene.index.DefaultIndexingChain.startStoredFields(DefaultIndexingChain.java:452)
    at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:488)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208)
    at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415)
    at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1757)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1400)
    at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:80)
    at org.dependencytrack.search.ComponentIndexer.add(ComponentIndexer.java:44)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:58)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
nscuro commented 1 year ago

Just to rule this out:

Separately, DT performs a health check on each index when it starts, e.g.:

2023-02-03 10:29:12,540 INFO [IndexSubsystemInitializer] Building lucene indexes if required
2023-02-03 10:29:12,557 INFO [IndexManager] Checking the health of index PROJECT
2023-02-03 10:29:12,635 INFO [IndexManager] The index PROJECT is healthy
2023-02-03 10:29:12,636 INFO [IndexManager] Checking the health of index COMPONENT
2023-02-03 10:29:12,638 INFO [IndexManager] The index COMPONENT is healthy
2023-02-03 10:29:12,638 INFO [IndexManager] Checking the health of index SERVICECOMPONENT
2023-02-03 10:29:12,639 INFO [IndexManager] The index SERVICECOMPONENT is healthy
2023-02-03 10:29:12,639 INFO [IndexManager] Checking the health of index VULNERABILITY
2023-02-03 10:29:12,641 INFO [IndexManager] The index VULNERABILITY is healthy
2023-02-03 10:29:12,641 INFO [IndexManager] Checking the health of index LICENSE
2023-02-03 10:29:12,778 INFO [IndexManager] The index LICENSE is healthy
2023-02-03 10:29:12,778 INFO [IndexManager] Checking the health of index CPE
2023-02-03 10:29:12,779 INFO [IndexManager] The index CPE is healthy
2023-02-03 10:29:12,779 INFO [IndexManager] Checking the health of index VULNERABLESOFTWARE
2023-02-03 10:29:12,780 INFO [IndexManager] The index VULNERABLESOFTWARE is healthy

Are there any occurrences in your logs where the logs were not healthy and had to be recreated?

Can you correlate the modification timestamps reported in the AlreadyClosedException to some unusual event in the logs? For the last stack trace you shared @olafz, it says the component index lock file was modified at 2023-02-02T00:29:15.592156452Z.

holdebeSICKAG commented 1 year ago

I attached the log file entries from start to DT ready. Maybe it helps

2023-02-01 09:28:33,667 INFO [EmbeddedJettyServer] alpine-executable-war v2.2.0 (050a7a2a-12f7-4d02-b7b0-cd6ec432eac6) built on: 2022-12-14T20:24:42Z 2023-02-01 09:28:36,627 INFO [Config] -------------------------------------------------------------------------------- 2023-02-01 09:28:36,628 INFO [Config] OS Name: Linux 2023-02-01 09:28:36,629 INFO [Config] OS Version: 5.4.0-135-generic 2023-02-01 09:28:36,629 INFO [Config] OS Arch: amd64 2023-02-01 09:28:36,630 INFO [Config] CPU Cores: 14 2023-02-01 09:28:36,633 INFO [Config] Max Memory: 19.2 GB (20,617,625,600.0 bytes) 2023-02-01 09:28:36,633 INFO [Config] Java Vendor: Eclipse Adoptium 2023-02-01 09:28:36,634 INFO [Config] Java Version: 17.0.5+8 2023-02-01 09:28:36,634 INFO [Config] Java Home: /opt/java/openjdk 2023-02-01 09:28:36,635 INFO [Config] Java Temp: /tmp 2023-02-01 09:28:36,635 INFO [Config] User: dtrack 2023-02-01 09:28:36,635 INFO [Config] User Home: /data/ 2023-02-01 09:28:36,635 INFO [Config] -------------------------------------------------------------------------------- 2023-02-01 09:28:36,635 INFO [Config] Initializing Configuration 2023-02-01 09:28:36,636 INFO [Config] System property alpine.application.properties not specified 2023-02-01 09:28:36,636 INFO [Config] Loading application.properties from classpath 2023-02-01 09:28:36,642 INFO [Config] -------------------------------------------------------------------------------- 2023-02-01 09:28:36,642 INFO [Config] Application: Dependency-Track 2023-02-01 09:28:36,642 INFO [Config] Version: 4.7.1 2023-02-01 09:28:36,642 INFO [Config] Built-on: 2023-01-31T21:46:01Z 2023-02-01 09:28:36,642 INFO [Config] -------------------------------------------------------------------------------- 2023-02-01 09:28:36,643 INFO [Config] Framework: Alpine 2023-02-01 09:28:36,643 INFO [Config] Version : 2.2.0 2023-02-01 09:28:36,643 INFO [Config] Built-on: 2022-12-14T20:24:42Z 2023-02-01 09:28:36,643 INFO [Config] -------------------------------------------------------------------------------- 2023-02-01 09:28:36,732 INFO [RequirementsVerifier] Initializing requirements verifier 2023-02-01 09:28:36,732 INFO [UpgradeInitializer] Initializing upgrade framework 2023-02-01 09:28:39,311 INFO [PersistenceManagerFactory] Initializing persistence framework 2023-02-01 09:28:39,312 INFO [PersistenceManagerFactory] Creating transactional connection pool 2023-02-01 09:28:39,318 WARN [HikariConfig] transactional - idleTimeout is close to or more than maxLifetime, disabling it. 2023-02-01 09:28:39,343 INFO [PersistenceManagerFactory] Creating non-transactional connection pool 2023-02-01 09:28:39,344 WARN [HikariConfig] non-transactional - idleTimeout is close to or more than maxLifetime, disabling it. 2023-02-01 09:28:40,463 INFO [DefaultObjectGenerator] Initializing default object generator 2023-02-01 09:28:40,465 INFO [DefaultObjectGenerator] Synchronizing permissions to datastore 2023-02-01 09:28:40,673 INFO [DefaultObjectGenerator] Synchronizing SPDX license definitions to datastore 2023-02-01 09:28:44,079 INFO [DefaultObjectGenerator] Synchronizing default repositories to datastore 2023-02-01 09:28:44,099 INFO [DefaultObjectGenerator] Synchronizing config properties to datastore 2023-02-01 09:28:44,203 INFO [DefaultObjectGenerator] Synchronizing notification publishers to datastore 2023-02-01 09:28:44,482 INFO [CweImporter] Synchronizing CWEs with datastore 2023-02-01 09:28:46,543 INFO [CweImporter] CWE synchronization complete 2023-02-01 09:28:46,546 INFO [EventSubsystemInitializer] Initializing asynchronous event subsystem 2023-02-01 09:28:46,588 INFO [NotificationSubsystemInitializer] Initializing notification service 2023-02-01 09:28:46,591 INFO [IndexSubsystemInitializer] Building lucene indexes if required 2023-02-01 09:28:46,612 INFO [IndexManager] Checking the health of index PROJECT 2023-02-01 09:28:46,876 INFO [IndexManager] The index PROJECT is healthy 2023-02-01 09:28:46,877 INFO [IndexManager] Checking the health of index COMPONENT 2023-02-01 09:28:56,581 INFO [VulnerabilityMetricsUpdateTask] Executing metrics update on vulnerability database 2023-02-01 09:28:56,581 INFO [PortfolioMetricsUpdateTask] Executing portfolio metrics update 2023-02-01 09:28:56,661 INFO [ClearComponentAnalysisCacheTask] Clearing ComponentAnalysisCache 2023-02-01 09:28:56,694 INFO [OsvDownloadTask] Updating datasource with Google OSV advisories for ecosystem Alpine 2023-02-01 09:28:56,705 INFO [ClearComponentAnalysisCacheTask] Complete 2023-02-01 09:28:57,491 INFO [IndexManager] The index COMPONENT is healthy 2023-02-01 09:28:57,492 INFO [IndexManager] Checking the health of index SERVICECOMPONENT 2023-02-01 09:28:57,494 INFO [IndexManager] The index SERVICECOMPONENT is healthy 2023-02-01 09:28:57,495 INFO [IndexManager] Checking the health of index VULNERABILITY 2023-02-01 09:28:57,499 ERROR [IndexManager] The index vulnerability seems to be corrupted org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139) at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) at org.apache.lucene.index.CheckIndex.(CheckIndex.java:425) at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444) at org.dependencytrack.search.IndexManager.lambda$ensureIndexesExists$0(IndexManager.java:415) at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source) at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source) at org.dependencytrack.search.IndexManager.ensureIndexesExists(IndexManager.java:414) at org.dependencytrack.search.IndexSubsystemInitializer.contextInitialized(IndexSubsystemInitializer.java:48) at org.eclipse.jetty.server.handler.ContextHandler.callContextInitialized(ContextHandler.java:1048) at org.eclipse.jetty.servlet.ServletContextHandler.callContextInitialized(ServletContextHandler.java:624) at org.eclipse.jetty.server.handler.ContextHandler.contextInitialized(ContextHandler.java:983) at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:740) at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392) at org.eclipse.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1304) at org.eclipse.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:900) at org.eclipse.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:306) at org.eclipse.jetty.webapp.WebAppContext.doStart(WebAppContext.java:532) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93) at org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:171) at org.eclipse.jetty.server.Server.start(Server.java:470) at org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:114) at org.eclipse.jetty.server.handler.AbstractHandler.doStart(AbstractHandler.java:89) at org.eclipse.jetty.server.Server.doStart(Server.java:415) at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:93) at alpine.embedded.EmbeddedJettyServer.main(EmbeddedJettyServer.java:100) 2023-02-01 09:28:57,501 INFO [IndexManager] (Re)Building index vulnerability 2023-02-01 09:28:57,502 INFO [IndexManager] Checking the health of index LICENSE 2023-02-01 09:28:57,515 INFO [IndexManager] The index LICENSE is healthy 2023-02-01 09:28:57,515 INFO [IndexManager] Checking the health of index CPE 2023-02-01 09:28:57,535 ERROR [LoggableUncaughtExceptionHandler] An unknown error occurred in an asynchronous event or notification thread org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2023-02-01T09:28:57.491377909Z, (lock=NativeFSLock(path=/data/.dependency-track/index/vulnerability/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2023-02-01T08:40:33.453769847Z)) at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:191) at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:37) at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:705) at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:699) at org.apache.lucene.index.IndexFileDeleter.(IndexFileDeleter.java:238) at org.apache.lucene.index.IndexWriter.(IndexWriter.java:1089) at org.dependencytrack.search.IndexManager.openIndex(IndexManager.java:191) at org.dependencytrack.search.IndexManager.getIndexWriter(IndexManager.java:202) at org.dependencytrack.search.VulnerabilityIndexer.remove(VulnerabilityIndexer.java:100) at org.dependencytrack.search.VulnerabilityIndexer.remove(VulnerabilityIndexer.java:44) at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:60) at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.base/java.lang.Thread.run(Unknown Source) 2023-02-01 09:28:57,623 INFO [VulnerabilityIndexer] Starting reindex task. This may take some time. 2023-02-01 09:28:57,626 INFO [IndexManager] Deleting vulnerability index 2023-02-01 09:28:59,387 INFO [IndexManager] The index CPE is healthy 2023-02-01 09:28:59,388 INFO [IndexManager] Checking the health of index VULNERABLESOFTWARE 2023-02-01 09:29:00,024 INFO [IndexManager] The index VULNERABLESOFTWARE is healthy 2023-02-01 09:29:00,049 INFO [GitHubAdvisoryMirrorTask] Updating datasource with GitHub advisories 2023-02-01 09:29:00,055 INFO [AlpineServlet] Starting Dependency-Track 2023-02-01 09:29:03,499 INFO [AlpineServlet] Dependency-Track is ready

olafz commented 1 year ago

Is there any service running on the system where the /data volume is mounted that could touch the index files?

No

Are you using a special file system for the mounted volume (network share, GlusterFS, etc.)?

No. It's a locally mounted filesystem (ext4).

Are there any occurrences in your logs where the index were not healthy and had to be recreated?

Yes:

2023-02-02 00:29:15,594 INFO [IndexManager] Checking the health of index COMPONENT
2023-02-02 00:29:15,595 ERROR [IndexManager] The index component seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/component/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
    at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
    at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
    at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
    at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,595 INFO [IndexManager] (Re)Building index component
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index servicecomponent
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the health of index SERVICECOMPONENT
2023-02-02 00:29:15,596 ERROR [IndexManager] The index servicecomponent seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/servicecomponent/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
    at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
    at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
    at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
    at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,596 INFO [IndexManager] (Re)Building index servicecomponent
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index vulnerability
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the health of index VULNERABILITY
2023-02-02 00:29:15,596 ERROR [IndexManager] The index vulnerability seems to be corrupted
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/vulnerability/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
    at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
    at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)
    at org.dependencytrack.search.IndexManager.lambda$checkIndexesConsistency$1(IndexManager.java:482)
    at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Unknown Source)
    at java.base/java.util.stream.ReferencePipeline$Head.forEach(Unknown Source)
    at org.dependencytrack.search.IndexManager.checkIndexesConsistency(IndexManager.java:479)
    at org.dependencytrack.tasks.IndexTask.inform(IndexTask.java:51)
    at alpine.event.framework.BaseEventService.lambda$publish$0(BaseEventService.java:101)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
2023-02-02 00:29:15,596 INFO [IndexManager] (Re)Building index vulnerability
2023-02-02 00:29:15,596 INFO [IndexManager] Checking the index license
nscuro commented 1 year ago

@holdebeSICKAG @olafz Is there a correlation of indexes being reported as unhealthy on startup, and AlreadyClosedExceptions of the same indexes later on?

Looking at the code, it's odd that the indexes are reporting an existing lock file. We do have a check in place that will delete existing lock files before checking the index's health:

https://github.com/DependencyTrack/dependency-track/blob/93321f8a82f2eee4c2d42bb6ffc664ee54a82b4d/src/main/java/org/dependencytrack/search/IndexManager.java#L438-L444

Line 444 in the above snippet is where the exception is thrown:

org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /data/.dependency-track/index/component/write.lock
    at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139)
    at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41)
    at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45)
    at org.apache.lucene.index.CheckIndex.<init>(CheckIndex.java:425)
    at org.dependencytrack.search.IndexManager.isIndexHealthy(IndexManager.java:444)

At that point the file should've been deleted already. I verified that this is actually happening locally.

Potentially some sort of race condition...

olafz commented 1 year ago

@holdebeSICKAG @olafz Is there a correlation of indexes being reported as unhealthy on startup, and AlreadyClosedExceptions of the same indexes later on?

Especially for path=/data/.dependency-track/index/servicecomponent/write.lock: no. That happens after SBOM uploads and there is no regular report about that index being unhealthy.

nscuro commented 1 year ago

I've spun up a test instance on a server, added loads of projects, and will let it run for a while. So far I'm not seeing any index issues. Already tried restarting the instance at various times, but not able to trigger this behavior. Same is true locally, not able to reproduce.

@olafz @holdebeSICKAG Are you able to spin up another instance of DT on your servers and see if the errors occur there as well? Thanks a lot for your help so far!

msymons commented 1 year ago

This is the defect that I should have logged myself after talking to @nscuro back in January. That 2023-01-05T13:41:30.299072688Z in the issue description is from my logs!

Anyway, it is still occuring for me with our production 4.7.1. 322 occurrences since last restart on 1st Feb.

It is also occurring on 4.8.0-SNAPSHOT (Built-on: 2023-02-01T18:41:56Z) but only 4 occurrences.

Both systems are deployed by ArgoCD K8s and are on Azure. volumeAttributes.fsType is ext4 and we have not specified anything that would change that. As for external services (virus-scanning, etc), there is a Qualys agent installed

Both our systems have almost all the same projects and even have matching project uuid. BOMs uploaded to production are synched to test server via webhook and Benthos.

I find it interesting that there is a big disparity between the two systems when it comes to exception count. The test server was set up long after production server and thus never underwent a v4.00 upgrade migration, etc.

Maybe there is a need for a tool that could be run to validate local data? Or, what can be deleted and then be allowed to rebuild dynamically thus (and I say this as an optimistic non-dev) cleaning out bad data?

nscuro commented 1 year ago

Thanks @msymons!

Maybe there is a need for a tool that could be run to validate local data?

At least for the indexes, that's already happening as of #2200. It will even rebuild the index when it detects that the index deviates to much from what's inside the DB.

Or, what can be deleted and then be allowed to rebuild dynamically thus (and I say this as an optimistic non-dev) cleaning out bad data?

The entire index directory is fine to delete (after shutting down the DT instance of course). Indexes will be rebuilt on startup, they don't store anything that needs to be retained really.

nscuro commented 1 year ago

Found the culprit.

Since 4.7.0 there is an index health check on startup, but also recurring every 72h (per default, configurable in UI at Administration -> Search -> Search - Periodic consistency check -> Cadence).

image

The health check uses Lucene's CheckIndex. CheckIndex does not only "check" the index, it actually tries to repair it if necessary. It is a writing operation that requires a write lock to be acquired. This is why DT currently deletes any existing lock files before starting CheckIndex:

https://github.com/DependencyTrack/dependency-track/blob/7cfc466baf7548ddb758f9e7b008b2d4aaedd148/src/main/java/org/dependencytrack/search/IndexManager.java#L438-L442

Write locks however are also held my the IndexWriters. There is an IndexWriter for every index. Official guidance from Lucene recommends to create IndexWriter once and then share the created instance among threads, for the entire lifetime of the application. So this is what DT does. IndexWriter creates a write lock file when it is created, and only releases that lock when it is closed. Based on Lucene guidance, and how it is implemented in DT, IndexWriters are only closed when the application shuts down.

So this is what happens:

As a short-term fix, disable the periodic consistency check:

image

As for a proper fix in the code, I see multiple options:

syalioune commented 1 year ago

Hello @nscuro

Thanks for the spot on analysis.

Index health check is performed, deleting any write.lock files that may have been created already, assuming they are stale

  • Not sure how common this is, most scheduled tasks have an initial delay of at least 10s

I don't think that a race condition between the first task execution and initial health check is that common. I've rather implemented this check to prevent against hard process kill/crash which will leave stale lock file (too many bad experiences with flaky file based lock implementations :sweat:).

There's maybe another alternative for a fix :

nscuro commented 1 year ago

@syalioune Are lock files supposed to / OK to be shared by multiple classes that perform write operations? In my naive thinking that would kind of defeat the purpose of the lock.

syalioune commented 1 year ago

In my naive thinking

You and me both 😓. I was naively thinking that IndexWriter instances would lock the directory on a strictly need-to-lock basis so that the lock could be used as a mutex with CheckIndex. After looking at the code, it appears that the lock is obtained at the IndexWriter instance creation and held throughout its lifetime.

In that case, I vote for

Only perform the health check on startup, when we can guarantee that no other write operations are being performed

The periodic task would only compute the delta ratio between DB and Lucene and fire REINDEX events to the corresponding IndexWriter

nscuro commented 1 year ago

Thanks @syalioune, I raised #2477 that does exactly that. I also added proper closing of the IndexWriters when the application shuts down.

github-actions[bot] commented 1 year ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.