CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Importer possibly not cleaning up files/dbs between failed imports? #382

Open gdower opened 5 years ago

gdower commented 5 years ago

If a dataset fails to import, it seems like the CoL+ importer isn't cleaning up the files and/or databases that are created during the failed import, which can leave Neo4j unable to create a new database. This bug makes debugging failing datasets really slow, because I have to register a new dataset each time I make a change.

To reproduce the bug, I created a temporary GitHub repo with the ACEF files for the 1201 database and registered it in the clearinghouse. As expected, it failed to import (see the Kibana logs for dataset 2054). Then I replaced those files with valid ACEF files from GSD1203 and tried to re-import from the previously registered dataset. I expected that to import successfully, but it failed again (same Kibana logs). Then I registered another new dataset with the exact same valid ACEF files and tried to import again, and it imported successfully.

gdower commented 5 years ago

The same thing happens on prod (see the Kibana logs for dataset 1201).

mdoering commented 5 years ago

The error indeed says it cannot obtain the store lock file:

Caused by: org.neo4j.kernel.StoreLockException: Unable to obtain lock on store lock file: /home/col/bin/dev/scratch/2054/normalizer/store_lock. Please ensure no other process is using this database, and that the directory is writable (required even for read-only access)

This is strange, as the files should get cleared up. I also cannot reproduce it in tests: when I first import a failure and then a success, it works just fine.
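If the failure path were missing a cleanup, a fix could look roughly like the following. This is only a sketch, not the actual NeoDbFactory code: the ImportAttempt class is hypothetical and it assumes commons-io is available for the recursive delete.

import java.io.File;
import org.apache.commons.io.FileUtils;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Hypothetical sketch of the failure path: if the import blows up, shut the
// embedded store down and delete its directory so the next attempt can
// reacquire store_lock.
public class ImportAttempt {

  public static void runImport(File storeDir) throws Exception {
    GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(storeDir);
    try {
      // ... normalize and insert the dataset here ...
    } catch (Exception e) {
      db.shutdown();                       // releases store_lock
      FileUtils.deleteDirectory(storeDir); // wipe the half-written store
      throw e;
    }
    db.shutdown();
  }
}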

gdower commented 5 years ago

Hmm, maybe it was caused by the deploy and wipe? I can try to reproduce it again tomorrow.

gdower commented 5 years ago

I was hitting this bug after deployment finished today, so I don't think it's deployment-related. If a dataset successfully imports, the importer doesn't seem to allow updating the dataset because it uses the previous copy and no errors are logged in Kibana.

gdower commented 5 years ago

Nevermind, I was wrong about this:

If a dataset successfully imports, the importer doesn't seem to allow updating the dataset because it uses the previous copy and no errors are logged in Kibana.

mdoering commented 5 years ago

@gdower, with the latest deploy I cannot find an example that fails like that. I have reimported many failed ACEF datasets and they all work now. Can you try with your test dataset, please?

gdower commented 5 years ago

The import test dataset works now, so I might not be able to test for the Normalizer bug. I'll re-open this if I can reproduce it again.

gdower commented 5 years ago

I'm still hitting this bug, so re-opening. Here's the stack trace (from the WARN/ERROR logs in Kibana):

java.lang.IllegalStateException: Failed to init NormalizerStore at /home/col/bin/prod/scratch/1141/normalizer
    at org.col.importer.neo.NeoDbFactory.create(NeoDbFactory.java:49)
    at org.col.importer.neo.NeoDbFactory.persistentDb(NeoDbFactory.java:30)
    at org.col.importer.neo.NeoDbFactory.create(NeoDbFactory.java:64)
    at org.col.importer.ImportJob.importDataset(ImportJob.java:189)
    at org.col.importer.ImportJob.run(ImportJob.java:116)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Error starting org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory, /home/col/bin/prod/scratch/1141/normalizer
    at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:212)
    at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.newFacade(GraphDatabaseFacadeFactory.java:125)
    at org.neo4j.graphdb.factory.GraphDatabaseFactory.newDatabase(GraphDatabaseFactory.java:137)
    at org.neo4j.graphdb.factory.GraphDatabaseFactory.newEmbeddedDatabase(GraphDatabaseFactory.java:130)
    at org.neo4j.graphdb.factory.GraphDatabaseFactory$1.newDatabase(GraphDatabaseFactory.java:107)
    at org.neo4j.graphdb.factory.GraphDatabaseBuilder.newGraphDatabase(GraphDatabaseBuilder.java:199)
    at org.col.importer.neo.NeoDb.openNeo(NeoDb.java:170)
    at org.col.importer.neo.NeoDb.<init>(NeoDb.java:131)
    at org.col.importer.neo.NeoDbFactory.create(NeoDbFactory.java:46)
    ... 9 common frames omitted
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'org.neo4j.kernel.internal.locker.StoreLockerLifecycleAdapter@649c126e' was successfully initialized, but failed to start. Please see the attached cause exception "Unable to obtain lock on store lock file: /home/col/bin/prod/scratch/1141/normalizer/store_lock. Please ensure no other process is using this database, and that the directory is writable (required even for read-only access)".
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:466)
    at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:107)
    at org.neo4j.kernel.impl.factory.GraphDatabaseFacadeFactory.initFacade(GraphDatabaseFacadeFactory.java:208)
    ... 17 common frames omitted
Caused by: org.neo4j.kernel.StoreLockException: Unable to obtain lock on store lock file: /home/col/bin/prod/scratch/1141/normalizer/store_lock. Please ensure no other process is using this database, and that the directory is writable (required even for read-only access)
    at org.neo4j.kernel.internal.locker.StoreLocker.storeLockException(StoreLocker.java:117)
    at org.neo4j.kernel.internal.locker.StoreLocker.unableToObtainLockException(StoreLocker.java:110)
    at org.neo4j.kernel.internal.locker.GlobalStoreLocker.haveLockAlready(GlobalStoreLocker.java:73)
    at org.neo4j.kernel.internal.locker.StoreLocker.checkLock(StoreLocker.java:65)
    at org.neo4j.kernel.internal.locker.GlobalStoreLocker.checkLock(GlobalStoreLocker.java:60)
    at org.neo4j.kernel.internal.locker.StoreLockerLifecycleAdapter.start(StoreLockerLifecycleAdapter.java:36)
    at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:445)
    ... 19 common frames omitted

Frontend version: 6058086 (August 21, 2019, 7:04 AM)
Backend version: a1681bf (August 21, 2019, 1:33 PM)
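For context, the GlobalStoreLocker.haveLockAlready frame in the trace suggests the lock is already held within the same JVM, i.e. an earlier NeoDb handle for this dataset was opened and never shut down. Embedded Neo4j 3.x fails the same way in isolation; a minimal sketch (hypothetical path, not importer code):

import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

// Minimal reproduction sketch: opening the same embedded store twice in one
// JVM without shutting down the first handle triggers StoreLockException.
public class StoreLockRepro {

  public static void main(String[] args) {
    File store = new File("/tmp/scratch/normalizer"); // hypothetical path
    GraphDatabaseService first = new GraphDatabaseFactory().newEmbeddedDatabase(store);
    // first.shutdown() is never called, so this JVM keeps holding store_lock;
    // the second open then fails with "Unable to obtain lock on store lock file".
    GraphDatabaseService second = new GraphDatabaseFactory().newEmbeddedDatabase(store);
  }
}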

gdower commented 5 years ago

@mdoering, rebooting does fix the issue, so it's not critical. I'll remove the important label.

mdoering commented 5 years ago

That might be related to prod being frozen earlier today, when I had to restart it. That could well be a situation where lock files are not cleaned up. We should clear all neo4j files when doing a restart, I reckon.
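Something like a startup sweep over the scratch directory would do. A rough sketch (the ScratchCleaner class is hypothetical; it assumes the scratch/<datasetKey>/normalizer layout seen in the logs above):

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical startup hook: wipe every leftover normalizer store under the
// scratch root so no stale store_lock can block the next import.
public class ScratchCleaner {

  public static void wipeAllNormalizerStores(Path scratchRoot) throws IOException {
    if (!Files.isDirectory(scratchRoot)) {
      return;
    }
    try (DirectoryStream<Path> datasets = Files.newDirectoryStream(scratchRoot)) {
      for (Path dataset : datasets) {
        Path store = dataset.resolve("normalizer");
        if (Files.isDirectory(store)) {
          deleteRecursively(store);
        }
      }
    }
  }

  private static void deleteRecursively(Path dir) throws IOException {
    try (Stream<Path> walk = Files.walk(dir)) {
      walk.sorted(Comparator.reverseOrder()).forEach(p -> { // children first
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. the scratch root seen in the logs above
    wipeAllNormalizerStores(Paths.get("/home/col/bin/prod/scratch"));
  }
}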