etf-validator / etf-webapp

:earth_africa: :mag: ETF is an open source testing framework for spatial data and services
https://www.etf-validator.net
European Union Public License 1.2

Partial execution of cleanup operations #208

Closed carlospzurita closed 2 years ago

carlospzurita commented 5 years ago

Description

On the production deployment of the INSPIRE validator, we have detected a problem with the periodic cleanup operations of test data. These operations take place every day at midnight and check whether there are test reports and associated files older than the lifetimes set by the parameters etf.testobject.temporary.lifetime.expiration and etf.testreports.lifetime.expiration in the ETF properties file.
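
For reference, this is roughly how those two properties look in the ETF properties file; the values below (in minutes) are only an example, not our production settings:

    # Example values in minutes; not the actual production configuration
    etf.testobject.temporary.lifetime.expiration = 480
    etf.testreports.lifetime.expiration = 11520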

This operation is not executed completely; some files are left behind. We have identified the following files in the ETF directories:

All these directories and files were created by the jetty user, and both the owner and the group have read and write permission on them.

Because of this problem the size of the ETF file system increases considerably, and every operation (listing test reports, restarting the instance) takes more and more time to execute.

Operating systems and browser

Steps to Reproduce

[Describe the exact steps which can be used to reproduce the problem]

  1. Set a short cleanup time in the properties (60 min, for example)
  2. Create several test runs:
     2.1. Both tests for Metadata 1.3, with a remote and an uploaded file
     2.2. WMS service
     2.3. WFS service
  3. Wait for the cleanup to be executed as scheduled
  4. Run a diff -qr between a copy of the ETF file system before the cleanup and after (see the sketch below)
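
A minimal way to do the comparison in step 4 (the snapshot path is just an example):

    # Snapshot the ETF directory before the scheduled cleanup, then compare afterwards
    cp -a /etf /tmp/etf-before-cleanup
    # ... wait for the scheduled cleanup to run ...
    diff -qr /tmp/etf-before-cleanup /etf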

Expected behavior: [Which behavior you expected to see instead and why.]

The described files should be removed from the file system, given that their TestRun has been removed.

Actual behavior: [Describe the behavior you observed after following the steps]

The files are kept on the file system.

carlospzurita commented 5 years ago

One exception that we keep encountering is this one:

ERROR d.i.e.w.c.TestResultController - Report not found: 
de.interactive_instruments.exceptions.ObjectWithIdNotFoundException: TestRunDao: An object with the id "8c8d8fcf-a732-4adf-b95f-da65406f1f63" could not be found!

Also, we noticed that the files in /etf/ds/attachments and /etf/testdata were not emptied, so we set up a crontab on the cloud instance to execute the following cleanup operations:

sudo find /efs/ds/obj -type f -name 'TestObject-*' -mtime +7 -exec rm {} \;
sudo find /efs/ds/obj -type f -name 'TestRun-*' -mtime +7 -exec rm {} \;
sudo find /efs/ds/obj -type f -name 'TestTaskResult-*' -mtime +7 -exec rm {} \;

sudo find /efs/ds/attachments -name '*' -mtime +7 -exec rm '-rf' {} \;

sudo find /efs/testdata -name '*' -mtime +7 -exec rm '-rf' {} \;
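
As an illustration, the crontab entry looks roughly like this; the script path and the schedule are placeholders, assuming the commands above are collected into a single script:

    # Run the cleanup script every night at 01:00 (path and time are placeholders)
    0 1 * * * /usr/local/bin/etf-cleanup.sh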

Maybe the TestRun references were kept in the database, or in the Java cache, but deleted from the datastore, causing the exception and preventing the cleanup of the rest of the files from executing?

carlospzurita commented 4 years ago

Related to this issue, a problem periodically arises with the database used in the INSPIRE validator. We are not really sure about the cause, but each time the issue can be solved by deleting the database files and recreating the database from the XML files in the obj/ folder.

We can't find a pattern in these errors. One time an incomplete WMS test report was causing the exception in the storage service. Another time the issue seemed to be related to missing EIDs. The last time, the storage service threw a concurrency exception.

This happens after the instance has been running for some time. We don't know whether it is the volume of operations performed against the storage service or the size of the data that causes these inconsistencies. As I said, after deleting the database files and recreating the database, the validator works as intended again, and all the EIDs are recreated just fine. In the workshop organized in Ispra for validator users, we found that this problem was also happening to other users with their own deployed instances, and they periodically had to restart the database.

We think that the cleanup issue may be related to these problems, because the cleanup was also throwing exceptions about missing references. We would like to know if there is some way to configure the BaseX database for a production environment, as we understand that this is not normal behaviour.

Attached you can find a couple of logs that we extracted before restarting the instance. etf-logs.zip

carlospzurita commented 4 years ago

We had another crash on the production instance. This time it was caused by a group of TestRuns that were stuck in execution and prevented any of the other queued tasks from executing. Some of the tests were marked with an illegal state transition from CREATED to FINALIZING. Also, as has happened on other instances, we found references to the ObjectWithIdNotFoundException exception.

We have set up a procedure for these cases: stop the Docker container that is running the validator, delete all the files in the volume folder /etf/ds/db/data, and run the container again. This keeps all the test reports saved in the obj/ folder and just recreates them in BaseX. After this procedure is completed, the validator is back to normal. Until we find the cause of these errors, we think that a periodic restart of the database following this procedure can avoid the crashes.
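
A rough sketch of this procedure (assuming the validator is the only container on the host):

    # Sketch only: stop the container, drop the BaseX data files, start again
    sudo docker stop $(sudo docker ps -q)
    # delete only the database files; the XML reports in obj/ are kept and re-imported on startup
    sudo rm -rf /etf/ds/db/data/*
    sudo docker start $(sudo docker ps -aq)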

We also provide the log file from this latest crash. etf-121219.zip

jonherrmann commented 4 years ago

How many cores does the machine have?

carlospzurita commented 4 years ago

In order to avoid locks in the application and have a fresh start of the database every day, we have included a cron job on the AWS instance to execute a cleanup of objects and a recreation of the database every night. For now, we are testing it on a separate instance that we are using for metadata validation, before setting it up in production.

The code is as follows:

# Remove stored objects older than 7 days
sudo find /efs/ds/obj -type f -name 'TestObject-*' -mtime +7 -exec rm {} \;
sudo find /efs/ds/obj -type f -name 'TestRun-*' -mtime +7 -exec rm {} \;
sudo find /efs/ds/obj -type f -name 'TestTaskResult-*' -mtime +7 -exec rm {} \;

# Remove attachments and test data older than 7 days
sudo find /efs/ds/attachments -name '*' -mtime +7 -exec rm '-rf' {} \;
sudo find /efs/testdata -name '*' -mtime +7 -exec rm '-rf' {} \;

# Pause the validator container, wipe the BaseX data files and restart,
# so the database is rebuilt from the XML files remaining in obj/
sudo docker pause $(sudo docker ps -q)
sudo rm -rf /efs/ds/db/data/*
sudo docker restart $(sudo docker ps -q)

exit 0

For now, the job is executing correctly. We need to see how it works on an instance with a heavy workload.

carlospzurita commented 4 years ago

For the validation of metadata records in the INSPIRE Geoportal, we set up a dedicated instance to be used as an endpoint for a PID workflow. This instance is under heavy use right now, and we have found some issues.

The main thing we noticed is that the instance slows down pretty quickly. With a fresh instance we go through the validations at a good speed, but performance degrades by the hour.

For example, we started with an instance with 2 CPU cores and 8 GB RAM, configured for 2 parallel TestRuns. We started at 500 validations per hour, and after 8 hours we were down to 300 per hour. We then increased the number of parallel TestRuns in the ETF to 6. In this setup we started at 1400 validations per hour, but performance degraded even faster: after 4 hours we were again in the 300-400 validations per hour range.

We increased the capacity of the instance to 4 CPU cores, with 10 parallel TestRuns. The initial rate went up, but within a couple of hours it dropped again. We also tried to parallelize the TestRun output by setting up multiple PDI instances, but this only showed good performance for the first hour or so.

We see that the CPU percentage behaves the same way: we start at around 60% CPU usage and drop to 25-30% very quickly. The thread count stays the same, so the threads must be occupied with disk operations. Looking at iotop on the instance, we see at best 4-5 threads writing to disk.
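
A minimal example of how to watch only the threads that are actually doing I/O (not necessarily the exact command we used):

    # Batch mode, only show threads currently doing I/O, a few iterations
    sudo iotop -o -b -n 3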

[attached screenshot]

We have two possible explanations in mind: either the file system gets overloaded by the number of TestRuns and the disk operations get slower, or the data storage gets significantly slower with the number of TestRuns and TestObjects created. Also, we would like to confirm whether we should continue with a higher number of parallel TestRuns or stick to the number of available CPU cores.

jonherrmann commented 4 years ago

For the validation of metadata records in the INSPIRE Geoportal, we setup a dedicated instance for it to be used an endpoint for a PID workflow.

I need to ask, because I don't understand the abbreviation in this context: what's a PID workflow?

I would like to make sure that we are moving in the right direction: the issue is about stability of ETF, so we have to find a reproducible scenario in which ETF gets into an inconsistent state that can be analysed.

carlospzurita commented 4 years ago

PDI is the acronym for Pentaho Data Integration. Through these scripts we are putting a heavy workload on the instance, and we find that the instance slows down a lot with a large number of TestObjects and TestRuns. Doing the cleanup manually (deleting all the files in the instance volume and launching the Docker container again), using the cleanup shown in https://github.com/etf-validator/etf-webapp/issues/208#issuecomment-574538082, often left the instance in an inconsistent state.

Also, deleting TestRuns and TestObjects using the API operations threw some errors from BaseX:

Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 8.6.7
Java: Oracle Corporation, 1.8.0_66-internal
OS: Linux, amd64
Stack Trace: 
java.lang.ArrayIndexOutOfBoundsException

    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxPreparedDto.lambda$streamTo$0(AbstractBsxPreparedDto.java:100)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.basex.core.BaseXException: Improper use? Potential bug? Your feedback is welcome:
Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 8.6.7
Java: Oracle Corporation, 1.8.0_66-internal
OS: Linux, amd64
Stack Trace: 
java.lang.ArrayIndexOutOfBoundsException

    at org.basex.core.Command.execute(Command.java:94)
    at de.interactive_instruments.etf.dal.dao.basex.BsXQuery.execute(BsXQuery.java:101)
    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxPreparedDto.lambda$streamTo$0(AbstractBsxPreparedDto.java:98)
    ... 1 more
Error 
  SXXP0003: Error reported by XML parser: Premature end of file.
Exception in thread "Thread-8505" de.interactive_instruments.etf.dal.dao.basex.BsxPreparedDtoException: org.basex.core.BaseXException: Improper use? Potential bug? Your feedback is welcome:
Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 8.6.7
Java: Oracle Corporation, 1.8.0_66-internal
OS: Linux, amd64
Stack Trace: 
java.lang.RuntimeException: Data Access out of bounds:
- pre value: 57159
- table size: 56903
- first/next pre value: 56953/56903
- #total/used pages: 368/367
- accessed page: 366 (367 > 366]
    at org.basex.util.Util.notExpected(Util.java:62)
    at org.basex.io.random.TableDiskAccess.cursor(TableDiskAccess.java:455)
    at org.basex.io.random.TableDiskAccess.read1(TableDiskAccess.java:148)
    at org.basex.data.Data.kind(Data.java:293)
    at org.basex.query.value.node.DBNode$4.next(DBNode.java:337)
    at org.basex.query.value.node.DBNode$4.next(DBNode.java:1)
    at org.basex.query.iter.BasicNodeIter$2.hasNext(BasicNodeIter.java:39)
    at org.basex.query.expr.path.IterStep$1.next(IterStep.java:38)
    at org.basex.query.expr.path.IterStep$1.next(IterStep.java:1)
    at org.basex.query.expr.path.IterPath$1.next(IterPath.java:50)
    at org.basex.query.expr.path.IterPath$1.next(IterPath.java:1)
    at org.basex.query.iter.Iter.value(Iter.java:61)
    at org.basex.query.expr.ParseExpr.value(ParseExpr.java:68)
    at org.basex.query.QueryContext.value(QueryContext.java:406)
    at org.basex.query.expr.gflwor.Let$LetEval.next(Let.java:185)
    at org.basex.query.expr.gflwor.GFLWOR.value(GFLWOR.java:70)
    at org.basex.query.QueryContext.value(QueryContext.java:406)
    at org.basex.query.func.StaticFunc.invValue(StaticFunc.java:182)
    at org.basex.query.func.FuncCall.invoke(FuncCall.java:87)
    at org.basex.query.func.FuncCall.value(FuncCall.java:61)
    at org.basex.query.func.FuncCall.iter(FuncCall.java:66)
    at org.basex.query.QueryContext.iter(QueryContext.java:395)
    at org.basex.query.scope.MainModule.iter(MainModule.java:121)
    at org.basex.query.QueryContext.iter(QueryContext.java:331)
    at org.basex.query.QueryProcessor.iter(QueryProcessor.java:90)
    at org.basex.core.cmd.AQuery.query(AQuery.java:92)
    at org.basex.core.cmd.XQuery.run(XQuery.java:22)
    at org.basex.core.Command.run(Command.java:257)
    at org.basex.core.Command.execute(Command.java:93)
    at de.interactive_instruments.etf.dal.dao.basex.BsXQuery.execute(BsXQuery.java:101)
    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxPreparedDto.lambda$streamTo$0(AbstractBsxPreparedDto.java:98)
    at java.lang.Thread.run(Thread.java:745)

So right now, what we have resorted to is completely cleaning the filesystem every night and starting the Docker container again.

carlospzurita commented 4 years ago

Continuing the discussion on performance derived from the monitoring and reporting activity, we have continued to monitor the instances we created and tried to locate the bottlenecks in validation.

After some time testing, we have established a bi-hourly reboot of the AWS instances, trying to find a balance between downtime and validation throughput. Right now we have 10 nodes for Metadata validations, and they are wiped clean every time we reboot the instance: we delete everything on the volume and start a fresh instance each time. We found that keeping the old files sometimes caused inconsistencies when recreating the BaseX database, leading to the same exception mentioned above, so we resorted to cleaning everything. This also reduced the downtime, given that the database had nothing to rebuild.

We have been extracting sample data from TestRun initializations and comparing it to the incoming traffic in AWS, to check if the generation of TestRuns from the JRC premises was correlated with what we saw reflected in the logs.

We show here a sample figure of the number of TestRuns initialized per minute after a restart of the instance:

[attached figure: TestRuns initialized per minute (testrun_count)]

As you can see, after one hour the number drops by half. We also present here an incoming traffic graph for the same period and machine:

[attached figure: incoming network packets (networkpackets_in)]

The figures are similar, so the generation of TestRuns is correlated with the initializations in ETF, as expected. The workflow for executing validations checks the thread pool every time to make sure the new TestRun will be executed, so the queue could stay full for longer and slow down the validation rate.

We checked the DiskUsage metrics, but we don't see a degradation in I/O operations and the burst limit for AWS volumes remains the same, so the cloud provider doesn't seem to be throttling anything.

We are also working on a visualization of initialization time vs completion time for test runs, using the API call /v2/TestRuns and extracting the log files for each TestRun, to check whether the bottleneck is located in operations performed by the TestDrivers during validation.
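
A minimal sketch of the extraction step; the base URL is a placeholder and the response format depends on the requested representation:

    # Pull the current TestRun collection for offline analysis (base URL is a placeholder)
    curl -s "http://localhost:8080/etf-webapp/v2/TestRuns" -o testruns-snapshot.xml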

carlospzurita commented 4 years ago

We would like to update the issue with some more feedback about the ETF operation. We keep seeing these exceptions from BaseX, both at runtime and during the restart of the service at midnight.

Caused by: de.interactive_instruments.exceptions.StorageException: org.basex.core.BaseXException: Improper use? Potential bug? Your feedback is welcome:
Contact: basex-talk@mailman.uni-konstanz.de
Version: BaseX 8.6.7
Java: Oracle Corporation, 1.8.0_66-internal
OS: Linux, amd64
Stack Trace: 
java.lang.ArrayIndexOutOfBoundsException
    at org.basex.io.random.TableDiskAccess.insert(TableDiskAccess.java:347)
    at org.basex.data.Data.insert(Data.java:912)
    at org.basex.data.Data.insert(Data.java:803)
    at org.basex.query.up.atomic.Insert.apply(Insert.java:44)
    at org.basex.query.up.atomic.AtomicUpdateCache.applyUpdates(AtomicUpdateCache.java:297)
    at org.basex.query.up.atomic.AtomicUpdateCache.execute(AtomicUpdateCache.java:281)
    at org.basex.core.cmd.Add$1.run(Add.java:64)
    at org.basex.core.cmd.ACreate.update(ACreate.java:90)
    at org.basex.core.cmd.Add.run(Add.java:56)
    at org.basex.core.Command.run(Command.java:257)
    at org.basex.core.Command.execute(Command.java:93)
    at org.basex.core.Command.execute(Command.java:116)
    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxWriteDao.doMarshallAndAdd(AbstractBsxWriteDao.java:147)
    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxWriteDao.add(AbstractBsxWriteDao.java:82)
    at de.interactive_instruments.etf.webapp.controller.TestObjectController.initResourcesAndAdd(TestObjectController.java:449)
    at de.interactive_instruments.etf.webapp.controller.TestRunController.start(TestRunController.java:497)
    at sun.reflect.GeneratedMethodAccessor343.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:205)
    at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:133)
    at org.springframework.web.servlet.mvc.method.annotation.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:97)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:827)
    at org.springframework.web.servlet.mvc.method.annotation.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:738)
    at org.springframework.web.servlet.mvc.method.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:85)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:967)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:901)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
    at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:872)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:821)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1685)
    at de.interactive_instruments.etf.webapp.filter.ApiFilter.doFilterInternal(ApiFilter.java:89)
    at org.springframework.web.filter.OncePerRequestFilter.doFilter(OncePerRequestFilter.java:107)
    at org.springframework.web.filter.DelegatingFilterProxy.invokeDelegate(DelegatingFilterProxy.java:347)
    at org.springframework.web.filter.DelegatingFilterProxy.doFilter(DelegatingFilterProxy.java:263)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1158)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1090)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:119)
    at org.eclipse.jetty.server.Server.handle(Server.java:517)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:242)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:261)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:75)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:213)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:147)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
    at java.lang.Thread.run(Thread.java:745)

    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxWriteDao.doMarshallAndAdd(AbstractBsxWriteDao.java:151)
    at de.interactive_instruments.etf.dal.dao.basex.AbstractBsxWriteDao.add(AbstractBsxWriteDao.java:82)
    at de.interactive_instruments.etf.webapp.controller.TestObjectController.initResourcesAndAdd(TestObjectController.java:449)
    at de.interactive_instruments.etf.webapp.controller.TestRunController.start(TestRunController.java:497)
    ... 46 common frames omitted

This often results in a complete crash of the system, and even a restart can't solve it without wiping the whole filesystem. An example of this is a log from this past 21st of April, which we are attaching. production20200421.zip

Do you have any leads on how to prevent this from happening, or at least how to make it solvable by simply restarting the service?

Thank you.

jonherrmann commented 2 years ago

EIP #90 should help fix or mitigate the problem. Closing this issue.