Closed tesshucom closed 4 years ago
Fundamental solutions require flow of streamlining. And because it's degrated, you'll need more elaborate testing.
If you value the integrity of the data user use, you may want to prioritize the release of patches that remove the initial scan, and do different phase for fundamental solution. I think this should be decided by Airsonic.
In very limited assumptions, there is a risk that dirty data will be generated in the search index in cases where the music library is replaced and the main unit is replaced (initial startup) after performing a scan. Therefore, at the time of patch release, it is desirable that the data version of IndexManager be incremented at the same time.
Memo of solution
This is my opinion. Not mandatory.
The cause is an automatic scan at initial startup.
The following methods can be considered as common actions
Or
My clones have more additional meta-processing features, so there are more meta-processing tests than Airsonic. Still, I feel that the latter method does not have any major problems.
Of course, in this case, multiple reviewers confirmations are needed, and we need to get a commitment that #1402 will not recur.
This is optional, but UT maintenance is possible. Search testing has the following background:
Lusene's UTs will be merged at different times, so the master tests will need some modification. Unrelated parts should be rewritten.
If we do not want to keep it, we can delete it. However, it is difficult to guarantee operation only by review.
When sticking to this specification, it is necessary to rewrite it to proper hook processing. Or reconsider whether it is so necessary now.
You will probably need to organize issues and divide tasks.
It can be included in the 10.6.0 release plan, or can be patched with 10.5.x only for those parts that can be expedited.
The reason for rushing the repair is to reduce the number of compound bug reports such as #1437. Verification of 10.5.0 is currently difficult.
Thank you for doing the work and examining causes and potential solutions, お疲れさま!
Remove issues from #1376 and #1262 and ensure operation.
I prefer it that way. Here are a few additional comments:
DependsOn
annotation from the daoHelper
bean to the liquibase
bean? Would that solve part of our issues with MediaScannerService
being called at an inconvenient time while the DB is still initializing?IndexManager
bean is initialized? If other dependent beans rely on an usable index it'll be much more consistent if we guarantee it instead of checking "if index exists" every time we want to use it.These would be my solutions of choice for the short term, and should fix #1402 in particular.
Regarding the rest:
Lusene's UTs will be merged at different times, so the master tests will need some modification. Unrelated parts should be rewritten.
What do you mean exactly by this?
If we do not want to keep it, we can delete it. However, it is difficult to guarantee operation only by review.
Exactly, and that's my main worry about removing tests. I would rather do the work to ensure that they pass consistently (or skip them temporarily if we need it), but not remove them.
Test maintenance is a big subject, about which I unfortunately do not know much. I will need to do some research to have a better idea of what we can do.
I think there are a few confusions here, the language might be a bit hard to understand.
First, #1376, at least, doesn't have much to do with scan kick off at startup, it's only orchestrating the regularly scheduled scans, and even then it's purely replacing legacy APIs (TimerTask) with modern ones (ScheduledExecutors) with no change in functionality. It is wholly irrelevant to the problem description being stated (STARTUP scan).
Part of the confusion might be due to this misunderstanding:
As a result, scans can be performed in sub thread apart from the database initialization thread.
This was already happening prior to #1376 (TimerTask was being run in a separate daemon thread: https://github.com/airsonic/airsonic/pull/1376/files#diff-44b0964f8fb27f359ad54d7377ef98b7L88). So again, #1376 seems irrelevant to me. Furthermore, I don't know what this statement means:
1376 was not properly reviewed and verified. It is an extension of #1376.
It was reviewed and approved (as the PR log shows).
Second, the issue being pointed out here doesn't seem to be about DB (as in HSQL) initialization, but the Index (as in Lucene) initialization. While adding DependOn
, as @fxthomas suggested is at the very least a harmless idea (might not be necessary, given that SpringLiquibase is an InitializingBean), it is a separate concern, and I don't think it addresses the Index race condition at all.
Note here, however, that for this case specifically, the exception is EXPECTED. Logging it might cause alarms, but the application recovers from it just fine, and is in fact depending on it to see whether the index exists or not (and thus whether to start a scan at startup).
Other scenarios may or may not be so benign and I can't really say without seeing a full list of issues being caused.
So in short, in my opinion, there are a few solutions depending on the problem type, and it might be helpful to actually segregate the issues to proceed with solutions:
@fxthomas's proposal to materialize an empty index at bean initialization is an option also, but will need some changes as it will directly cause the the startup scan to not work because it EXPECTS the index to NOT be there in order to run.
(However because the corrections were close, probrem could be detected if the review was more careful at #1262 #1376.)
In fact, the issue is the timing and place of the scan hook.
Search might throw exceptions while scan is in progress, but does it behave fine once the scan is finished?
This cannot be guaranteed. This is because the UT to guarantee operation does not work. The problem is not the log size bloat and impression by appearance, but the broken implementation / test cycle and the inability to verify subsequent related defects. A problem may or may not exist, but a condition where we do not know is malignant.
Scans have known bugs. Do not judge that accumulating more problems is benign. UT exists to guarantee search behavior and to narrow down the source of bugs related to scanning and searching.
there are a few solutions depending on the problem type
Exactly so.
1262 is more robust in designs using DB or ehcache.
Writing values to an asynchronous Lucene and using it for transactional flows is a risk.
Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.
This means that there is no need to hardcode as an integral part of the boot processing block to trigger a scan. As pointed out by randomnicode, the fact that scan kick is described in MediaScannerService itself is a conflict.
One suggestion. If the discussion is prolonged or you need time to verify the method, consider roll back or a partial release that removes the problem. It's not a good sign that the master looks like a nightly build.
1376 does not claim to be the direct cause.
(However because the corrections were close, probrem could be detected if the review was more careful at #1262 #1376.)
In fact, the issue is the timing and place of the scan hook.
Again #1376 has nothing to do with the startup scan hook. The line to kick off the startup scan is below the changes made by #1376 .
You can revert #1376, but it won't solve any problems. #1376 is the exact functional equivalent of the master before that, i.e. master before and after #1376 behaves exactly in the same way (a separate thread is scheduled to execute a timed task repeatedly).
This cannot be guaranteed. This is because the UT to guarantee operation does not work. The problem is not the log size bloat and impression by appearance, but the broken implementation / test cycle and the inability to verify subsequent related defects. A problem may or may not exist, but a condition where we do not know is malignant.
I assume UT means unit testing? (apologies if it doesn't). I guess I have trouble understanding what the issue is about. If this issue is about adding more tests, then sure, that's a good idea. But the existing condition isn't, strictly speaking, "unknown". The aforementioned exception thrown is expected and known, it is in fact part of the logic used to kick off the starting scan. You can suppress the exception if it causes alarms.
Writing values to an asynchronous Lucene and using it for transactional flows is a risk.
I guess I don't see what transactional flows we're using it for. Sure, we certainly have synchronous requests coming in, but that's just the nature of the request cycle, and this is a common way when dealing with sync/async boundaries especially at the request level. Your options are either:
You're always going to have this issue when searching, even when the startup scan is disabled because the scan is always async and you're always querying synchronously.
Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.
Okay, then revert #1262 to not store statistics in the index so you don't have to read it from there (and consequently throw exceptions when it's not there). Store the statistics in the airsonic.properties file (as before), or, alternatively, in the db. @muff1nman seemed to explicitly choose the index for whatever reason that he can explain. Alternatively, retain the change and suppress the exception in the logs. The startup scan exception issue, again, in my opinion, seems like not a big issue. It's a benign expected exception that is part of the logic flow.
For what it's worth, in my opinion, I agree with @fxthomas in that I think we should try to retain the first-time startup scan.
Okay, then revert #1262 to not store statistics in the index so you don't have to read it from there (and consequently throw exceptions when it's not there).
It is one of the effective and reliable measures. At least less degraded.
For what it's worth, in my opinion, I agree with @fxthomas in that I think we should try to retain the first-time startup scan.
I am neutral about this. I recognize that there are some barriers, and I'm not opposed to auto-scan itself.
I don't consider it a critical feature to be introduced, even taking the risk of breaking existing impl.
With cases like #1340, you should also consider cases where the lock can occur for hours without notice. It is not a simple improvement, but a feature that may or may not be convenient for the user.
Also some Unit Tests will not work. At least verification before the introduction must be regarded as lacking. Most search tests have the following pre-processing steps:
That is, the normal operation until now of the user is simulated.
If you stick to the first scan, you need to consider the testing method. (Of course, it is necessary.)
You need to modify your tests for the new launch method or introduce a method that allows you to deploy existing tests without modification.
For implementations that are difficult to test, consider the possibility that the functional partitioning design may be inadequate. The design of hard-coding a forced scan at startup is relatively retroactive concept.
If it can be turned ON / OFF as an option, relatively soft landing is possible.
Default is OFF if you consider previous operation specifications. In that case, it does not affect existing tests and you only need to add one case where the automatic scan will work. If both are guaranteed to operate normally, it is also possible to ship a script that turns on automatic scanning in the standard Docker startup script.
:point_up: This is an example that won't surprise existing users. Of cource, ON / OFF policy can be reversed.
So far I can see that we have the following things to fix (let me know if you are okay with editing the first post to put this in a more visible place):
DependsOn
?) (#1460)Just a few additional comments.
Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.
I looked into it a bit more : apparently scanning at startup was also present in Subsonic (or at least the lines that start it date back from it). So it's not new, merely not very well documented/tested?
Is it also be triggered when upgrading the index?
given that SpringLiquibase is an InitializingBean
This means that it runs something once its properties are set, but I was rather thinking about how you could start making queries before the migrations are done (because the DAO beans are brought up concurrently with Liquibase).
The startup scan exception issue, again, in my opinion, seems like not a big issue.
It's fairly small, but I'd like to fix it in order to avoid error reports.
I looked into it a bit more : apparently scanning at startup was also present in Subsonic (or at least the lines that start it date back from it). So it's not new, merely not very well documented/tested?
Yes. The issues involved have already been raised.
There are reports that autoscan does not work after updating lucene. No relevant implementation changes have been made. Lucene updates have been pretty carefully reviewed by many reviewers.
Revision was compared, but there was no difference in behavior before and after the update. Earlier revisions have not been investigated. I don't know when it has been.
For this reason, it is implemented so that Lucene errors are not displayed on the screen even if a scan is not executed at startup. Dedicated warning message is output to the log instead of an exception. ↓
Scanning, creating index is originally not essential for booting the system.
This is what that means. It was implemented and tested to work without problems with very close implementation specifications.
There are old legacy search tests apart from the tests added during the Lucene Update. These tests also do not assume automatic scanning. All search unit tests explicitly kick scans.
This means that it runs something once its properties are set, but I was rather thinking about how you could start making queries before the migrations are done (because the DAO beans are brought up concurrently with Liquibase).
I think (and it's worth fact-checking me) that startup is inherently a single threaded or at least a concurrent affair, otherwise you might get multiple beans of the same type, especially while autowiring singletons. So I think that rules out DAO beans being brought up concurrently with Liquibase. One bean is instantiated at one time, everyone else waits.
If the instantiation delegates internally to another thread and returns control flow to the factory, then I'm not sure how you can solve this using @DependsOn
either.
Second, DAO beans cannot start querying the DB without liquibase because, at least in Spring Boot 2, SpringLiquibase bean has a PostProcessor for ensuring that JDBC Operations are dependent on Liquibase. Flyway uses high ordering instead for InitializingBean. I'm sure it's something similar in Spring Boot 1.
However, admittedly, we do compound our issues because we register a custom SpringLiquibase class which might not behave with the auto orchestration processes of Spring Boot. It's one of the reasons why I prioritized removing the custom SpringLiquibase class.
@muff1nman seemed to explicitly choose the index for whatever reason that he can explain.
My thought was in the case people do things out of band (i.e. remove the index directory, copy in an old index directory, restore a db, etc, storing the statistics in the index will ensure consistency during these operations.
@tesshucom The most important things here for normal operation have been closed off.
After looking at it, the only place where a scan is started during tests is TestCaseUtils.execScan
, which explicitly waits for the scan to finish. I believe this should be safe, at least for now.
Do you mind if I close the issue?
I really appreciate your great support.
I believe this should be safe, at least for now.
Similar recognition. In the sense that it is not a realistic threat.
Strictly speaking, the automatic scan should be stopped when running the test, but it is not a major problem without overload.
Some problems have been exposed when travis previously caused overloading problems. Currently, the risk of overload itself has been reduced by #1242 etc.
Great, then I'm closing this!
Problem description
Airsonic 10.5.0 included a process to change the flow. As a result, scans can be performed in sub thread apart from the database initialization thread.
If a failure occurs, the scan will not be performed correctly. Due to this bug, it is difficult to isolate the cause of the search bug that has occurred since 10.5.0.
Steps to reproduce
1402 points out the most primitive and precise case.
You can also check with the following command.
mvn clean test -DfailIfNoTests=false -Dtest=org/airsonic/player/service/search/*.java
Existing search test cases are based on legacy UTs. Legacy search UT explicitly inputs data before execution, then executes scan.
This doesn't work with current masters. The scan is performed only once. However, it is not the scan controlled by the UT, but the initial scan of the Airsonic itself. As a result, the scan ends with no errors, but no search is possible.
System information
Additional notes
Legacy's startup flow is originally insufficient and partially fazy implemented. (But at least it was working)
The most relevant issues are #1376 and #1262. I do not greatly oppose these. (Although #1262 is more robust in design with DB or ehcache ...)
However, in order to apply these at same time, it is necessary to redesign the startup flow. It is an design error that the timing of the first scan is not the completion of DB initialization, but is left to generation timing of MediaScannerService instance.
Strictly speaking, REST / WEB screen / schedule must not kick scan during DB initialization or scan.
Solution example
https://github.com/tesshucom/jpsonic/commit/299a87679161fc065fa90cea2d38fcd338f94e73
Kick for DB initialization / scheduler registration / first scan. If you do not roll back, these execution order should be redesigned.