airsonic / airsonic

:satellite: :cloud: :notes:Airsonic, a Free and Open Source community driven media server (fork of Subsonic and Libresonic)

https://airsonic.github.io

GNU General Public License v3.0

2.01k stars 238 forks source link

Server startup flow is broken #1444

Closed tesshucom closed 4 years ago

tesshucom commented 4 years ago

Problem description

Airsonic 10.5.0 included a process to change the flow. As a result, scans can be performed in sub thread apart from the database initialization thread.

If a failure occurs, the scan will not be performed correctly. Due to this bug, it is difficult to isolate the cause of the search bug that has occurred since 10.5.0.

Steps to reproduce

1402 points out the most primitive and precise case.

You can also check with the following command.

mvn clean test -DfailIfNoTests=false -Dtest=org/airsonic/player/service/search/*.java

Existing search test cases are based on legacy UTs. Legacy search UT explicitly inputs data before execution, then executes scan.

This doesn't work with current masters. The scan is performed only once. However, it is not the scan controlled by the UT, but the initial scan of the Airsonic itself. As a result, the scan ends with no errors, but no search is possible.

System information

Airsonic version: Airsonic 10.5.0
Operating system: All
Java version: All
Proxy server: All
Client: All
Language: All

Additional notes

Legacy's startup flow is originally insufficient and partially fazy implemented. (But at least it was working)

The most relevant issues are #1376 and #1262. I do not greatly oppose these. (Although #1262 is more robust in design with DB or ehcache ...)

However, in order to apply these at same time, it is necessary to redesign the startup flow. It is an design error that the timing of the first scan is not the completion of DB initialization, but is left to generation timing of MediaScannerService instance.

Strictly speaking, REST / WEB screen / schedule must not kick scan during DB initialization or scan.

Solution example

Delete scan kick of the scheduler post-process.
This disables the first scan. The user must log in and take a scan kick. Still, it's trivial compared to the negative consequences of disability.
A short-term response is not a provisional response, but a part of what should be done finally.

https://github.com/tesshucom/jpsonic/commit/299a87679161fc065fa90cea2d38fcd338f94e73

Kick for DB initialization / scheduler registration / first scan. If you do not roll back, these execution order should be redesigned.

tesshucom commented 4 years ago

Fundamental solutions require flow of streamlining. And because it's degrated, you'll need more elaborate testing.

If you value the integrity of the data user use, you may want to prioritize the release of patches that remove the initial scan, and do different phase for fundamental solution. I think this should be decided by Airsonic.

In very limited assumptions, there is a risk that dirty data will be generated in the search index in cases where the music library is replaced and the main unit is replaced (initial startup) after performing a scan. Therefore, at the time of patch release, it is desirable that the data version of IndexManager be incremented at the same time.

tesshucom commented 4 years ago

Memo of solution

This is my opinion. Not mandatory.

Controlling obstacles

The cause is an automatic scan at initial startup.

The following methods can be considered as common actions

Revert #1376 and #1262. Rollback does not take long to verify.

Remove issues from #1376 and #1262 and ensure operation. Since the problem part is clear, only that part is removed and other operations are checked. This does not mean that new work will occur. #1376 was not properly reviewed and verified. It is an extension of #1376.

My clones have more additional meta-processing features, so there are more meta-processing tests than Airsonic. Still, I feel that the latter method does not have any major problems.

Of course, in this case, multiple reviewers confirmations are needed, and we need to get a commitment that #1402 will not recur.

Relapse prevention

This is optional, but UT maintenance is possible. Search testing has the following background:

1163 Undesirable behavior in certain environments.
1186 Temporarily had set skip when travis overload failure. In other words, currently only manual startup.
1242 Changed tests to single forked mode. Probably from this point, Lusene's UT can be stabilized. That we can eliminate skips.

Lusene's UTs will be merged at different times, so the master tests will need some modification. Unrelated parts should be rewritten.

If we do not want to keep it, we can delete it. However, it is difficult to guarantee operation only by review.

Reconsideration of scan specifications at startup

When sticking to this specification, it is necessary to rewrite it to proper hook processing. Or reconsider whether it is so necessary now.

Some media servers may perform such processing. However, generally, a process involving a long-term lock is started by the user arbitrarily, or a secure cancel process is ensured.
Scanning usually works. On the other hand, edge case failures have been reported. Concerns about the risk of automatically performing unconfirmed processing at startup and increasing the difficulty of verification
At least #1376 does not consider coexistence with existing UT. When changing the server-starting architect, the UT must be rewritten. Alternatively, it can be introduced with the default OFF option to enable disconnection from existing verifications. There should be other ways to reduce the risk, for example, to make it easier to rewrite the default ON / OFF in Docker files.

You will probably need to organize issues and divide tasks.

It can be included in the 10.6.0 release plan, or can be patched with 10.5.x only for those parts that can be expedited.

The reason for rushing the repair is to reduce the number of compound bug reports such as #1437. Verification of 10.5.0 is currently difficult.

fxthomas commented 4 years ago

Thank you for doing the work and examining causes and potential solutions, お疲れさま!

Remove issues from #1376 and #1262 and ensure operation.

I prefer it that way. Here are a few additional comments:

How about adding a DependsOn annotation from the daoHelper bean to the liquibase bean? Would that solve part of our issues with MediaScannerService being called at an inconvenient time while the DB is still initializing?
How about initializing an empty index when the IndexManager bean is initialized? If other dependent beans rely on an usable index it'll be much more consistent if we guarantee it instead of checking "if index exists" every time we want to use it.
I'd to keep the fact index scan on first startup ideally, if the solutions above work it should be safe.

These would be my solutions of choice for the short term, and should fix #1402 in particular.

Regarding the rest:

Lusene's UTs will be merged at different times, so the master tests will need some modification. Unrelated parts should be rewritten.

What do you mean exactly by this?

If we do not want to keep it, we can delete it. However, it is difficult to guarantee operation only by review.

Exactly, and that's my main worry about removing tests. I would rather do the work to ensure that they pass consistently (or skip them temporarily if we need it), but not remove them.

Test maintenance is a big subject, about which I unfortunately do not know much. I will need to do some research to have a better idea of what we can do.

randomnicode commented 4 years ago

I think there are a few confusions here, the language might be a bit hard to understand.

First, #1376, at least, doesn't have much to do with scan kick off at startup, it's only orchestrating the regularly scheduled scans, and even then it's purely replacing legacy APIs (TimerTask) with modern ones (ScheduledExecutors) with no change in functionality. It is wholly irrelevant to the problem description being stated (STARTUP scan).

Part of the confusion might be due to this misunderstanding:

As a result, scans can be performed in sub thread apart from the database initialization thread.

This was already happening prior to #1376 (TimerTask was being run in a separate daemon thread: https://github.com/airsonic/airsonic/pull/1376/files#diff-44b0964f8fb27f359ad54d7377ef98b7L88). So again, #1376 seems irrelevant to me. Furthermore, I don't know what this statement means:

1376 was not properly reviewed and verified. It is an extension of #1376.

It was reviewed and approved (as the PR log shows).

Second, the issue being pointed out here doesn't seem to be about DB (as in HSQL) initialization, but the Index (as in Lucene) initialization. While adding DependOn, as @fxthomas suggested is at the very least a harmless idea (might not be necessary, given that SpringLiquibase is an InitializingBean), it is a separate concern, and I don't think it addresses the Index race condition at all.

1262 might be a culprit for the log exceptions, because it seems to introduce a dependency on MediaScannerService to read the index (it introduces the storage of the Statistics in the index itself, thus creating a circular dependency; previously it only wrote there, Statistics were stored elsewhere). So, for instance, for a new installation, I could see the following sequence easily resulting in an exception being logged:

MediaScannerService instantiated and needs to check if a scan has been done before
It calls IndexManager to retrieve the Statistics (which holds the scan date)
IndexManager sees no index files and throws an exception, which gets logged, and returns null
MediaScannerService gets a null for the Statistics and assumes (rightfully) it doesn't exist, and proceeds to scan.

Note here, however, that for this case specifically, the exception is EXPECTED. Logging it might cause alarms, but the application recovers from it just fine, and is in fact depending on it to see whether the index exists or not (and thus whether to start a scan at startup).

Other scenarios may or may not be so benign and I can't really say without seeing a full list of issues being caused.

So in short, in my opinion, there are a few solutions depending on the problem type, and it might be helpful to actually segregate the issues to proceed with solutions:

Issues where the exception is expected and benign: if the application is expecting the exception and recovers from it gracefully, is it an issue? If the logging is not desirable, we could peg it down to a lower log level. All it seems to be doing is causing alarms.
Issues with a race condition, like maybe trying to do search before the index has materialized. Does the application recover? Search might throw exceptions while scan is in progress, but does it behave fine once the scan is finished? We could add a UI indicator mentioning that the scan is currently under way and therefore search is not currently possible (or will yield old results)?

@fxthomas's proposal to materialize an empty index at bean initialization is an option also, but will need some changes as it will directly cause the the startup scan to not work because it EXPECTS the index to NOT be there in order to run.

tesshucom commented 4 years ago

1376 does not claim to be the direct cause.

(However because the corrections were close, probrem could be detected if the review was more careful at #1262 #1376.)

In fact, the issue is the timing and place of the scan hook.

Search might throw exceptions while scan is in progress, but does it behave fine once the scan is finished?

This cannot be guaranteed. This is because the UT to guarantee operation does not work. The problem is not the log size bloat and impression by appearance, but the broken implementation / test cycle and the inability to verify subsequent related defects. A problem may or may not exist, but a condition where we do not know is malignant.

Scans have known bugs. Do not judge that accumulating more problems is benign. UT exists to guarantee search behavior and to narrow down the source of bugs related to scanning and searching.

there are a few solutions depending on the problem type

Exactly so.

1262 is more robust in designs using DB or ehcache.

Writing values to an asynchronous Lucene and using it for transactional flows is a risk.

Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.

This means that there is no need to hardcode as an integral part of the boot processing block to trigger a scan. As pointed out by randomnicode, the fact that scan kick is described in MediaScannerService itself is a conflict.

One suggestion. If the discussion is prolonged or you need time to verify the method, consider roll back or a partial release that removes the problem. It's not a good sign that the master looks like a nightly build.

randomnicode commented 4 years ago

1376 does not claim to be the direct cause.

(However because the corrections were close, probrem could be detected if the review was more careful at #1262 #1376.)

In fact, the issue is the timing and place of the scan hook.

Again #1376 has nothing to do with the startup scan hook. The line to kick off the startup scan is below the changes made by #1376 .

You can revert #1376, but it won't solve any problems. #1376 is the exact functional equivalent of the master before that, i.e. master before and after #1376 behaves exactly in the same way (a separate thread is scheduled to execute a timed task repeatedly).

This cannot be guaranteed. This is because the UT to guarantee operation does not work. The problem is not the log size bloat and impression by appearance, but the broken implementation / test cycle and the inability to verify subsequent related defects. A problem may or may not exist, but a condition where we do not know is malignant.

I assume UT means unit testing? (apologies if it doesn't). I guess I have trouble understanding what the issue is about. If this issue is about adding more tests, then sure, that's a good idea. But the existing condition isn't, strictly speaking, "unknown". The aforementioned exception thrown is expected and known, it is in fact part of the logic used to kick off the starting scan. You can suppress the exception if it causes alarms.

Writing values to an asynchronous Lucene and using it for transactional flows is a risk.

I guess I don't see what transactional flows we're using it for. Sure, we certainly have synchronous requests coming in, but that's just the nature of the request cycle, and this is a common way when dealing with sync/async boundaries especially at the request level. Your options are either:

make search requests (or anything to do with the index) asynchronous, at the risk of massive UI and architectural design changes, like indicate "search was submitted, clients check or poll again to see results when they become available"
make search requests quasi-asynchronous, at the risk of non-functional stalled UIs/elevated request response times, such as block the request thread while polling the answer from the async interface, and return when the async returns an answer, like "search was submitted, the UI is now waiting while the backend polls the async system for whenever the answer returns" (if async process takes 30 mins, then the request is blocked for 30 mins). Of course some of these problems you can alleviate with timeouts.
continue in a sync fashion and fail fast, but indicate in the UI/response that the request may be affected by the limitations of a sync process calling an async one especially when a potential conflict exists (such as a scan is in effect).

You're always going to have this issue when searching, even when the startup scan is disabled because the scan is always async and you're always querying synchronously.

Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.

Okay, then revert #1262 to not store statistics in the index so you don't have to read it from there (and consequently throw exceptions when it's not there). Store the statistics in the airsonic.properties file (as before), or, alternatively, in the db. @muff1nman seemed to explicitly choose the index for whatever reason that he can explain. Alternatively, retain the change and suppress the exception in the logs. The startup scan exception issue, again, in my opinion, seems like not a big issue. It's a benign expected exception that is part of the logic flow.

For what it's worth, in my opinion, I agree with @fxthomas in that I think we should try to retain the first-time startup scan.

tesshucom commented 4 years ago

Okay, then revert #1262 to not store statistics in the index so you don't have to read it from there (and consequently throw exceptions when it's not there).

It is one of the effective and reliable measures. At least less degraded.

For what it's worth, in my opinion, I agree with @fxthomas in that I think we should try to retain the first-time startup scan.

I am neutral about this. I recognize that there are some barriers, and I'm not opposed to auto-scan itself.

I don't consider it a critical feature to be introduced, even taking the risk of breaking existing impl.

With cases like #1340, you should also consider cases where the lock can occur for hours without notice. It is not a simple improvement, but a feature that may or may not be convenient for the user.

Also some Unit Tests will not work. At least verification before the introduction must be regarded as lacking. Most search tests have the following pre-processing steps:

Starting the server
Register Music Directory that contains the data to be tested
Start scan
Start test

That is, the normal operation until now of the user is simulated.

If you stick to the first scan, you need to consider the testing method. (Of course, it is necessary.)

You need to modify your tests for the new launch method or introduce a method that allows you to deploy existing tests without modification.

For implementations that are difficult to test, consider the possibility that the functional partitioning design may be inadequate. The design of hard-coding a forced scan at startup is relatively retroactive concept.

If it can be turned ON / OFF as an option, relatively soft landing is possible.

Default is OFF if you consider previous operation specifications. In that case, it does not affect existing tests and you only need to add one case where the automatic scan will work. If both are guaranteed to operate normally, it is also possible to ship a script that turns on automatic scanning in the standard Docker startup script.

:point_up: This is an example that won't surprise existing users. Of cource, ON / OFF policy can be reversed.

fxthomas commented 4 years ago

So far I can see that we have the following things to fix (let me know if you are okay with editing the first post to put this in a more visible place):

[x] Use of index before it is fully created (w. exceptions in the logs) (#1459)
[x] Use of statistics before they are fully created (w. exceptions in the logs)
[x] Potential race condition between DAOs and Liquibase (solved by DependsOn?) (#1460)
[ ] Potentially unreliable testing for Airsonic startup
- [ ] At first initialization
- [ ] When migrating from an older version
[x] Potentailly unreliable testing when the tests require a scan

Just a few additional comments.

Scanning, creating index is originally not essential for booting the system. Earlier versions were implemented so that no exception would be raised if these were not done.

I looked into it a bit more : apparently scanning at startup was also present in Subsonic (or at least the lines that start it date back from it). So it's not new, merely not very well documented/tested?

Is it also be triggered when upgrading the index?

given that SpringLiquibase is an InitializingBean

This means that it runs something once its properties are set, but I was rather thinking about how you could start making queries before the migrations are done (because the DAO beans are brought up concurrently with Liquibase).

The startup scan exception issue, again, in my opinion, seems like not a big issue.

It's fairly small, but I'd like to fix it in order to avoid error reports.

tesshucom commented 4 years ago

I looked into it a bit more : apparently scanning at startup was also present in Subsonic (or at least the lines that start it date back from it). So it's not new, merely not very well documented/tested?

Yes. The issues involved have already been raised.

1234

There are reports that autoscan does not work after updating lucene. No relevant implementation changes have been made. Lucene updates have been pretty carefully reviewed by many reviewers.

Revision was compared, but there was no difference in behavior before and after the update. Earlier revisions have not been investigated. I don't know when it has been.

For this reason, it is implemented so that Lucene errors are not displayed on the screen even if a scan is not executed at startup. Dedicated warning message is output to the log instead of an exception. ↓

Scanning, creating index is originally not essential for booting the system.

This is what that means. It was implemented and tested to work without problems with very close implementation specifications.

There are old legacy search tests apart from the tests added during the Lucene Update. These tests also do not assume automatic scanning. All search unit tests explicitly kick scans.

randomnicode commented 4 years ago

This means that it runs something once its properties are set, but I was rather thinking about how you could start making queries before the migrations are done (because the DAO beans are brought up concurrently with Liquibase).

I think (and it's worth fact-checking me) that startup is inherently a single threaded or at least a concurrent affair, otherwise you might get multiple beans of the same type, especially while autowiring singletons. So I think that rules out DAO beans being brought up concurrently with Liquibase. One bean is instantiated at one time, everyone else waits.

If the instantiation delegates internally to another thread and returns control flow to the factory, then I'm not sure how you can solve this using @DependsOn either.

Second, DAO beans cannot start querying the DB without liquibase because, at least in Spring Boot 2, SpringLiquibase bean has a PostProcessor for ensuring that JDBC Operations are dependent on Liquibase. Flyway uses high ordering instead for InitializingBean. I'm sure it's something similar in Spring Boot 1.

However, admittedly, we do compound our issues because we register a custom SpringLiquibase class which might not behave with the auto orchestration processes of Spring Boot. It's one of the reasons why I prioritized removing the custom SpringLiquibase class.

muff1nman commented 4 years ago

@muff1nman seemed to explicitly choose the index for whatever reason that he can explain.

My thought was in the case people do things out of band (i.e. remove the index directory, copy in an old index directory, restore a db, etc, storing the statistics in the index will ensure consistency during these operations.

fxthomas commented 4 years ago

@tesshucom The most important things here for normal operation have been closed off.

After looking at it, the only place where a scan is started during tests is TestCaseUtils.execScan, which explicitly waits for the scan to finish. I believe this should be safe, at least for now.

Do you mind if I close the issue?

tesshucom commented 4 years ago

I really appreciate your great support.

I believe this should be safe, at least for now.

Similar recognition. In the sense that it is not a realistic threat.

Strictly speaking, the automatic scan should be stopped when running the test, but it is not a major problem without overload.

Some problems have been exposed when travis previously caused overloading problems. Currently, the risk of overload itself has been reduced by #1242 etc.

fxthomas commented 4 years ago

Great, then I'm closing this!

airsonic / airsonic

Server startup flow is broken #1444

Problem description

Steps to reproduce

1402 points out the most primitive and precise case.

System information

Additional notes

Solution example

Controlling obstacles

Relapse prevention

1163 Undesirable behavior in certain environments.

1186 Temporarily had set skip when travis overload failure. In other words, currently only manual startup.

1242 Changed tests to single forked mode. Probably from this point, Lusene's UT can be stabilized. That we can eliminate skips.

Reconsideration of scan specifications at startup

1376 was not properly reviewed and verified. It is an extension of #1376.

1376 does not claim to be the direct cause.

1262 is more robust in designs using DB or ehcache.

1376 does not claim to be the direct cause.

1234