@Diaoul, @ymartin59, any thoughts on this?
EDIT: So apparently, the actual error in the application log was this:
[2023-04-25 09:50:00 +0000] [8] [ERROR] Error handling request /packages
Traceback (most recent call last):
File "/usr/src/app/spkrepo/views/api.py", line 264, in post
build.save(spk.stream)
File "/usr/src/app/spkrepo/models.py", line 357, in save
with io.open(
FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/lidarr.v11.f42661[comcerto2k].spk'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 136, in handle
self.handle_request(listener, req, client, addr)
File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
respiter = self.wsgi(environ, resp.start_response)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2091, in __call__
return self.wsgi_app(environ, start_response)
File "/usr/local/lib/python3.9/site-packages/werkzeug/middleware/proxy_fix.py", line 187, in __call__
return self.app(environ, start_response)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2076, in wsgi_app
response = self.handle_exception(e)
File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 271, in error_router
return original_handler(e)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
response = self.full_dispatch_request()
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 271, in error_router
return original_handler(e)
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
File "/usr/src/app/spkrepo/views/api.py", line 47, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 467, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/flask/views.py", line 84, in view
return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
resp = meth(*args, **kwargs)
File "/usr/src/app/spkrepo/views/api.py", line 278, in post
abort(500, message="Failed to save files", details=e.message)
AttributeError: 'FileNotFoundError' object has no attribute 'message'
Based on the above, the actual cause of the (500) Internal Server Error was an unhandled FileNotFoundError exception. I've authored PR #96 to address this.
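For context on the AttributeError itself: Python 3 removed the legacy `.message` attribute from exceptions, so `str(e)` (or `e.args`) is the portable accessor. Below is a minimal sketch of the failure and the safe pattern (illustrative only, not the actual PR #96 diff):

```python
def error_details(e: Exception) -> str:
    # Python 3 exceptions have no `.message` attribute, so accessing it
    # raises AttributeError; str(e) is the portable way to get the text.
    return str(e)

captured = None
try:
    open("/no-such-dir/file.spk", "rb")   # raises FileNotFoundError
except FileNotFoundError as exc:
    captured = exc

assert not hasattr(captured, "message")   # this is what broke the error handler
print(error_details(captured))            # "[Errno 2] No such file or directory: ..."
```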
This leaves the question of the original issue of the uploaded file from the stream not being present at the expected path during publishing. From examining the server code my understanding of the publishing process is as follows:
Now, the error seen (FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/...) suggests that the named spk file could not be saved. It is possible that while the path /data/lidarr/ does exist (since there are other versions present), the version sub-directory does not. A missing sub-directory would cause the save to fail.
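This failure mode is easy to reproduce with the standard library alone: `io.open()` will not create missing parent directories, so writing into an absent version folder raises exactly this error. A minimal sketch (the paths are illustrative stand-ins for DATA_PATH and the package/version layout):

```python
import io
import os
import tempfile

root = tempfile.mkdtemp()                 # stands in for DATA_PATH
pkg_dir = os.path.join(root, "lidarr")
os.mkdir(pkg_dir)                         # the package directory exists
ver_dir = os.path.join(pkg_dir, "11")     # the version sub-directory does NOT

err = None
try:
    # io.open() never creates missing parent directories, so writing the
    # spk into the absent version folder fails exactly as in the log.
    with io.open(os.path.join(ver_dir, "test.spk"), "wb") as f:
        f.write(b"spk")
except FileNotFoundError as exc:
    err = exc

print(err.errno)  # 2 (ENOENT)
```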
I suspect that there may have been some sort of corruption when the first package from lidarr v11 was uploaded. The admin interface reports the following:
However, when we look at the webpage it shows the following:
This version should not be visible since it is disabled in the admin interface. The application icon on the webpage was also not visible (before v12 was uploaded), which suggests that the code that runs when a new Version object is created was not initially executed.
I was publishing some other packages and I see the same corruption happening with prowlarr v6. The admin interface reports the following:
However, the webpage at https://synocommunity.com/package/prowlarr
shows the following:
Note the missing icon as I indicated above with lidarr v11 when no newer version is present. I believe this happens because this code is not being executed:
Yes there must've been a failure at a time that did create the version but not the appropriate directories, causing subsequent ones to fail. Maybe the best move is to delete the version and re-upload everything?
> Yes there must've been a failure at a time that did create the version but not the appropriate directories, causing subsequent ones to fail. Maybe the best move is to delete the version and re-upload everything?
Thanks for your feedback. Before deleting the version, can you check:

1. whether the application logs show FileNotFoundError: [Errno 2] No such file or directory errors;
2. whether /data/prowlarr/6/ exists, and if so, whether anything is in that directory?

If the checks above confirm the theory, then I'd request that prowlarr v6 be deleted. I don't think I have permissions to do that.
Hey @Diaoul, any feedback on the above?
In the meantime, would you be so kind as to delete this version:
@mreid-tt I have deleted lidarr v11 for you, please let me know of any others that you'd like to delete
thanks @publicarray, I was hoping we could check the application logs and the file system for prowlarr to confirm the theory above. But if this is a hassle we can just delete the problematic version:
@mreid-tt the folder was never created...
publicarray@vps-b34822e4 /e/synocommunity> ls data/prowlarr
4/ 5/
I have deleted the broken version
We had problems uploading before (due to load) but never like this (missing folders/files)...
> @mreid-tt the folder was never created...
> publicarray@vps-b34822e4 /e/synocommunity> ls data/prowlarr
> 4/ 5/
Thanks for the confirmation.
> I have deleted the broken version
Great! I'll proceed with a fresh build and upload.
> We had problems uploading before (due to load) but never like this (missing folders/files)...
Hmm, okay. Perhaps a possible fix would be to adjust the code to verify that a folder is created (for both the package and version functions) before committing a new entry to the database.
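A rough sketch of that verification idea, using hypothetical names (`ensure_dir`, the temp paths) rather than the actual spkrepo functions:

```python
import os
import tempfile

def ensure_dir(dir_path):
    """Create dir_path if missing, then verify it really exists before the
    caller commits anything to the database (illustrative sketch only)."""
    if not os.path.exists(dir_path):
        os.mkdir(dir_path)
    if not os.path.isdir(dir_path):
        raise RuntimeError("directory was not created: %s" % dir_path)
    return dir_path

data_path = tempfile.mkdtemp()  # stands in for DATA_PATH
pkg_dir = ensure_dir(os.path.join(data_path, "prowlarr"))  # package folder
ver_dir = ensure_dir(os.path.join(pkg_dir, "6"))           # version folder
print(os.path.isdir(ver_dir))  # True
```

The point of the second check is that the database insert only happens once the filesystem state has been positively confirmed, rather than assumed from the absence of an exception.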
My 2 cents, but maybe only upload one file at a time. IIRC Python/PostgreSQL are bad at handling multiple simultaneous upload requests. This smells like a race condition.
> My 2 cents, but maybe only upload one file at a time. IIRC Python/PostgreSQL are bad at handling multiple simultaneous upload requests. This smells like a race condition.
That makes sense but with a GitHub build how would we do that? All the archs build concurrently.
> I have deleted the broken version
Can you confirm this was done? I'm still seeing this:
Hm, maybe build them without publishing on GitHub and download the result, then manually create the upload request with httpie. I know it's a workaround, but we could also test whether, for example, uploading all packages at once causes this problem again.
Maybe your page is cached. Edit: sorry I'm not sure, but I deleted v6 again, and it is now gone.
I think you may be correct. I did the new run and got this in one of the logs:
Package published successfully
{"package": "prowlarr", "version": "20230427-6", "firmware": "6.1-15047", "architectures": ["apollolake", "avoton", "braswell", "broadwell", "broadwellnk", "broadwellntbap", "bromolow", "cedarview", "denverton", "dockerx64", "geminilake", "grantley", "purley", "kvmx64", "v1000", "x86", "x86_64"]}
But then right after, I get the following in all the other archs:
ERROR: Failed to publish package - HTTP response code 500
We may have the same situation again. If you can, please remove prowlarr v6 again because I suspect we have re-created the same situation. I will try a manual upload one file at a time once you have.
> Edit: sorry I'm not sure, but I deleted v6 again, and it is now gone.
Okay, thanks. I'll retry, one at a time now.
we got one :)
publicarray@vps-b34822e4 /e/synocommunity> ls -l data/prowlarr/6
total 55524
-rw-r--r-- 1 root root 44917 Apr 27 12:25 icon_256.png
-rw-r--r-- 1 root root 7211 Apr 27 12:25 icon_72.png
-rw-r--r-- 1 root root 56801280 Apr 27 12:25 'prowlarr.v6.f15047[apollolake-avoton-braswell-broadwell-broadwellnk-broadwellntbap-bromolow-cedarview-denverton-dockerx64-geminilake-grantley-purley-kvmx64-v1000-x86-x86_64].spk'
Sending one at a time worked! All archs successful: https://github.com/mreid-tt/spksrc/actions/runs/4819585180
EDIT: I'll look into making a PR to verify that the folder has been created for both package and version. Running parallel processes does not seem to be an issue once the first one correctly creates the folder.
Hi @Diaoul, now that we have conclusively proven that the issue is rooted in the version folder not being created, this suggests that this code block is not executed:
If this is the case, how could we make the function more resilient? Should the os.mkdir() call be put into a try block to catch any exceptions? Should we have a line right afterwards to verify that the path exists? The worry I have is that the function thinks the command was successful. This is further confirmed by the log, which reports the following back in GitHub:
Package published successfully
{"package": "prowlarr", "version": "20230426-6", "firmware": "6.1-15047", "architectures": ["evansport"]}
Why might this be the case? Are the filesystem actions for directory creation and build.save() being executed asynchronously, never completing even though the function believes they did? For the sequence of events, when exactly is the new version added to the database?

If it is not asynchronous, are multiple function calls trying to create the same version when GitHub runs parallel build operations? If so, why wouldn't at least one of those calls complete? That way, even if the other parallel calls fail (or do nothing because the folder exists), the successful one should create the folder, shouldn't it?
Interested in your thoughts before I create a PR so that the solution is effective.
My guess is that if a version exists in the db but the transaction isn't finished yet, and another one is created at the same time, we get this condition. I'm not sure why not even one mkdir command is executed, though. Maybe both transactions exit early because a version already exists?
I also noted that if there is an exception, files are removed so what happens when a file is created and at the same time removed by a different request?
> I also noted that if there is an exception, files are removed so what happens when a file is created and at the same time removed by a different request?
Ahh! That's a good point, you are referring to this part of the code:
So, it's possible it was created just fine but there was some sort of exception (perhaps even in creating the icons or saving the build), in which case the entire version folder would be dumped. If so, then the version should also be removed from the database (or not added, if this is done later). I need to better understand the database interactions to confirm.
Yea. Imagine 2 requests. Neither sees a version, so both try to create one in the DB. The DB will reject one as it already exists (consistency in the DB). So one succeeds and one fails (raising an exception), removing the folder, but the DB is updated to have a version. Just a theory.
Filtering the exception for 'version already exists' and not removing the files should prevent this.
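The interleaving described in the theory above can be simulated deterministically in a few lines. This is a toy sketch only: the `set` stands in for the versions table and all names are illustrative, not spkrepo code:

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
versions_db = set()                       # stands in for the versions table
ver_dir = os.path.join(root, "prowlarr-6")

def create_version(name):
    if name in versions_db:
        raise ValueError("version already exists")   # db uniqueness constraint
    versions_db.add(name)

# Request A: inserts the version row and creates the folder.
create_version("prowlarr-6")
os.mkdir(ver_dir)

# Request B: its duplicate insert fails, and its cleanup handler removes
# the folder even though request A still needs it.
try:
    create_version("prowlarr-6")
except ValueError:
    shutil.rmtree(ver_dir, ignore_errors=True)       # over-eager cleanup

# Request A now tries to save the build and hits FileNotFoundError:
# the version row exists, but the folder is gone.
try:
    with open(os.path.join(ver_dir, "pkg.spk"), "wb") as f:
        f.write(b"spk")
except FileNotFoundError:
    print("same failure mode as the server log")
```

The end state matches what was observed on the server: the version exists in the database, but its directory does not.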
So I shared our problem with ChatGPT and it suggested some approaches which I have included in a proposed fix https://github.com/SynoCommunity/spkrepo/pull/100. I would appreciate a review in that PR.
Thinking a bit about this, I am not sure why we need to clean up the directory structure :shrug: We could very well make sure the path always exists. However, this could hide another consistency problem (like in #100)
I think I agree with @Diaoul here; using locks just overcomplicates things. I think just checking for an exception like "version already exists" and continuing anyway, rather than cleaning up the folders, is enough in this case. I'd like to mention that what I think is happening is a hypothesis and not yet proven.
There could have been an issue with the Redis or database connection. Unfortunately, there aren't timestamps in the app logs. (Unlikely the cause, since the version exists in the database.)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
Is the server running on host "db" (172.18.0.3) and accepting
TCP/IP connections on port 5432?
redis.exceptions.ConnectionError: Error while reading from redis:6379 : (104, 'Connection reset by peer')
Exception possibly due to cache backend.
edit: never mind, I can get the timestamps from docker.
Thanks for the additional perspectives. Taking a step back, I believe the issue is really about the filesystem and not the database. Re-looking at the code we have these operations to create folders:
My understanding is that if the folder already exists, os.mkdir() will raise a FileExistsError. This is another possible reason for the try block failing, and a more likely scenario when multiple workers add a new build. I've done another PR with a far simpler fix for consideration in #101.
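The standard library alone reproduces the Errno 17 from the logs and shows a race-tolerant alternative. A minimal sketch (the paths are illustrative; this is not the actual #101 diff):

```python
import errno
import os
import tempfile

pkg_dir = os.path.join(tempfile.mkdtemp(), "prowlarr")
os.mkdir(pkg_dir)            # first worker wins the race

caught = None
try:
    os.mkdir(pkg_dir)        # second worker racing on the same path
except FileExistsError as exc:
    caught = exc

print(caught.errno == errno.EEXIST)   # True: [Errno 17] File exists

# os.makedirs with exist_ok=True is idempotent under this race:
os.makedirs(pkg_dir, exist_ok=True)   # no exception
```

`os.makedirs(..., exist_ok=True)` stays quiet even if another worker creates the directory between an existence check and the call, which is what makes it attractive for parallel uploads.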
> Thinking a bit about this, I am not sure why we need to clean up the directory structure 🤷 We could very well make sure the path always exists.
I actually like the elegance of cleaning up the directory structure on an exception so you maintain a clean filesystem. We could adjust the code to something like this:
try:
    data_path = current_app.config["DATA_PATH"]
    dir_path = os.path.join(data_path, package.name)
    if not os.path.exists(dir_path):
        os.mkdir(dir_path)
But I really believe that the exception in the try block is more likely caused by the folder already existing, which can easily be mitigated with the existing code structure.
@publicarray, looking back at the application logs you kindly provided, I believe this new theory is correct. In the logs we first see this error:
[2023-04-16 12:32:53 +0000] [7] [ERROR] Error handling request /packages
Traceback (most recent call last):
File "/usr/src/app/spkrepo/views/api.py", line 261, in post
os.mkdir(os.path.join(data_path, package.name, str(version.version)))
FileExistsError: [Errno 17] File exists: '/data/lidarr/11'
Immediately after that we have these errors and similar:
[2023-04-16 12:33:40 +0000] [10] [ERROR] Error handling request /packages
Traceback (most recent call last):
File "/usr/src/app/spkrepo/views/api.py", line 264, in post
build.save(spk.stream)
File "/usr/src/app/spkrepo/models.py", line 357, in save
with io.open(
FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/lidarr.v11.f42661[comcerto2k].spk'
I believe this supports the theory that:
I am not sure why we would end up publishing multiple packages at the same time, maybe we can fix that instead? The code is full of race conditions and absolutely not ready to handle parallel uploads of the same packages.
Also worth noting that it seems it's the first time it occurs in close to 8 years (I've lost count) of this code being in production. Unless we're confident the code will not break elsewhere when introducing a fix, I would be OK with a one-time procedure to manually fix this and document it somewhere :grimacing:
> I am not sure why we would end up publishing multiple packages at the same time, maybe we can fix that instead? The code is full of race conditions and absolutely not ready to handle parallel uploads of the same packages.
From my understanding, we have been doing this for over two years, based on the age of the wiki page "Publish a package from Github". To my knowledge there isn't a way to tell the GitHub workflow not to run in parallel. As such, once we use it, there is the potential to be sending multiple builds.
> Also worth noting that it seems it's the first time it occurs in close to 8 years (I've lost count) of this code being in production. Unless we're confident the code will not break elsewhere when introducing a fix, I would be OK with a one-time procedure to manually fix this and document it somewhere 😬
I understand the hesitation in making such a fundamental change, but I'd like to share another perspective: with icon.save(), the existing icons should be overwritten, which should also be okay.

Thanks for the additional context, it makes sense to try with this then :+1:
For some history: the first CI used Travis, and CircleCI processed one arch at a time. It was also expected that every dev would build on their own device before publishing, so that they would test the build locally as well (that hasn't changed), but with GitHub Actions it became easier and we implemented the publishing part too. So if you build locally and test one or two architectures, you don't need to build all of them to publish, lowering the barrier and the time spent building a bit.
When attempting to publish a package the following is reported:
This appears to happen consistently for moderately sized SPK files (50-75 MB) and does not appear to be related to time of day (server load).