SynoCommunity / spkrepo

Synology Package Repository
http://spkrepo.readthedocs.org
MIT License

Unable to publish package (500 Internal Server Error) #95

Closed · mreid-tt closed this issue 1 year ago

mreid-tt commented 1 year ago

When attempting to publish a package the following is reported:

HTTP/1.1 500 Internal Server Error
Connection: keep-alive
Content-Length: 141
Content-Type: text/html
Date: Sun, 23 Apr 2023 12:02:35 GMT
Server: nginx/1.23.2

<html>
  <head>
    <title>Internal Server Error</title>
  </head>
  <body>
    <h1><p>Internal Server Error</p></h1>

  </body>
</html>

This appears to be happening consistently for moderately sized SPK files (50-75 MB) and does not seem to correlate with time of day (i.e. server load).

mreid-tt commented 1 year ago

@Diaoul, @ymartin59, any thoughts on this?

EDIT: So apparently, the actual error in the application log was this:

[2023-04-25 09:50:00 +0000] [8] [ERROR] Error handling request /packages
Traceback (most recent call last):
  File "/usr/src/app/spkrepo/views/api.py", line 264, in post
    build.save(spk.stream)
  File "/usr/src/app/spkrepo/models.py", line 357, in save
    with io.open(
FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/lidarr.v11.f42661[comcerto2k].spk'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 136, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2091, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/werkzeug/middleware/proxy_fix.py", line 187, in __call__
    return self.app(environ, start_response)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2076, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 2073, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1518, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 271, in error_router
    return original_handler(e)
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1502, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/src/app/spkrepo/views/api.py", line 47, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 467, in wrapper
    resp = resource(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/flask/views.py", line 84, in view
    return current_app.ensure_sync(self.dispatch_request)(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/flask_restful/__init__.py", line 582, in dispatch_request
    resp = meth(*args, **kwargs)
  File "/usr/src/app/spkrepo/views/api.py", line 278, in post
    abort(500, message="Failed to save files", details=e.message)
AttributeError: 'FileNotFoundError' object has no attribute 'message'

Based on the above, the actual cause of the 500 Internal Server Error was an unhandled FileNotFoundError exception, which the error handler then mishandled by referencing a non-existent e.message attribute. I've authored PR #96 to address this.
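
For context, the compounding failure is that the handler assumes the exception has a .message attribute, which FileNotFoundError does not provide. A minimal sketch of the kind of change PR #96 could make (the exact implementation may differ, and the wrapper name below is hypothetical) is to report the exception text via str(e) instead:

from flask_restful import abort

def save_build_or_500(build, stream):
    # Hypothetical wrapper mirroring the handler in spkrepo/views/api.py.
    try:
        build.save(stream)
    except OSError as e:  # FileNotFoundError is a subclass of OSError
        # str(e) is always available, unlike the e.message attribute that
        # triggered the AttributeError seen in the traceback above.
        abort(500, message="Failed to save files", details=str(e))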

mreid-tt commented 1 year ago

This still leaves the original issue: the uploaded file from the stream was not present at the expected path during publishing. From examining the server code, my understanding of the publishing process is as follows:

  1. When a new package is posted, the server creates an SPK object. This object is based on the SPK class and is passed the request data as its argument
  2. The server parses the SPK file content to extract the necessary information such as firmware, package name, version, etc.
  3. The server creates a Package object if one doesn't already exist, based on the package name in the request
  4. The server then creates a Version object for the package, based on the package name and version number in the request. If a Version object with the same version number already exists, the new version is not created
  5. The server also creates a Build object, which is not active by default. If a Build object with the same architectures exists, a conflict error is returned
  6. The server saves the files by (a) creating a new directory for the package if a new Package object was created, (b) creating a new version sub-directory if a new Version object was created, and (c) saving the named SPK file (using the SPK object) to this version sub-directory

Now, the error seen (FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/...') suggests that the attempt to save the named SPK file fails. It is possible that while the path /data/lidarr/ does exist (since there are other versions present), the version sub-directory does not; a missing sub-directory would cause the save to fail.
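
A minimal, self-contained sketch of this failure mode (the paths are illustrative stand-ins, not the real data directory):

import io
import os
import tempfile

# Simulate a package directory that exists but has no version sub-directory.
data_path = tempfile.mkdtemp()
os.mkdir(os.path.join(data_path, "lidarr"))  # the package directory exists

spk_path = os.path.join(data_path, "lidarr", "11", "lidarr.spk")
try:
    # Mirrors Build.save(), which opens the target file for writing without
    # creating any missing parent directories.
    with io.open(spk_path, "wb") as f:
        f.write(b"spk contents")
except FileNotFoundError as e:
    print(e)  # [Errno 2] No such file or directory: '.../lidarr/11/lidarr.spk'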

mreid-tt commented 1 year ago

I suspect that there may have been some sort of corruption when the first package from lidarr v11 was uploaded. The admin interface reports the following:

[Screenshot: admin interface, 2023-04-25 1:03 PM]

However, when we look at the webpage it shows the following:

[Screenshot: package webpage, 2023-04-25 1:42 PM]

This version should not be visible, since it is disabled in the admin interface. The application icon on the webpage was also not visible (before v12 was uploaded), which suggests that the code that runs when a new Version object is created was not initially executed.

mreid-tt commented 1 year ago

I was publishing some other packages and saw the same corruption happening with prowlarr v6. The admin interface reports the following:

[Screenshot: admin interface, 2023-04-26 7:53 AM]

However, the webpage at https://synocommunity.com/package/prowlarr shows the following:

[Screenshot: prowlarr package webpage, 2023-04-26 7:54 AM]

Note the missing icon, as I indicated above for lidarr v11 when no newer version is present. I believe this happens because this code is not being executed:

https://github.com/SynoCommunity/spkrepo/blob/76951e0e5850504b16c9164b97a23883488269d6/spkrepo/views/api.py#L260-L263

Diaoul commented 1 year ago

Yes there must've been a failure at a time that did create the version but not the appropriate directories, causing subsequent ones to fail. Maybe the best move is to delete the version and re-upload everything?

mreid-tt commented 1 year ago

Yes there must've been a failure at a time that did create the version but not the appropriate directories, causing subsequent ones to fail. Maybe the best move is to delete the version and re-upload everything?

Thanks for your feedback. Before deleting the version can you check:

  1. The application log to see if it is reporting the same FileNotFoundError: [Errno 2] No such file or directory errors
  2. The file system to determine if the path /data/prowlarr/6/ exists. If so, is anything in that directory?

If the checks above confirm the theory, then I'd request that prowlarr v6 be deleted. I don't think I have permissions to do that.

mreid-tt commented 1 year ago

hey @Diaoul, any feedback on the above?

In the meantime, would you be so kind as to delete this version:

[Screenshot, 2023-04-27 7:09 AM]

publicarray commented 1 year ago

@mreid-tt I have deleted lidarr v11 for you, please let me know of any others that you'd like to delete

mreid-tt commented 1 year ago

thanks @publicarray, I was hoping we could check the application logs and the file system for prowlarr to confirm the theory above. But if this is a hassle we can just delete the problematic version:

[Screenshot, 2023-04-27 7:52 AM]

publicarray commented 1 year ago

@mreid-tt the folder was never created...

publicarray@vps-b34822e4 /e/synocommunity> ls data/prowlarr
4/  5/
publicarray commented 1 year ago

I have deleted the broken version

publicarray commented 1 year ago

We had problems uploading before (due to load) but never like this (missing folders/files)...

mreid-tt commented 1 year ago

@mreid-tt the folder was never created...

publicarray@vps-b34822e4 /e/synocommunity> ls data/prowlarr
4/  5/

Thanks for the confirmation.

I have deleted the broken version

Great! I'll proceed with a fresh build and upload.

mreid-tt commented 1 year ago

We had problems uploading before (due to load) but never like this (missing folders/files)...

Hmm, okay. Perhaps a fix would be to adjust the code to verify that the folder was actually created (for both the package and version steps) before committing a new entry to the database.
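
A rough sketch of that idea, assuming the handler keeps its current shape (the helper name and its placement are hypothetical, not the actual spkrepo code):

import os

def ensure_directory(path):
    # Create the directory if needed and verify it exists before any DB commit.
    os.makedirs(path, exist_ok=True)
    if not os.path.isdir(path):
        raise RuntimeError("Expected directory was not created: %s" % path)

# Hypothetical placement inside the publish handler, before the new Package
# or Version rows are committed:
#   ensure_directory(os.path.join(data_path, package.name))
#   ensure_directory(os.path.join(data_path, package.name, str(version.version)))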

publicarray commented 1 year ago

My 2 cents, but maybe only upload one file at a time. IIRC Python/PostgreSQL are bad at handling multiple upload requests at the same time. This smells like a race condition.

mreid-tt commented 1 year ago

My 2 cents, but maybe only upload one file at a time. IIRC Python/PostgreSQL are bad at handling multiple upload requests at the same time. This smells like a race condition.

That makes sense but with a GitHub build how would we do that? All the archs build concurrently.

mreid-tt commented 1 year ago

I have deleted the broken version

Can you confirm this was done? I'm still seeing this:

[Screenshot, 2023-04-27 8:12 AM]

publicarray commented 1 year ago

Hm, maybe build them without publishing on GitHub and download the result, then manually create the upload request with httpie. I know it's a workaround, but it would also let us test whether, for example, uploading all packages at once causes this problem again.

publicarray commented 1 year ago

Maybe your page is cached.

Edit: sorry I'm not sure, but I deleted v6 again, and it is now gone.

mreid-tt commented 1 year ago

I think you may be correct. I did the new run and got this in one of the logs:

Package published successfully
{"package": "prowlarr", "version": "20230427-6", "firmware": "6.1-15047", "architectures": ["apollolake", "avoton", "braswell", "broadwell", "broadwellnk", "broadwellntbap", "bromolow", "cedarview", "denverton", "dockerx64", "geminilake", "grantley", "purley", "kvmx64", "v1000", "x86", "x86_64"]}

But then right after, I get the following in all the other archs:

ERROR: Failed to publish package - HTTP response code 500

We may have the same situation again. If you can, please remove prowlarr v6 once more, because I suspect we have re-created the same problem. Once you have, I will try a manual upload, one file at a time.

Edit: sorry I'm not sure, but I deleted v6 again, and it is now gone.

Okay, thanks. I'll retry, one at a time now.

publicarray commented 1 year ago

we got one :)

publicarray@vps-b34822e4 /e/synocommunity> ls -l data/prowlarr/6
total 55524
-rw-r--r-- 1 root root    44917 Apr 27 12:25  icon_256.png
-rw-r--r-- 1 root root     7211 Apr 27 12:25  icon_72.png
-rw-r--r-- 1 root root 56801280 Apr 27 12:25 'prowlarr.v6.f15047[apollolake-avoton-braswell-broadwell-broadwellnk-broadwellntbap-bromolow-cedarview-denverton-dockerx64-geminilake-grantley-purley-kvmx64-v1000-x86-x86_64].spk'
mreid-tt commented 1 year ago

Sending one at a time worked! All archs successful: https://github.com/mreid-tt/spksrc/actions/runs/4819585180

EDIT: I'll look into making a PR to verify that the folder has been created for both package and version. Running parallel processes does not seem to be an issue once the first one correctly creates the folder.

mreid-tt commented 1 year ago

Hi @Diaoul, now that we have confirmed that the issue is rooted in the version folder not being created, it appears that this code block is not executed:

https://github.com/SynoCommunity/spkrepo/blob/66c4ac66af3b0e8e9adc1c9ef92df3b599baab73/spkrepo/views/api.py#L260-L263

If this is the case, how could we make the function more resilient? Should the os.mkdir() be put into a try block to catch any exceptions? Should we have a line right afterwards to verify that the path exists? My worry is that the server believes the command succeeded; this is further suggested by the log, which reports the following back in GitHub:

  Package published successfully
  {"package": "prowlarr", "version": "20230426-6", "firmware": "6.1-15047", "architectures": ["evansport"]}

Why might this be the case? Are the filesystem actions for directory creation and build.save() being executed asynchronously and never completing, even though the function believes they were executed? For the sequence of events, when exactly is the new version added to the database? If it is not asynchronous, are multiple function calls trying to create the same version when GitHub runs parallel build operations? If so, why wouldn't at least one function call complete? That way, even if the other parallel calls fail (or do nothing because the folder exists), the successful one should create the folder, shouldn't it?

Interested in your thoughts before I create a PR so that the solution is effective.

publicarray commented 1 year ago

My guess is that if a version exists in the DB but the transaction isn't finished yet, and another one is created at the same time, we get this condition. I'm not sure why at least one mkdir command isn't executed, though. Maybe both transactions exit early because a version already exists?

publicarray commented 1 year ago

I also noted that if there is an exception, files are removed. So what happens when a file is created and, at the same time, removed by a different request?

mreid-tt commented 1 year ago

I also noted that if there is an exception, files are removed. So what happens when a file is created and, at the same time, removed by a different request?

Ahh! That's a good point, you are referring to this part of the code:

https://github.com/SynoCommunity/spkrepo/blob/66c4ac66af3b0e8e9adc1c9ef92df3b599baab73/spkrepo/views/api.py#L265-L272

So, it's possible the folder was created just fine, but there was some sort of exception (perhaps in creating the icons or saving the build), and then the entire version folder was removed. If that is the case, the version should also be removed from the database (or not added, if that happens later). I need to better understand the database interactions to confirm.

publicarray commented 1 year ago

Yeah. Imagine two requests. Neither sees a version, so both try to create one in the DB. The DB will reject one as it already exists (consistency in the DB). So one succeeds and one fails (raising an exception), which removes the folder, while the DB is updated to have a version. Just a theory.

Filtering the exception for 'version already exists' and not removing files should prevent this.
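
A rough sketch of that suggestion, assuming the uniqueness is enforced by a database constraint on the Version model (the function and its arguments are illustrative, not the actual spkrepo handler):

import os
import shutil

from sqlalchemy.exc import IntegrityError

def publish_version(db, version, version_path):
    # Illustrative sketch only: tolerate "version already exists" and skip
    # the cleanup in that case.
    try:
        db.session.add(version)
        db.session.commit()
    except IntegrityError:
        # Another worker inserted this version first; reuse it and, crucially,
        # do NOT remove any files it may already have created.
        db.session.rollback()

    try:
        os.makedirs(version_path, exist_ok=True)
    except OSError:
        # Clean up only on a genuine filesystem failure.
        shutil.rmtree(version_path, ignore_errors=True)
        raise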

mreid-tt commented 1 year ago

So I shared our problem with ChatGPT and it suggested some approaches which I have included in a proposed fix https://github.com/SynoCommunity/spkrepo/pull/100. I would appreciate a review in that PR.

Diaoul commented 1 year ago

Thinking a bit about this, I am not sure why we need to clean up the directory structure :shrug: We could very well make sure the path always exists. However, this could hide another consistency problem (like in #100)

publicarray commented 1 year ago

I think I agree with @Diaoul here; using locks just overcomplicates things. I think just checking whether there is an exception like "version already exists" and continuing anyway, rather than cleaning up the folders, is enough in this case. I'd like to mention that what I think is happening is a hypothesis and not yet proven.

There could have been an issue with the Redis or database connection. Unfortunately, there aren't timestamps in the app logs. (This is unlikely to be the cause, since the version exists in the database.)

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not connect to server: Connection refused
    Is the server running on host "db" (172.18.0.3) and accepting
    TCP/IP connections on port 5432?

redis.exceptions.ConnectionError: Error while reading from redis:6379 : (104, 'Connection reset by peer')
Exception possibly due to cache backend.

Edit: never mind, I can get the timestamps from Docker.

mreid-tt commented 1 year ago

Thanks for the additional perspectives. Taking a step back, I believe the issue is really about the filesystem and not the database. Re-looking at the code we have these operations to create folders:

https://github.com/SynoCommunity/spkrepo/blob/66c4ac66af3b0e8e9adc1c9ef92df3b599baab73/spkrepo/views/api.py#L259

My understanding is that if the folder already exists, os.mkdir() will raise a FileExistsError. This is another possible reason for the try block failing, and a more likely scenario when multiple workers add a new build. I've opened another PR with a far simpler fix for consideration in #101.
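
A quick, self-contained illustration of the difference (using exist_ok is one possible mitigation, not necessarily what #101 implements):

import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "prowlarr", "6")

os.makedirs(path)  # first worker creates the directory
try:
    os.mkdir(path)  # second worker: raises FileExistsError (errno 17)
except FileExistsError as e:
    print(e)

os.makedirs(path, exist_ok=True)  # tolerant variant: a no-op if it already exists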

mreid-tt commented 1 year ago

Thinking a bit about this, I am not sure why we need to clean up the directory structure 🤷 We could very well make sure the path always exists.

I actually like the elegance of cleaning up the directory structure on an exception so you maintain a clean filesystem. We could adjust the code to something like this:

try:
    data_path = current_app.config["DATA_PATH"]
    dir_path = os.path.join(data_path, package.name)
    if not os.path.exists(dir_path):
        os.mkdir(dir_path)

But I really believe the exception in the try block is more likely caused by the folder already existing, which can easily be mitigated within the existing code structure.

mreid-tt commented 1 year ago

@publicarray, looking back at the application logs you kindly provided, I believe this new theory is correct. In the logs we first see this error:

[2023-04-16 12:32:53 +0000] [7] [ERROR] Error handling request /packages
Traceback (most recent call last):
  File "/usr/src/app/spkrepo/views/api.py", line 261, in post
    os.mkdir(os.path.join(data_path, package.name, str(version.version)))
FileExistsError: [Errno 17] File exists: '/data/lidarr/11'

Immediately after that, we see these and similar errors:

[2023-04-16 12:33:40 +0000] [10] [ERROR] Error handling request /packages
Traceback (most recent call last):
  File "/usr/src/app/spkrepo/views/api.py", line 264, in post
    build.save(spk.stream)
  File "/usr/src/app/spkrepo/models.py", line 357, in save
    with io.open(
FileNotFoundError: [Errno 2] No such file or directory: '/data/lidarr/11/lidarr.v11.f42661[comcerto2k].spk'

I believe this supports the following theory (a minimal sketch follows the list):

  1. Two workers attempt to create a directory at '/data/lidarr/11'
  2. One succeeds and the other fails
  3. The failing one deletes the directory path
  4. Other workers saving builds fail because the path does not exist
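
A minimal sketch of that interleaving, with stand-in paths and deliberately sequential "workers" (the real failure involves concurrent gunicorn workers and the cleanup block linked above):

import os
import shutil
import tempfile

version_dir = os.path.join(tempfile.mkdtemp(), "lidarr", "11")

# Worker A wins the race and creates the version directory.
os.makedirs(version_dir)

# Worker B loses: os.mkdir() raises FileExistsError, and its exception
# handler then removes the directory that worker A just created.
try:
    os.mkdir(version_dir)
except OSError:
    shutil.rmtree(version_dir, ignore_errors=True)

# Workers C, D, ... now fail with FileNotFoundError when saving their builds.
print(os.path.exists(version_dir))  # False
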
Diaoul commented 1 year ago

I am not sure why we would end up publishing multiple packages at the same time, maybe we can fix that instead? The code is full of race conditions and absolutely not ready to handle parallel uploads of the same packages.

Also worth noting that it seems it's the first time it occurs in close to 8 years (I've lost count) of this code being in production. Unless we're confident the code will not break elsewhere when introducing a fix, I would be OK with a one-time procedure to manually fix this and document it somewhere :grimacing:

mreid-tt commented 1 year ago

I am not sure why we would end up publishing multiple packages at the same time, maybe we can fix that instead? The code is full of race conditions and absolutely not ready to handle parallel uploads of the same packages.

From my understanding, we have been doing this for over two years, based on the age of the wiki page Publish a package from Github. To my knowledge there isn't a way to tell the GitHub workflow not to run in parallel. As such, once we use it, there is the potential to send multiple builds at once.

Also worth noting that it seems it's the first time it occurs in close to 8 years (I've lost count) of this code being in production. Unless we're confident the code will not break elsewhere when introducing a fix, I would be OK with a one-time procedure to manually fix this and document it somewhere 😬

I understand the hesitation in making such a fundamental change but I'd like to share another perspective:

  1. I don't believe this is the first time this has happened. In the application logs shared with me, which had entries between 2023-02-19 and 2023-04-25, there was one other version directory missing: '/data/owncloud/9'. More likely, because the GitHub publish function did not check for errors (which it now does as part of spksrc#5701), the uploads appeared to be successful when they were not (see spksrc#5619 (comment))
  2. You are correct that there is a possibility of the code breaking elsewhere, but from a database perspective there are referential integrity constraints built into the Package and Version models, so a duplicate addition would cleanly fail. For the addition of the Build, there is a possibility of the object being linked to the incorrect Version object, and I can't think of a way to address this without locks
  3. As for the filesystem, if the folder already exists we should be fine with the proposed fix. If we try to save duplicate icons, then based on how the icon.save() statement is constructed, the existing ones should be overwritten, which should also be okay
  4. As for manually fixing it, the way I've seen it done is to delete the version in the admin panel and re-upload from scratch. While this can work, not all devs have admin permissions to do this. Additionally, it does not prevent it from happening again with another GitHub workflow (unless you manually re-run each failed job one at a time)
Diaoul commented 1 year ago

Thanks for the additional context, it makes sense to try with this then :+1:

publicarray commented 1 year ago

For some history: the first CI used Travis, and CircleCI processed one arch at a time. It was also expected that every dev built on their own devices before publishing, so that they would test the build locally as well (that hasn't changed), but with GitHub Actions it became easier and we implemented the publishing part as well. So if you build locally and test one or two architectures, you don't need to build all of them to publish, lowering the barrier and the time spent building a bit.