ScilifelabDataCentre / dds_web

A cloud-based system for the delivery of data from SciLifeLab Facilities to their users (e.g. research group).
Other
8 stars 8 forks source link

Fix bucket successfully deleted from S3 but DB update fails #1524

Closed rv0lt closed 4 months ago

rv0lt commented 6 months ago

Read this before submitting the PR

  1. Always create a Draft PR first
  2. Go through sections 1-5 below, fill them in and check all the boxes
  3. Make sure that the branch is updated; if there's an "Update branch" button at the bottom of the PR, rebase or update branch.
  4. When all boxes are checked, information is filled in, and the branch is updated: mark as Ready For Review and tag reviewers (top right)
  5. Once there is a submitted review, implement the suggestions (if reasonable, otherwise discuss) and request an new review.

If there is a field which you are unsure about, enter the edit mode of this description or go to the PR template; There are invisible comments providing descriptions which may be of help.

1. Description / Summary

Fix raising of error when bucket successfully deleted from S3 but DB update fails

--- To reproduce ---

Add a few lines for the application to sleep

Release from the client project_1 and update the database to change the deadline. A testbucket has to exit in Minio

UPDATE projectstatuses SET deadline = CURDATE() - INTERVAL 1 DAY where project_id = 1;

Expire the project from backend container

flask set-available-to-expired

And update the deadline again

1. From the flask command (cronjob)

Add a second waiting time, after the else. In order to be able to turn on the db again and continue for the next projects.

Run flask set-expired-to-archived

And during the 15 seconds stop. Stop the mariadb container, the error got must be a

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')

i.e. The error throw if not from a request

2. From a request

Add another time stop in 774. After the line that catches the sqlerror and before creating the DeletionError

            if flask.request:
                flask.current_app.logger.debug("Sleeping for 15 seconds.")
                time.sleep(15)
                raise DeletionError(

That is, because, if the db is still stopped when trying to throw the exception. There will be a different error thown.

Remember to recreate the testbucket in Minio

From the client, run the command to expire the project, with the db container up.

During the first 15-second wait, stop the container to force the exception.

During the second 15 seconds wait, start the container again to be able to communicate with the client and return the error.

The command will fail in the client, and in the logs it can be seen that the DeletionError was generated.

2. Jira task / GitHub issue

Link to the github issue or add the Jira task ID here.

3. Type of change

What type of change(s) does the PR contain?

Check the relevant boxes below. For an explanation of the different sections, enter edit mode of this PR description template.

4. Additional information

5. Actions / Scans

Check the boxes when the specified checks have passed.

For information on what the different checks do and how to fix it if they're failing, enter edit mode of this description or go to the PR template.

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 91.84%. Comparing base (8006196) to head (677c5ee). Report is 8 commits behind head on dev.

:exclamation: Current head 677c5ee differs from pull request most recent head f0d1d85

Please upload reports for the commit f0d1d85 to get more accurate results.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## dev #1524 +/- ## ========================================== - Coverage 92.36% 91.84% -0.52% ========================================== Files 29 29 Lines 4804 4807 +3 ========================================== - Hits 4437 4415 -22 - Misses 367 392 +25 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

valyo commented 6 months ago

@rv0lt : do you have a good way to manually test this? If yes, please, add it to the description

rv0lt commented 6 months ago

@rv0lt : do you have a good way to manually test this? If yes, please, add it to the description

Manually? The thing I have been doing is to exec into the flask shell and see what was going on. You mean that?

valyo commented 6 months ago

Yes, but how to reproduce the situation when the error appears

rv0lt commented 6 months ago

code

What I have been trying to do is to catch the exception for the Deletion. Which is turned into the RuntimeError due to failed initialization (it can be seen down in the traceback). And try to grep the real error from the traceback.

Then, log it out adding the message of the real error. So it is easy to find (the first line of [ERROR] DeletionError: Project bucket contents were deleted, but they were not deleted from the database. Please contact SciLifeLab Data Centre.

So, it ends up logging the error but instead of the generic Runtime, it tries to be a bit more specific

Ngl, this whole task is a bit mind scratching honestly

I am getting this traceback:

[2024-04-10 13:15:58,057] commands [ERROR] DeletionError: Project bucket contents were deleted, but they were not deleted from the database. Please contact SciLifeLab Data Centre.
 Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/code/dds_web/commands.py", line 688, in set_expired_to_archived
    new_status_row, delete_message = archive.archive_project(
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/api/project.py", line 492, in archive_project
    RemoveContents().delete_project_contents(project=project, delete_bucket=True)
  File "/code/dds_web/api/project.py", line 773, in delete_project_contents
    raise DeletionError(
          ^^^^^^^^^^^^^^
  File "/code/dds_web/errors.py", line 200, in __init__
    super().__init__((alt_message or "Deletion failed.") if not pass_message else message)
  File "/code/dds_web/errors.py", line 39, in __init__
    resource=flask.request.path or "not applicable",
             ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 316, in __get__
    obj = instance._get_current_object()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/werkzeug/local.py", line 513, in _get_current_object
    raise RuntimeError(unbound_message) from None
RuntimeError: Working outside of request context.

This typically means that you attempted to use functionality that needed
an active HTTP request. Consult the documentation on testing for
information about how to avoid this problem.
Traceback (most recent call last):
  File "/usr/local/bin/flask", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 1050, in main
    cli.main()
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/flask/cli.py", line 357, in decorator
    return __ctx.invoke(f, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/code/dds_web/commands.py", line 713, in set_expired_to_archived
    project.project_statuses.append(new_status_row)
                                    ^^^^^^^^^^^^^^
UnboundLocalError: cannot access local variable 'new_status_row' where it is not associated with a value

but I still think that this PR is a different taks than the one I (at least) have in mind

valyo commented 6 months ago

I agree it is confusing, and I seem to lack understanding of some of the things involved (on flask and python level). But my main question is how do you trigger this error in your dev environment, so that you can get an exception in the first place

rv0lt commented 6 months ago

I agree it is confusing, and I seem to lack understanding of some of the things involved (on flask and python level). But my main question is how do you trigger this error in your dev environment, so that you can get an exception in the first place

Yes, I am trying to write some instructions. Until now, I was just running "flask shell" in the backend container (it just opens a python shell but that can be conected to the apliocation), and running the code lines one by one. I am trying to think of a way to make it more reproducible than just doing that

valyo commented 6 months ago

and how do you simulate the DB error?

rv0lt commented 6 months ago

and how do you simulate the DB error?

Ah, yes, same. Running one by one the lines of the function with the db container off. So writting some like this on the flask shell

try:
    try:
     model.query....
    catch **operationError**:
     raise **DeletionError**
catch **RuntimeError**:

Veeeeeery messy I know. That is why I am trying to find a way to document/test this better

rv0lt commented 6 months ago

So writting some like this on the flask shell

Just FYI. Reason why we cannot call just RemoveContents for example with the db off. It is because of decorators. They will detect it before executing the function logic. Probably mocking can be used here, will try something like that in a script

valyo commented 6 months ago

@rv0lt : here is the traceback I had in mind when we initiated this task: Traceback (most recent call last): File "/usr/local/bin/flask", line 8, in <module> sys.exit(main()) File "/usr/local/lib/python3.10/site-packages/flask/cli.py", line 1050, in main cli.main() File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1659, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func return f(get_current_context(), *args, **kwargs) File "/usr/local/lib/python3.10/site-packages/flask/cli.py", line 357, in decorator return __ctx.invoke(f, *args, **kwargs) File "/usr/local/lib/python3.10/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/code/dds_web/commands.py", line 681, in set_expired_to_archived new_status_row, delete_message = archive.archive_project( File "/code/dds_web/api/project.py", line 492, in archive_project RemoveContents().delete_project_contents(project=project, delete_bucket=True) File "/code/dds_web/api/project.py", line 778, in delete_project_contents if isinstance(err, sqlalchemy.exc.OperationalError)UnboundLocalError: local variable 'err' referenced before assignment

If I understand this correctly, the DeletionError was raised and if we had sqlerr instead of err passed to isinstance, then we should have got the proper message printed. EDIT I managed to trigger 'sqlalchemy.exc.OperationalError' (by changing the query at line 754 to something that will produce db error: 'models.File.query.filter(models.File.unit_id == project.id).delete()') and even with the correct var sqlerr inside the 'DeletionError', we don't get the expected exception raising because of what @rv0lt was explaining all the time: 'DeletionError' inherits from 'LoggedHTTPException', which leads to 'RuntimeError'.

rv0lt commented 6 months ago

Added an explanation to recreate the situation

valyo commented 6 months ago

It seems this does the trick with RuntimeError: Working outside of request context. in a nice way.

Correct me if I am wrong, but we also want the flask command to register/print the error (what's now alt_message) and to continue running, don't we?

rv0lt commented 6 months ago

It seems this does the trick with RuntimeError: Working outside of request context. in a nice way.

Correct me if I am wrong, but we also want the flask command to register/print the error (what's now alt_message) and to continue running, don't we?

I see, the alt_message is more difficult, because this issue can be caused by different operations failures. And thus, we go again to the solution of having to inspect the traceback in running time.

The continue running, true. This will be caught by the outermost except, it can be solved by moving the archive.archive_project to the try block just below, this one continues the loop afterwards

valyo commented 6 months ago

hm, why should logging the error be difficult. Maybe I am naive but shouldn't something like this work:

        ) as sqlerr:
            alt_message=(
                "Project bucket contents were deleted, but they were not deleted from the "
                "database. Please contact SciLifeLab Data Centre."
                + (
                    "Database malfunction."
                    if isinstance(sqlerr, sqlalchemy.exc.OperationalError)
                    else "."
                )
            )
            if flask.request:
                raise DeletionError(
                    project=project.public_id,
                    message=str(sqlerr),
                    alt_message=alt_message,
                ) from sqlerr
            else:
                flask.current_app.logger.error(f"Error: {alt_message}")
rv0lt commented 6 months ago

hm, why should logging the error be difficult. Maybe I am naive but shouldn't something like this work:

        ) as sqlerr:
            alt_message=(
                "Project bucket contents were deleted, but they were not deleted from the "
                "database. Please contact SciLifeLab Data Centre."
                + (
                    "Database malfunction."
                    if isinstance(sqlerr, sqlalchemy.exc.OperationalError)
                    else "."
                )
            )
            if flask.request:
                raise DeletionError(
                    project=project.public_id,
                    message=str(sqlerr),
                    alt_message=alt_message,
                ) from sqlerr
            else:
                flask.current_app.logger.error(f"Error: {alt_message}")

Ah yes, that could work. Only thing that I am thinking is that maybe that generates two alerts? This logger and then the one by the catch in the upper function? Worth researching anyway

valyo commented 6 months ago

yes, I think I saw an error from commands.py too, when I made a quick try. But this shouldn't be a big issue!?!, as long as we get the "Project bucket contents were deleted, but they were not deleted from the ... blah blah" logged and the command continues. May be we should have short a discussion, just to be sure that I am not suggesting something completely stupid

rv0lt commented 6 months ago

Apart from the suggested change, I have a question: in commands.py, set_expired_to_archived, inside the main try...except block (lines 657 to 720) there are two more try...except blocks, one for calling archive_project and one for db.session.commit() Then, in project.py, archive_project calls delete_project_contents inside a try...except block, which in turn has a try...except Maybe I am just ignorant, but is this really necessary?

Yes, I was still checking the changes. That's why I didn't re-requested the review yet! I added the second try-catch to see if it can be avoided to register two different errors (thus generating two alerts)