DavidoTek / ProtonUp-Qt

Install and manage GE-Proton, Luxtorpeda & more for Steam and Wine-GE & more for Lutris with this graphical user interface.
https://davidotek.github.io/protonup-qt
GNU General Public License v3.0

DXVK: Update dxvk-async to use Ph42oN GitLab repo #302

Closed · sonic2kk closed this 10 months ago

sonic2kk commented 11 months ago

(NOTE: This PR is still in draft and not quite ready for testing just yet while I work out a couple of remaining trouble areas with the archive we download from GitLab. The description of this PR should be pretty final, but it is mostly written up so I can document the work I have done thus far and amend as I go :slightly_smiling_face:)

PR should be in a working state and ready for review now :-)


Implements #243.

Overview

This PR implements downloading the new DXVK Async project by Ph42oN from GitLab, replacing the dxvk-async project on GitHub by Sporif, which is incompatible with newer DXVK versions.


Background

The dxvk-async project on GitHub that we currently point to no longer sees active development. To my understanding, a change was introduced in upstream DXVK around v2.0 that broke dxvk-async, and for one reason or another, it will not be updated beyond DXVK v2.0.

Someone else picked up the mantle and made a dxvk-async compatible with the new GPL feature in newer DXVK releases (Graphics Pipeline Library, not to be confused with GNU GPL :wink:), and it was requested that ProtonUp-Qt point to this newer project instead. However, this project is hosted on GitLab, and much of the logic ProtonUp-Qt uses to get releases and assets, and to build the associated data needed to download a release, very much expects a GitHub API response. A significant refactor would be required.

The Refactor

The main reason I put off looking at this for so long: the refactor. First, I needed to look into the GitLab API response and what sort of structure it comes back with. But that was the easy part; the problem was I didn't want to simply litter the ctmod with "if gitlab -> do this, if github -> do that" and then have to do that for any future ctmod that might use GitLab. I wanted to try and make something a bit more re-usable.

I originally considered starting off by creating a base ctmod class that could handle this, but this was... going to be a lot of work, and I have tried figuring it out quite a few times locally. So I bit the bullet and thought: well, let's just try to do this with some utility functions, we could always move them into a base ctmod class later :-)

So that's exactly what I did: I created a number of re-usable util functions that abstract away the various components of the GitHub vs GitLab API calls. We can call one function to get all of the tags, for example, giving it the URL and some other information, and that function handles whether we're using GitHub or GitLab and parses the information we need. I also tried to make them as generic as possible, which is why get_download_url_from_asset takes the crazy asset_condition: Optional[Callable] -- each ctmod will need its own filter rules for assets, so I figured the ctmod should pass the condition as a lambda. This gets passed through from the initial call, so it can be used anywhere it's needed, such as if we need to filter on something that isn't the asset URL.
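
To illustrate the shape of this, here's a minimal sketch (the signature and the asset dictionary keys here are assumptions, not the exact code in this PR):

from typing import Callable, Optional

def get_download_url_from_asset(assets: list, asset_condition: Optional[Callable] = None) -> Optional[str]:
    """ Return the download URL of the first asset that passes the ctmod's filter, if any. """
    for asset in assets:
        # GitHub assets expose 'browser_download_url'; GitLab asset links expose 'url'
        url = asset.get('browser_download_url') or asset.get('url', '')
        if asset_condition is None or asset_condition(asset):
            return url
    return None

## Hypothetical ctmod usage: only accept tar.gz archives
## download_url = get_download_url_from_asset(assets, lambda a: a.get('name', '').endswith('.tar.gz'))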

And because these functions can handle GitHub and GitLab, it should be possible to drop them into all ctmods, and in future, into a base ctmod class. I kept this kind of re-use and future development in mind when writing these functions. There may still be some extending needed in future, but I tried to keep the functions modular enough to the point where this would be possible. For example, if we needed to introduce a filter lambda parameter elsewhere.

This refactor was made with GitHub and GitLab in mind, as well as any future hosts we might need to call to, such as BitBucket, so we should be able to extend the logic semi-easily. One set of functions to rule them all :-)

I still don't know if these methods are architected in the best way. We're passing release_url quite a lot, for example. So, as usual, all feedback on my Python scribbles is more than welcome.

Existing DXVK Functionality

This should not break existing DXVK functionality, and should hopefully be generic enough that if other DXVK ctmods are ever added or combined (such as if D8VK is combined and uses releases instead of the current broken nightlies), they should also be compatible.

One change I made that may potentially have an impact: Content-Length does not come back in the headers from GitLab, and one alternative I saw was to use len(request.content). It seems to be working for these ctmods, but I'm unsure if it has any potential Unforeseen Consequences. I wanted to call out this change just in case, even though it is currently working :-)
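
For reference, the fallback amounts to something like this (a sketch; the helper name is made up):

def get_download_size(response) -> int:
    """ Prefer the Content-Length header, falling back to the size of the fetched body. """
    content_length = response.headers.get('Content-Length')
    if content_length is not None:
        return int(content_length)
    # GitLab omits Content-Length, so measure the downloaded content instead
    # Note this requires the full body to be available (i.e. not a partially consumed stream)
    return len(response.content)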

Considerations

Should we make this a new ctmod? It would be straightforward, since the refactor uses generic methods called from the one DXVK ctmod. We would simply need to make dxvk-gplasync a separate ctmod and restore the URLs I changed in this PR. Though, I worry that a change like this would impact

In an entirely separate vein, should we remove the older DXVK Async altogether and rename this to be dxvk-gplasync? Or would the description update cover this?



sonic2kk commented 11 months ago

Fixed the archive issue that was present when I opened this PR (turns out I was parsing for the wrong archive types :sweat_smile:) and removed some unused code, so this should now be ready for testing and review. This PR will now download both DXVK releases from GitHub (no change to existing functionality) and DXVK Async releases from GitLab (this PR updates the ctmod to use the new dxvk-gplasync repo on GitLab, with a set of common functions that allows re-using the same logic to get the information we need).

Type hinting issues have also been resolved :-) Thanks!

sonic2kk commented 11 months ago

Hm, one area I forgot about while doing a final-final pass to make sure the way I broke the functions up made sense: I didn't implement a rate-limit check for GitLab yet. There are also probably a few other places we could do a GitHub rate-limit check. Those are two possible areas for improvement with this PR that I've identified, so any more are welcome :-) And please do ask for clarification if needed. Managing the best way to break down how we get the information we need from these APIs in a flexible way was definitely a challenge, and anything I can do to explain it, or comment/rewrite anything to make it more straightforward: all feedback is appreciated.

sonic2kk commented 11 months ago

Something I also just realized is that GitLab can be self-hosted, so it is possible that in the future we may want to fetch from a GitLab instance that isn't gitlab.com. One solution in future (or now, if desired) would be to have the GITLAB_API constant be a list, and we can just append the domains for GitLab instances we need to call out to. It would need a slight refactor; we could probably do something like:

# constants.py
KNOWN_GITLAB_INSTANCES = [
    'https://gitlab.com/api',
    'https://madeup.com/api',
    'https://gitlab.steamos.cloud/api'
]

# util.py
## New method to check if URL is GitLab
def is_gitlab_instance(url: str) -> bool:
    """
    Check if a full API endpoint URL is in the list of known GitLab instances.
    Return Type: bool
    """

    # The check is this way around because we expect `url` to be a "full" URL that will always at least contain the base instance API from our constants
    # i.e., url='https://gitlab.com/api/v4/projects/43488626/releases'
    #       instance='https://gitlab.com/api/'
    return any(instance in url for instance in KNOWN_GITLAB_INSTANCES)

## Usage i.e. in fetch_project_releases
## elif is_gitlab_instance(releases_url): print('do things')

Just some potential usage off the top of my head; we could probably feasibly implement this now, though we would only be using one GitLab instance.

We would also need to do something similar for the rate-limit check, as this is configurable per-instance, and the default could also change. Right now, the default seemed to be the text noted in this GitLab PR, but this seems to have since changed, as per the docs, to "Retry later". But since each instance can change it, newer GitLab releases could potentially change it again, and self-hosted instances could still be using the older default or a custom message, we would need a list check like this.

I still haven't figured out what the actual rate limit response message JSON looks like, I couldn't find it documented anywhere, so I may have to hit the rate-limit before I can find out what it looks like. We can probably reasonably assume the format will be the same for each GitLab instance. It could theoretically be changed since the code can be changed however the owner of the instance wants, but I think for now we can assume the API responses will be structured the same for this rate-limit message and for other API calls.



There's also the concern that various GitLab instances may be running older releases and may have different API usage or responses, which is trickier to get around. We could make some wrapper functions to check the GitLab API version and modify the strings we look for accordingly, but I think we can just cross this bridge when we come to it. I'm not sure how much the GitLab API usage changes between versions; new endpoints may be added, but the endpoints we're using seem to be pretty fixed and don't change much.


One final consideration I have is a way to pass a GitLab API token as an environment variable, the same way we do for GitHub.

sonic2kk commented 11 months ago

I have implemented an initial GitLab API rate-limit check. In general, GitLab seems a lot more forgiving than GitHub when it comes to rate limits just from browsing the GitLab docs. I have not been able to rate-limit myself to check the response, so for now, I have simply gone with checking on the message key from the JSON response. We have a list of what I believe are known baseline API rate limit response messages from GitLab.

In our API method we check if message has this rate-limit string, similar to what we do for GitHub. However, it may be better to check on the status code, or if nothing else, check on the status code as well as the message. I'm not sure about GitHub, but GitLab returns 429 as the status code. I don't think this is configurable for GitLab from what I have seen, but if it is, that is a potential reason not to check on the status code alone. We could do an or check though.

We need to get JSON back though from this rate-limit wrapper function. So to do that, we could either:

  1. Change the functions to take in the requests.Response object, and then the function could parse the status code and JSON from this, and return the json. This should be feasible for this specific case where we're calling this inside of a ctmod. However, if this is not feasible for other usages of the rate-limit check functions (i.e. if they have to or can only take the JSON as they're called without access to the original response object), then...
  2. We could pass the status code to these methods. This may still be tricky if we don't have access to the original requests.Response object but we could filter it down through and it may be simpler than filtering through the whole Response when we only need the status code to check.

If either of these would require a significant refactor, though, we can keep the simple string check. A status code check just seems more "bullet-proof" to me. And if GitHub and GitLab each return different status codes (i.e. GitHub returns 401 or 403, whereas GitLab seems to return a non-configurable 429), we could store these in constants.py. It could also open the door to more flexible response checking, such as checking for a 500 response and inferring that GitHub is down instead of us being rate-limited.
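
As a rough sketch of what that hybrid check could look like (constant names assumed, not the code currently in this PR):

# constants.py
GITLAB_RATE_LIMIT_STATUS = 429
GITLAB_RATE_LIMIT_MESSAGES = ['Retry later']

# util.py
def is_gitlab_rate_limited(response) -> bool:
    """ Hybrid rate-limit check: status code first, then the configurable message key """
    if response.status_code == GITLAB_RATE_LIMIT_STATUS:
        return True
    try:
        message = response.json().get('message', '')
    except ValueError:  # response body was not JSON
        return False
    return any(m in message for m in GITLAB_RATE_LIMIT_MESSAGES)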


I also pushed a change to check on a list of GitLab instance URLs, basically implementing it how I described in my code snippet above with is_gitlab_instance. This seems to be working well in testing. It works out of the box already, and hardcoding the DXVK Async URL to an invalid GitLab URL confirmed that the util method correctly returns False.

Right now we only have the one URL, and I guess there's a chance of a performance penalty here since we're running this check on a list with any multiple times (at least 3 before downloading), but it didn't add any noticeable slowdown on my PC, so it's probably negligible. I guess it would be possible to do some length check and compare against a default gitlab.com, but I dunno, that seemed really overkill to me, and the flexibility offered here of being able to extend this list to other instances in the future is, in my mind, worth what is probably at most a fraction of a fraction of a second more overhead :-)

DavidoTek commented 11 months ago

Thanks! :tada: Looks good at a first glance. I will do an in-depth review next week.

In an entirely separate vein, should we remove the older DXVK Async altogether and rename this to be dxvk-gplasync? Or would the description update cover this?

Good thought. Renaming the ctmod might make sense, as it is a different thing after all, even though it pursues the same purpose. I'm not sure what we should do about the old ctmod (for Sporif's dxvk-async). We can simply remove it, or make it available only in advanced mode. Is there any advantage to using the old version?

download both DXVK releases from GitHub (no change to existing functionality) and DXVK Async releases from GitLab

Ah, that's good.

managing the best way to break down how to get the information we need from these APIs in a flexible way was definitely a challenge, and anything I can do to explain it or comment/rewrite anything to make it more straightforward, all feedback is appreciated

The way you did it looks quite logical. I will comment when I review the code.

One final consideration I have is a way to pass a GitLab API token as an environment variable, the same way we do for GitHub.

Does it work the same as for GitHub, by adding an Authorization field to the header? We could just create multiple requests sessions or modify the header inside the ctmod. Though I think it might be worth reworking the logic completely, as we are currently passing the mainwindow object and reading the requests session from it, thus having high coupling between the modules. That is not ideal.

but GitLab returns 429 as the status code

That seems like a solid solution but, as you pointed out, will only work if this is consistent across all GitLab instances. We can do a hybrid solution as you have suggested in case we are not sure about that status code.

We need to get JSON back though from this rate-limit wrapper function. So to do that, we could either:

I think passing the requests.Response object will give us more flexibility in the future.

I also pushed a change to check on a list of GitLab instance URLs, basically implementing it how I described in my code snippet above with is_gitlab_instance.

That seems like a good solution. What we could also do, is to include the API URL inside each ctmod like this:

LOCAL_GITLAB_API = 'https://madeup.com/api'
def is_gitlab_instance(url: str) -> bool:
    return any(instance in url for instance in [GITLAB_API, LOCAL_GITLAB_API])

I guess there's a chance of a performance penalty here since we're using this check on a list with any multiple times (at least 3 before downloading)

I don't think it is that big of a deal, especially if we have only a few elements.

sonic2kk commented 11 months ago

Does it work the same as for GitHub, by adding an Authorization field to the header?

Yes, it seems like the API token is simply passed as Authorization in the headers. However, it's a little different from GitHub, and there are a few ways to go about it. Here are the GitLab docs for reference, but basically, from what I can gather:

The docs have separate sections for using OAuth tokens, which we probably aren't going to use, and Personal Access Tokens generated on GitLab (here's the docs link). The usage mentioned above is taken from the access tokens section of the docs. I have not yet tested whether this works, but it should hopefully be feasible to update the token headers like this?

# Is it even valid to add two authorization headers like this...?
# Or will we need two requests sessions, one for GitHub and one for GitLab?
if github_token := os.getenv('PUPGUI_GHA_TOKEN'):
    self.rs.headers.update({'Authorization': f'token {github_token}'})
if gitlab_token := os.getenv('PUPGUI_GLA_TOKEN'):  # Do we want to give this a more distinct name? :-)
    self.rs.headers.update({'Authorization': f'Bearer {gitlab_token}'})

I will try to do some tests with this soon. I think GitLab is generally a lot more forgiving when it comes to API requests, though instances can configure the rate limits themselves, so while we will hopefully run into rate-limiting less with GitLab, it's still something to account for. I have personally yet to be rate-limited by GitLab, whereas it's easier to run into with GitHub in general, I have found. In case an instance is stricter, or if GitLab.com changes its request limits, it's good for us to account for this case.


I should note as well that GitLab access tokens are instance-specific. So an access token for GitLab.com would not be valid on GNOME GitLab or KDE Invent for example. The user can access the GitLab page for a project using the info button on the install ctmod dialog, so they should be able to use this to find out what instance they need to sign up to and generate an access token for, but I wanted to mention this nonetheless.

sonic2kk commented 11 months ago

Well, good news and bad news.

The good news is that passing Authorization: Bearer <gitlab_personal_access_token> works when testing with curl, so this approach to the API is fine. Passing an invalid token produces a 401.

The bad news is that if you only pass the GitLab token, GitHub will try to take that for authorization, and it'll fail. Likewise, passing a GitHub token will cause GitLab to fail.

I'm not sure how we'd want to handle this. Having two requests sessions seems good on the surface, but if we ever need to extend to more than just GitHub and GitLab, it would get messy. I'm not sure of the best way to go about handling this for now :sweat_smile:

DavidoTek commented 11 months ago

The good news is that passing Authorization: Bearer works when testing with curl, so this approach to the API is fine. Passing an invalid token produces a 401.

Okay, that is good.

The bad news is that if you only pass the GitLab token, GitHub will try to take that for authorization, and it'll fail. Likewise, passing a GitHub token will cause GitLab to fail.

Hmm, ideally we would handle this inside the ctmod then. I think there are two alternatives:

  1. Passing all available tokens to the ctmod, e.g., {'github': '...', 'gitlab': '...', ...}. The simplest way to do this would be to simply add it to the mainwindow object and then access:
    # pupgui2.py, __init__
    # remove the self.rs part
    self.web_access_tokens = {
       'github': os.getenv('PUPGUI_GHA_TOKEN'),
       'gitlab': os.getenv('PUPGUI_GLA_TOKEN'),
    }
    # ctmod
    self.rs = requests.Session()
    if token := main_window.web_access_tokens.get('gitlab'):
        self.rs.headers.update({'Authorization': f'Bearer {token}'})
  2. Reading the environment variables directly inside the ctmod using os.getenv

I'd personally prefer the first one.

sonic2kk commented 11 months ago

I'm looking into the first approach, making some headway and I'll push something up soon hopefully :-)

sonic2kk commented 11 months ago

So I've figured out a solution pretty similar to what you outlined. Basically, ctmods will now take request_headers as an optional parameter and set their self.rs to use that as the headers.

In __init__, we then build authorization into the headers, if it's not in the request_headers given to the ctmod already and if there's a token available from the environment variables. Since each ctmod will only have one project URL, we default to using the token relevant to that CT_URL. In other words, we can assume a ctmod with a CT_URL pointing to GitHub will use the GitHub token, if given. One ctmod can't have multiple project links; to do that, it has to be subclassed, where we can override the request_headers parameter when calling super().

Here's a snippet of the code from the DXVK ctmod I have locally:

def __init__(self, main_window = None, request_headers = None):

    # ...

    # Avoid a shared mutable default argument, since we modify the headers below
    request_headers = request_headers or {}

    # If we didn't give any custom authorization in our request_headers parameter (or if we didn't pass any to begin with),
    # check if we should create our own authorization header using the default GitHub access token
    # This will also preserve any other custom headers, which may not include authorization, so we can insert the auth token
    # from the environment into the headers if given, even if not specified in request_headers
    if 'Authorization' not in request_headers:
        if main_window and main_window.web_access_tokens.get('github', None):
            request_headers['Authorization'] = f'token {main_window.web_access_tokens.get("github", None)}'
            print('Updated request headers with GitHub token from environment')  # avoid printing the token itself

    self.rs.headers.update(request_headers)

This will use the GitHub token if the request_headers doesn't pass any, and normally it won't, except when it's subclassed and we call super, like in the DXVK Async ctmod:

def __init__(self, main_window = None, request_headers = None):
    # Same mutable-default fix as the parent ctmod
    request_headers = request_headers or {}
    if main_window and main_window.web_access_tokens.get('gitlab', None) and 'Authorization' not in request_headers:
        # The header value is just 'Bearer <token>'; the 'Authorization: ' prefix belongs in the header name
        request_headers['Authorization'] = f'Bearer {main_window.web_access_tokens.get("gitlab", None)}'

    super().__init__(main_window, request_headers)

This does a very similar check as the DXVK ctmod, except it's checking for GitLab. And we can do this because the CT_URL is pointing to GitLab for this ctmod, so we can assume that's the token we should use.

The idea here is that each ctmod is responsible for setting its authorization headers accordingly internally, and also that subclasses can override this as needed.


It's safe to do self.rs.headers.update({}) by the way, empty headers are fine and everything still works as expected in my tests :-)


This is working, mostly. There are two linked issues. The first is minor: this building of the request_headers is repeated between the two ctmods, and it would be nicer if it was broken out into a reusable method that returns headers with authorization. It should be doable; then we could do something like self.rs.headers.update(build_headers_with_authorization(request_headers, main_window.web_access_tokens, 'github')).

The real problem is when you do the following with the changes I have locally:

  1. Pass a GitLab access token with the PUPGUI_GLA_TOKEN environment variable
  2. Select the DXVK (GitHub) ctmod, loads fine
  3. Select a GitLab ctmod, loads fine (the only one we have right now is DXVK Async)
  4. Go back to a GitHub ctmod; it won't load, because the session headers were updated in the DXVK Async ctmod, and it's trying to access GitHub using the GitLab authorization.

The fix for this is to update all ctmods to use this authorization-header-building logic, so each one selects and applies the correct headers. It should be possible, and would be required anyway, since right now we aren't setting the authorization at startup in MainWindow; instead, we're going to do it per-ctmod.

If no access tokens are passed, everything is fine.


That's my update so far. I'll probably push what I have for now, but be aware of this interim behaviour and that a lot of ctmods will need to be touched, unless you want those changes to go in a separate PR :-) I can also hold off on touching all of those until you can review the current changes.

sonic2kk commented 11 months ago

Another consideration here I just thought of: Any GitHub (or potentially GitLab, in future) API call outside of a ctmod will not use any auth headers by default. This may mean a false rate-limit warning, if a user is rate-limited without an auth token but not with one. Anytime we call to GitHub we should probably check for and include authorization headers, and that will probably be much easier to do once we have a build_headers_with_authorization util function.

sonic2kk commented 11 months ago

I created the generic method to build headers with authorization. It takes in the original headers and sets 'Authorization' to a blank string (GitHub and GitLab interpret this as no authorization and not as an invalid token, which is good). We then try to get a token for a given type, either 'github' or 'gitlab'; if we have a token, we set the authorization header to the relevant token string for that service, or if no token is found, we return early.

This function takes in the original request headers and returns an updated set of headers, so the original headers are never lost, and only Authorization is updated.
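
Based on that description, the method looks roughly like this (reconstructed from the prose above; the real code in this PR may differ slightly):

def build_headers_with_authorization(request_headers: dict, web_access_tokens: dict, token_type: str) -> dict:
    """ Return a copy of request_headers with Authorization set for the given service ('github' or 'gitlab') """
    headers = dict(request_headers)  # keep the caller's original headers intact
    headers['Authorization'] = ''  # blank auth is treated as "no token", not as an invalid one
    token = web_access_tokens.get(token_type)
    if not token:
        return headers
    if token_type == 'github':
        headers['Authorization'] = f'token {token}'
    elif token_type == 'gitlab':
        headers['Authorization'] = f'Bearer {token}'
    return headers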

Right now, this is called before each download call in the DXVK ctmod, but it is not called in the DXVK ctmod __init__. This is because I suspect it updates the MainWindow.rs session headers, and since the DXVK Async ctmod goes last and sets the headers to use the GitLab authorization, the first ctmod, which is the Wine-GE ctmod for Lutris, tries to call GitHub using this GitLab auth and fails. That's why the headers are not updated upon instantiation, and why the headers are updated before each API call.

This would be resolved once all ctmods use the new build_headers_with_authorization method. I'm not super happy about calling self.rs.headers.update each time we make an API call; it's just a stop-gap for now until a better way is figured out.


On that note, to go back to something you said earlier:

We could just create multiple requests sessions or modify the header inside the ctmod. Though I think it might be worth reworking the logic completly as we are currently passing the mainwindow object and reading the requests session from it, thus having a high coupling of the modules. That is not ideal.

For a rework, instead of passing MainWindow, we could pass the authorization tokens from MainWindow and then give each ctmod its own requests.Session. This way they're all independent, but I'm not sure whether having all these session objects is a good idea, or what the best pattern to follow here is...

An alternative could be to have a GitHub and a GitLab session with the auth tokens already set, and somehow get the ctmods to pick between these.
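
For illustration, that could look something like this (purely a hypothetical sketch; only the environment variable names are from this thread):

import os
import requests

def make_service_sessions() -> dict:
    """ Pre-build one authorized requests session per service """
    github = requests.Session()
    if token := os.getenv('PUPGUI_GHA_TOKEN'):
        github.headers.update({'Authorization': f'token {token}'})
    gitlab = requests.Session()
    if token := os.getenv('PUPGUI_GLA_TOKEN'):
        gitlab.headers.update({'Authorization': f'Bearer {token}'})
    return {'github': github, 'gitlab': gitlab}

## Each ctmod would then pick the session matching its CT_URL host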

sonic2kk commented 10 months ago

Was this ready to be merged? I guess the issue noted is somewhat of an edge case related to passing API tokens, which a lot of users may not do (and restarting ProtonUp-Qt without passing this token would fix the issue).

At least, that was my understanding. I just want to make sure this isn't going to cause any fires by being merged :-)

DavidoTek commented 10 months ago

Oh, you're right. I actually converted the ctmods to use their own requests session (removed the main_window.rs part). That allowed updating the headers only once. Seems I haven't pushed that to this PR :facepalm: I will do a follow-up PR with the changes :sweat_smile:

sonic2kk commented 10 months ago

No problem, I just wanted to make sure.

Just curious really for my own information: Is there any impact here to each ctmod using its own requests session? Was there a reason they all came from main_window.rs before?

Since each will have its own requests session, I guess that means the headers for each ctmod can be set when they're initialized too, and since the GitHub/GitLab util methods take a requests session parameter, this will pick up the API token as needed.

DavidoTek commented 10 months ago

Is there any impact here to each ctmod using its own requests session? Was there a reason they all came from main_window.rs before?

Having a shared requests session seemed to be the simplest method, as it only required one requests session object that would contain the necessary headers (GitHub only).

Today I did a test where I created multiple requests session objects and monitored the RAM consumption while doing so (without downloading any files, though). There was no notable impact. I guess the object will only store metadata like the headers, and everything else will just be a reference to the class.

Since each will have its own requests session, I guess that means the headers for each ctmod can be set when they're initalized too, and since the GitHub/GitLab util methods take a requests session parameter, this will pick up the API token as needed.

Yes, that should be possible. I have to check again what's the best method of doing so. Passing self.web_access_tokens (or just accessing it using main_window.web_access_tokens, as we need the main_window reference for STL anyway) seems to be the simplest solution. I'm not 100% sure what's the best way of handling child classes. Currently, your solution is to build the headers there and pass them as the request_headers parameter. I think we can just update/override the headers in the child class after calling super().__init__....., using self.rs.headers.update( build_headers_with_authorization(...) )


EDIT: ctmod_z0dxvk.py's init looks like this:

    def __init__(self, main_window = None):
        super(CtInstaller, self).__init__()
        self.p_download_canceled = False
        self.release_format = 'tar.gz'

        self.rs = requests.Session()
        rs_headers = build_headers_with_authorization({}, main_window.web_access_tokens if main_window else {}, 'github')
        self.rs.headers.update(rs_headers)

ctmod_z1dxvkasync.py's init:

    def __init__(self, main_window = None):
        super().__init__(main_window)

        rs_headers = build_headers_with_authorization({}, main_window.web_access_tokens if main_window else {}, 'gitlab')
        self.rs.headers.update(rs_headers)

All other ctmods will include that part too, and self.rs is removed from main_window. I will test it and create a PR with the changes.

sonic2kk commented 10 months ago

We should be able to refactor other ctmods to use the tag/release fetcher functions from this PR as well once these changes are in place. I don't want to step on any toes, so once we sort out the request objects for all ctmods, I can take a look :-)

It might be nice to have all ctmods use these functions so they all share common functionality and don't have repeated logic. Plus it means we only need to change things in one place in future.

Of course, not a huge priority by any means, but if it works the way I'm imagining it in my head, it may become easier to make that ctmod base class a reality!