bayasdev / envycontrol

Easy GPU switching for Nvidia Optimus laptops under Linux
MIT License
1.26k stars 61 forks source link

Add transparent caching, --cache-create, --cache-delete and --cache-q… #155

Closed klmcwhirter closed 7 months ago

klmcwhirter commented 7 months ago

…uery options

~As mentioned elsewhere please DO NOT MERGE until a commit shows up with fixes 79 in it.~

~TODO:~ ~ more testing~ ~ update README - https://github.com/klmcwhirter/envycontrol?tab=readme-ov-file#%EF%B8%8F-usage - what else?~

klmcwhirter commented 7 months ago

@bayasdev I installed akmod-nvidia from rpmfusion on my Fedora 39 install and tested all state transitions. All tests passing.

I believe we are good to merge this now.

Thanks for all the help!

bayasdev commented 7 months ago

Thanks @klmcwhirter!

I'll check this on the weekend but LGTM and works as expected on my Ubuntu 23.10 test machine.

However, I don't think pushing a new release immediately after merging this, so we can see if anyone living on the edge pulls EnvyControl from main branch and reports an issue related to the cache system 😆

klmcwhirter commented 7 months ago
 def create_cache_obj(self, nvidia_gpu_pci_bus):
        from datetime import datetime
        return {
            'switch': {
                'nvidia_gpu_pci_bus': nvidia_gpu_pci_bus
            },
            'metadata': {
                'audit_iso_tmstmp': datetime.now().isoformat(),
                'args': vars(self.app_args),
                'amd_igpu_name': get_amd_igpu_name(),
                'current_mode': self.current_mode,
                'display_manager': get_display_manager(),
                'igpu_pci_bus': get_igpu_bus_pci_bus(),
                'igpu_vendor': get_igpu_vendor(),
            }
        }

We can cache only the nvidia_gpu_pci_bus for now, major changes should be delayed for version 4. That would simplify a lot this PR.

It won't change much, but as promised earlier I will make the change here and in README.md.

You see one of the most challenging aspects of troubleshooting a risky system change like this is having enough visibility into the information used by a process to make decisions. And, you can never predict which information might be needed given a particular scenario. It is best to have everything.

Putting it all in the new cache file is the least risky approach - as opposed to, say, adding a set of rolling log files in /var/log/envycontrol/...

Change made. I will puish it up once I have the README adjustments made and perform asnother round of testing.

Here is what the cache file looks like now:

$ sudo python -m envycontrol --cache-query
{
    "switch": {
        "nvidia_gpu_pci_bus": "PCI:1:0:0"
    }
}
klmcwhirter commented 7 months ago

Please update the object so it looks like this:

{
    "nvidia_gpu_pci_bus_id": "PCI:1:0:0"
}

That will take more time and retesting. I will look at it tomorrow when I have more time.

bayasdev commented 7 months ago

@klmcwhirter

You see one of the most challenging aspects of troubleshooting a risky system change like this is having enough visibility into the information used by a process to make decisions. And, you can never predict which information might be needed given a particular scenario. It is best to have everything.

If the user does not provide enough system logs we can mark their issue as invalid.

Putting it all in the new cache file is the least risky approach - as opposed to, say, adding a set of rolling log files in /var/log/envycontrol/...

Writing to syslog is a good idea for version 4, that would be much cleaner than caching user's system data that will eventually get stale.

Edit: If you feel comfortable you can add a trace command in another PR (or delay it for v4) that gets fresh data from the user's system for troubleshooting purposes. It's always possible for an user (specially Linux ones) to change their DE/WM, etc so caching such data makes no sense.

klmcwhirter commented 7 months ago

Please update the object so it looks like this:

{
    "nvidia_gpu_pci_bus_id": "PCI:1:0:0"
}

That will take more time and retesting. I will look at it tomorrow when I have more time. Oops - this was not posted yesterday.

I went through the 6 switch operations and 6 reboots to test and I think it looks good.

But, that is it. Take it or leave it. I won't be making any more changes and think the last 2 are a mistake.

As mentioned I will be using the version from my tests branch. The reason is that I am only getting about 2-1/2 hr battery life in integrated mode. Which means I either need to find the magic bullet thatr is missing in the config mods, or accelerate the process of buying a laptop without nvidia hardware. FYI.

Push is coming in a minute.

This is what my test plan looks like.

  1. envycontrol --reset and reboot
  2. envycontrol -s integrated and reboot
  3. envyvcontrol -s nvidia and reboot # use cache in this step
  4. envycontrol -s hybrid and reboot
  5. envycontrol --cache-delete # force recreating cache in next step
  6. envycontrol -s nvidia and reboot
  7. envycontrol -s integrated and reboot

All state transitions happened without issue.

Enjoy!

bayasdev commented 7 months ago

Thanks @klmcwhirter! this is exactly what I needed the cache to be. Will issue a new release in the following hours.

klmcwhirter commented 7 months ago

Glad I could help. As mentioned, my tests branch will remain and I'll be adding more details to help troubleshoot to the cache file there.

Feel free to borrow code as you would like from that branch.

But to be honest, I probably will time box my effort there.

I have 2 projects on hold while I have been working on this. One of them is my pi-day-2024-with-py project ;)