community-scripts / ProxmoxVE

Proxmox VE Helper-Scripts (Community Edition)
https://Helper-Scripts.com
MIT License

Big-bang update (#333) broke most of my LXC updates #404

Open dsiebel opened 6 days ago

dsiebel commented 6 days ago

Please verify that you have read and understood the guidelines.

yes

A clear and concise description of the issue.

In #333 the resource checks were changed in a way that now breaks almost all my LXC updates, e.g.:

⚠️  Required: 1 CPU, 1024MB RAM | Current: 1 CPU, 512MB RAM
Please ensure that the Uptime Kuma LXC is configured with at least 1 vCPU and 1024 MB RAM for the build process.

I fine-tuned the resources of my LXC containers to what they actually need; the new resource checks, however, would require me to:

  1. switch back to default resource values (might not be available on the PVE node)
  2. reboot the container (downtime)
  3. run the update
  4. switch back to chosen values (downtime, again)
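
For reference, a rough sketch of those steps as run from the PVE host, assuming a hypothetical CTID 120 and the Uptime Kuma values from the warning above (pct only works on the host, not inside the container):

  # Sketch of the manual workaround, run on the PVE host (CTID 120 is illustrative)
  pct set 120 --memory 1024 --cores 1   # 1. switch to the values the check requires
  pct reboot 120                        # 2. restart the container (downtime)
  pct exec 120 -- update                # 3. run the update script inside the CT
  pct set 120 --memory 512              # 4. revert to the fine-tuned values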

Am I holding this wrong? Is there a better way to do the updates?

This only happened after running sed -i 's/tteck\/Proxmox/community-scripts\/ProxmoxVE/g' /usr/bin/update to switch to the new repo (😢). The old update scripts still work fine, but will be outdated soon enough.

To be honest, I think the previous approach of briefly over-provisioning the container for resource-intensive build processes was perfectly fine, especially considering the resources were reverted to the chosen/previous values after install or update, which made this effortless for the user.

What settings are you currently utilizing?

Which Linux distribution are you employing?

Debian 12

If relevant, including screenshots or a code block can be helpful in clarifying the issue.

No response

Please provide detailed steps to reproduce the issue.

  1. Adjust the resources on any LXC container, so they deviate from the defaults.
  2. Migrate the update scripts to the new community-scripts repo: sed -i 's/tteck\/Proxmox/community-scripts\/ProxmoxVE/g' /usr/bin/update
  3. Run update

MickLesk commented 5 days ago

Under-provisioning LXC containers is generally not a recommended practice. Containers only use the resources they actually need, regardless of what is allocated to them. For example, even if you assign 64 GB of RAM, a container requiring only 250 MB will only use that amount. Setting resource limits too low can lead to avoidable issues, particularly during resource-intensive tasks like build or update processes.
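
For illustration, inside an LXC that has been assigned far more memory than it needs, free reports the full allocation as the total while actual usage stays small (numbers below are made up):

  # Inside a container assigned 64 GB RAM but idling around 250 MB:
  free -m
  #                total   used   free
  # Mem:           65536    250  65286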

We introduced these checks to ensure stability and avoid scenarios where builds fail or installations break due to insufficient resources. This was a frequent issue in the past and caused significant frustration for users. While the old approach of temporarily over-provisioning resources might have worked in some cases, it also introduced its own challenges, especially for users who were unaware of these requirements.

The current method ensures that containers meet the minimum resource requirements upfront, providing a more reliable experience overall. If you prefer to maintain custom resource limits, you could consider temporarily increasing resources before running updates and then reverting them afterward. However, consistently meeting the minimum requirements for updates will likely save you time and effort in the long run.

dsiebel commented 5 days ago

That is all true and valid, and I totally understand why it was done. I too once had issues with a container that had insufficient resources to run a Go build during an update.

Another reason for setting different resource limits, specifically memory limits, is to help with garbage collection. While it is true that containers only consume the resources they need, many (if not all) garbage-collected runtimes obey cgroup limits, which has a direct impact on garbage collection and thus on consumed resources. Hence, running at lower resource limits can reduce the amount of resources a container thinks it needs.
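
For illustration, the limit such runtimes consult is the container's cgroup memory limit, visible from inside the guest (which of the two paths exists depends on whether the host runs cgroup v2 or v1):

  # cgroup v2 (default on recent hosts):
  cat /sys/fs/cgroup/memory.max
  # cgroup v1 (older hosts):
  cat /sys/fs/cgroup/memory/memory.limit_in_bytes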

I am running 40 LXC containers on my Proxmox VE. Having to temporarily update the resource settings to run an update turns a simple, automatable task into a full day of work (see steps involved above).

So, yeah, I basically wanted to gauge the interest in changing the current approach. One could think about still allowing updates if the container's resource "requirements" aren't met, maybe with a simple y/n prompt.

MickLesk commented 5 days ago

The problem is: if I add the Y/N question and we update a resource-intensive script like Vaultwarden, and the user only has 4 GB RAM, the installer can crash mid-update, leaving us with nothing but bugfixing. That is if a fix is even possible, because the process may have deleted the old folder beforehand. In other words, we are talking about potentially major data loss, which will ultimately land back on us.

One idea would have been to allow a 10-25% deviation from the recommended resources, but even then we can't ensure that everything runs, so we'd get bug reports again.

The right way would rather be to size the LXCs properly: e.g. if LXC “Test” has 2 cores and 2 GB RAM according to our specification but only needs 786 MB RAM even when updating, you could scale it down.

dsiebel commented 5 days ago

> The problem is: if I add the Y/N question and we update a resource-intensive script like Vaultwarden, and the user only has 4 GB RAM, the installer can crash mid-update, leaving us with nothing but bugfixing. That is if a fix is even possible, because the process may have deleted the old folder beforehand. In other words, we are talking about potentially major data loss, which will ultimately land back on us.

I understand. But having a prompt with a prominent warning would make it an easy close, no? Maybe differentiate between scripts that run builds and scripts that don't? So I guess what I am asking here is to move some responsibility back to the user (#keinBackupKeinMitleid, German for "no backup, no pity" 😅).

> One idea would have been to allow a 10-25% deviation from the recommended resources, but even then we can't ensure that everything runs, so we'd get bug reports again.

Yeah, I don't think 10-25% is going to cut it. In most cases I could cut the resources to half of the recommended values or less.

> The right way would rather be to size the LXCs properly: e.g. if LXC “Test” has 2 cores and 2 GB RAM according to our specification but only needs 786 MB RAM even when updating, you could scale it down.

That is kind of the problem. If I scale the container down but the specs are set to higher values, it won't let me update. Or am I misunderstanding your point here?

havardthom commented 5 days ago

Adding a data loss warning and requiring a full "yes" in a prompt like this might be acceptable, @MickLesk?

  if [[ "$current_ram" -lt "$var_ram" ]] || [[ "$current_cpu" -lt "$var_cpu" ]]; then
    echo -e "\n⚠️${HOLD} ${GN}Required: ${var_cpu} CPU, ${var_ram}MB RAM ${CL}| ${RD}Current: ${current_cpu} CPU, ${current_ram}MB RAM${CL}"
    echo -e "${YWB}Please ensure that the ${APP} LXC is configured with at least ${var_cpu} vCPU and ${var_ram} MB RAM for the build process.${CL}\n"
    read -r -p "⚠️ May cause data loss! ⚠️ Continue update with under-provisioned LXC? <yes/No>  " prompt  
    # Check if the input is 'yes', otherwise exit with status 1
    if [[ ! ${prompt,,} =~ ^(yes)$ ]]; then
      echo -e "❌${HOLD} ${YWB}Exiting based on user input.${CL}"
      exit 1
    fi
  else
    echo -e ""
  fi
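
For context, a hedged sketch of how current_ram and current_cpu could be derived from inside the container; this is an assumption for illustration, not necessarily how the helper scripts actually gather them:

  # Illustrative only: resources as seen from inside the CT.
  current_ram=$(free -m | awk '/^Mem/ {print $2}')  # total MB visible to the guest
  current_cpu=$(nproc)                              # CPUs visible to the guest
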
MickLesk commented 5 days ago

We can do that. If this causes a lot of issues in the near future, we revert.

dsiebel commented 5 days ago

I can also imagine a rollback mechanism that would not necessarily incur data loss: keep the old release next to the new one and switch symlinks between them. It's a pattern I've seen quite often in real-world projects. Not sure if that's feasible for all the scripts, since I don't know all of them (yet), but it's definitely something I'd be willing to give a try when I find the time.
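
A minimal sketch of that symlink-switch pattern, with hypothetical paths and version names; the mv -T rename makes the switch atomic, so there is never a moment without a current release:

  # Hypothetical layout: /opt/myapp/releases/<version>, with 'current' pointing at one of them.
  APP_DIR=/opt/myapp
  NEW=1.3.0
  mkdir -p "$APP_DIR/releases/$NEW"
  # ... download/build the new version into "$APP_DIR/releases/$NEW" ...
  ln -s "releases/$NEW" "$APP_DIR/current.new"     # stage a new symlink
  mv -T "$APP_DIR/current.new" "$APP_DIR/current"  # atomic switch; old release stays on disk
  # Rollback: repoint 'current' at the previous release directory the same way.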

MickLesk commented 5 days ago

Do an example with 2-3 CT.sh scripts and then we'll take a look.

dsiebel commented 5 days ago

I think I will look at these for the implementation, for reasons mentioned below:

If you have other suggestions for which ones to look at, please throw them my way. Will there still be an implementation of @havardthom's suggestion in the meantime?

EDIT: While looking through a few more update scripts, I noticed that in many cases, zigbee2mqtt for example, the dreaded data loss can easily be avoided by properly configuring the runtime so user data is not stored inside the git checkout, i.e. by setting ZIGBEE2MQTT_DATA.

EDIT2: paperless-ngx might be an interesting candidate too, for the same reason as the above (data paths are not set to an external dir).
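
For example, a sketch of the zigbee2mqtt case mentioned in the first edit (paths illustrative): with the data directory outside the checkout, an update can replace /opt/zigbee2mqtt wholesale without touching user data.

  # Keep user data outside the git checkout:
  mkdir -p /opt/zigbee2mqtt-data
  # e.g. in the zigbee2mqtt systemd unit:
  # [Service]
  # Environment=ZIGBEE2MQTT_DATA=/opt/zigbee2mqtt-data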

Aloe-recite commented 4 days ago

Hi, sorry for intervening in a technical discussion despite being a complete newbie. I'm pretty sure that allocating RAM and CPU to an LXC does not require a reboot (unlike VMs). Having to modify them manually before and after updating is still a hassle, but at least there is no downtime. Maybe the update process could temporarily assign the required resources for the update and then revert them to what they were before? Since the assigned resources are not actively used unless needed and act more as a cap/limit than a real assignment, this shouldn't wreak havoc, although it might require running the update from the host to interact with the resources (unless a container is allowed to alter them from inside, which seems unlikely).

Honestly, I'll use the future updates to revert to the default resources. I manually tweaked many of them, but I recognize I did it for no real, practical, tested and tangible reason... Sorry if I wasted your time by saying something stupid!

MickLesk commented 4 days ago

Yes, you are right, LXC resources can be adjusted as desired during active operation.

This is not possible in the update script because we are not on the Proxmox main node. In other words, you cannot manipulate the RAM and CPU values from inside the container; this is only possible from the main node. This is the reason why the install scripts once contained a “Set resources to normal” step: it was executed on the Proxmox node and was therefore technically possible.
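
For illustration, the live adjustment described above is done with pct on the PVE host (CTID 104 is illustrative); the guest has no equivalent tool:

  # On the Proxmox host: adjust a running CT's limits, no reboot needed.
  pct set 104 --memory 2048 --cores 2
  pct config 104 | grep -E '^(memory|cores)'
  # Inside the container there is no pct, so an update script running in the
  # guest cannot raise its own limits.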

To realize that, all >220 scripts would have to be completely rebuilt and completely new logic created. That's not going to happen, also from a security point of view.

Aloe-recite commented 4 days ago

Thanks, makes perfect sense. Sorry for butting in!