acorn-io / runtime

A simple application deployment framework built on Kubernetes
https://docs.acorn.io/
Apache License 2.0
1.13k stars · 100 forks

Allocating a volume via Acornfile behaves differently than via command line #2335

Open randall-coding opened 10 months ago

randall-coding commented 10 months ago

I'm using Acorn Pro with default region set to "Acorn US East"

The Acornfile approach yields an error, but the command line approach works. Below is the relevant code and the error.

Acornfile:

volumes: db: {
    size: "1G"
    accessModes: "readWriteOnce"
}

volumes: config: {
    size: "1G"
    accessModes: "readWriteOnce"
}

Command line:

acorn run -s opensupports:opensupports \
   -v config,size=1Gi -v db,size=1Gi opensupports

Error using Acornfile:

STATUS: ENDPOINTS[] HEALTHY[0] UPTODATE[0] quota allocation failed: not enough region quota: quota would be exceeded for resources: VolumeStorage; (container: mariadb): pending; (container: website): pending; (volume: config): pending; (volume: db): pending; (secret: env): pending

I'm guessing this has to do with whatever the default storage class is being set to, but I could be wrong.

tylerslaton commented 10 months ago

Hi @randall-coding, thanks for submitting this issue. You are hitting a quota system in our SaaS that does not allow you to exceed certain limits on resources. To check your usage, go into the SaaS UI (https://acorn.io), click on your username in the top right, and then go to Plan and Usage. If your quota is met or exceeded, you will see this error.

I do believe there is a bug here where the volume is briefly double-counted when declared in the Acornfile, and that is something I will fix. In the meantime, though, this issue will go away given enough time for us to re-reconcile the quota (about every minute).

randall-coding commented 10 months ago

Thanks @tylerslaton. I checked Plan & Usage just now, and it looks like before my last deploy it showed:

Volume storage: 40 GiB used of a 50 GiB limit

So if the bug is over-counting the volumes, it may be doing more than double counting. In my Acornfile I have each volume set to 1G, 2 G in total, so it would seem the request is being counted at least 5x.
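To sanity-check that estimate: with 40 GiB of a 50 GiB quota already used, only 10 GiB of headroom remains, so a 2 GiB request can only trip the quota if it is counted at least five times. A quick back-of-the-envelope check (plain Python, using the Plan & Usage numbers reported above):

```python
# Numbers from the Plan & Usage page before the failing deploy.
used_gib = 40
limit_gib = 50
requested_gib = 2  # two 1 G volumes in the Acornfile

headroom_gib = limit_gib - used_gib  # 10 GiB still available

# Smallest integer multiplier at which the request would meet or
# exceed the quota (the SaaS errors when quota is met or exceeded).
multiplier = 1
while used_gib + requested_gib * multiplier < limit_gib:
    multiplier += 1

print(headroom_gib)  # 10
print(multiplier)    # 5 -> the 2 GiB request must be counted at least 5x
```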

Just wanted to make sure the issue was logged. Unfortunately, searching through GitHub issues is a bit limited, so apologies if I missed an existing one.

tylerslaton commented 10 months ago

Thanks a ton for submitting the issue! We'll get to work on this problem; in the meantime, waiting briefly in that state should eventually reconcile to the actual amount of storage. Is that the case for you as well?

randall-coding commented 10 months ago

I will give that a try. I'm working on something else tonight so I will follow up a little later.

tylerslaton commented 10 months ago

Additionally, would you mind providing the full Acornfile that you used, @randall-coding?

randall-coding commented 10 months ago

Today the deployment works when allocating via the Acornfile, with no allocation errors. This is even though I have less available space than before (since I didn't delete my first good deployment, which used command-line allocation).

Here is the Acornfile. I just changed the volume names to add a "2" to them, since my other deployment has volumes with the same names.

containers: {
    website: {
        image: "gamelaster/opensupports:latest"
        ports: publish: [
            "80:80/http"
        ]
        env: {
            TIMEZONE: "secret://env/timezone"
        }
        dirs: {
            // "/config": "volume://data?subpath=web"
            "/config": "volume://config2"
        }
        dependsOn: ["mariadb"]
    }
    mariadb: {
        image: "mariadb"
        dirs: {
            // "/var/lib/mysql": "volume://data?subpath=db"
            "/var/lib/mysql": "volume://db2"
        }
        env: {
            MYSQL_USER: "opensupports"
            MYSQL_DATABASE: "opensupports"
            MYSQL_RANDOM_ROOT_PASSWORD: "true"
            MYSQL_PASSWORD: "secret://env/mysql_password"
        }
    }
}

secrets: env: {
    external: "opensupports"
}

// TODO trying to use subpath, but volume hangs on provisioning
// volumes: data: {
//     size: "10G"
//     accessModes: "readWriteMany"
// }

volumes: db2: {
    size: "1G"
    accessModes: "readWriteOnce"
}

volumes: config2: {
    size: "1G"
    accessModes: "readWriteOnce"
}
randall-coding commented 10 months ago

I will keep an eye on the Usage page next time this happens.

randall-coding commented 10 months ago

I wonder if this issue stemmed from some kind of caching as well as the double counting. I recall that I initially had the storage set higher in the Acornfile when I saw the error. I then deleted the deployment plus orphaned resources and lowered the storage request each time.

smw355 commented 10 months ago

It feels like the real bug here might be not clearly explaining the limitation that is being hit?

randall-coding commented 10 months ago

> It feels like the real bug here might be not clearly explaining the limitation that is being hit?

I was thinking that too. It would be helpful if the error explained the numbers involved. Still, it seems like there is an issue with over-counting the allocation when using the Acornfile.

I ran it with the Acornfile multiple times, requesting only 2G of volume storage in total, and it failed. Then I did the same thing with command-line allocation and it worked. So there is likely some different behavior going on.

The same thing happened with the cs2 server's 40G volume. It was "stuck" on this allocation limit with the Acornfile, but command-line allocation then worked with no errors. I was deleting and redeploying each time.

randall-coding commented 10 months ago

edit: I was about to explain a new finding, but nvm, that was human error.

tylerslaton commented 10 months ago

TL;DR: there's definitely an issue we need to fix on our side. For your specific case, waiting a minute should cause the issue to go away on its own.

@randall-coding Just getting another chance to take a look at this. Using your Acornfile, I was able to replicate your issue. There is a race condition on our side that we need to fix. The way that we count Volumes at application create time is slightly off which causes them to be counted against quota multiple times (sometimes not just double). I also confirmed that this is reconciled after 1 minute of time passes as a result of some continuous syncing logic that we implemented for edge cases like this.