Closed: ndrean closed this issue 6 months ago.
Another trial today:
A cold start:
Connected to 66.241.125.29:443 from 192.168.1.5:53615
HTTP/2 502
server: Fly/f9c163a6 (2024-01-16)
via: 2 fly.io
fly-request-id: 01HMKTKPQBGN3XRXEY95MKF2W5-bog
date: Sat, 20 Jan 2024 16:19:17 GMT
Body stored in: /var/folders/mz/91hbds1j23125yksdf67dcgm0000gn/T/tmp893dwse9
  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[    45ms    |      25ms      |     203ms     |      98992ms      |       0ms       ]
             |                |               |                   |                 |
    namelookup:45ms           |               |                   |                 |
                        connect:70ms          |                   |                 |
                                      pretransfer:273ms           |                 |
                                                        starttransfer:99265ms       |
                                                                               total:99265ms
It seems that nothing is persisted on disk and that everything has to be downloaded again? The "warm start" (rerunning the command after the first response) gives totally acceptable results:
  DNS Lookup   TCP Connection   TLS Handshake   Server Processing   Content Transfer
[     1ms    |      27ms      |      29ms     |       485ms       |       1ms       ]
             |                |               |                   |                 |
     namelookup:1ms           |               |                   |                 |
                        connect:28ms          |                   |                 |
                                       pretransfer:57ms           |                 |
                                                          starttransfer:542ms       |
                                                                                 total:543ms
@LuchoTurtle Some thoughts I suppose you already went through. Is Fly pruning the Docker images? And what if you use a Fly volume and reference it as a (persistent) Docker volume? It would be populated once and for all.
You trigger the image model download in Application.ex, so I also need to download the Whisper model there.
I tried loading the models in parallel, but for some reason this doesn't give any speed-up.
# Application.ex
@models_folder_path Application.compile_env!(:app, :models_cache_dir)

@captioning_prod_model %ModelInfo{
  name: "Salesforce/blip-image-captioning-base",
  cache_path: Path.join(@models_folder_path, "blip-image-captioning-base"),
  load_featurizer: true,
  load_tokenizer: true,
  load_generation_config: true
}

@whisper_model %ModelInfo{
  name: "openai/whisper-small",
  cache_path: Path.join(@models_folder_path, "whisper-small"),
  load_featurizer: true,
  load_tokenizer: true,
  load_generation_config: true
}

# (@captioning_test_model is defined the same way; omitted here)

def start(_type, _args) do
  [
    @whisper_model,
    @captioning_prod_model,
    @captioning_test_model
  ]
  |> Enum.each(&App.Models.verify_and_download_models/1)

  # This "async download" (replacing the Enum.each above) isn't faster ???
  # |> Task.async_stream(&App.Models.verify_and_download_models/1, timeout: :infinity)
  # |> Enum.to_list()

  [...]
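For what it's worth, here is a minimal, self-contained sketch of what the parallel variant should do. `fake_download/1` is a hypothetical stand-in for `App.Models.verify_and_download_models/1`, with the actual download replaced by a sleep:

```elixir
# Two independent 200 ms "downloads" run concurrently via
# Task.async_stream, so the total wall-clock time is ~200 ms
# instead of the ~400 ms a sequential Enum.each would take.
fake_download = fn model ->
  Process.sleep(200)
  {:ok, model}
end

models = ["openai/whisper-small", "Salesforce/blip-image-captioning-base"]

{micros, results} =
  :timer.tc(fn ->
    models
    |> Task.async_stream(fake_download, max_concurrency: 2, timeout: :infinity)
    |> Enum.to_list()
  end)

IO.puts("total: #{div(micros, 1000)} ms")
```

If the real downloads don't get faster in parallel, a likely explanation is that they are bandwidth-bound: concurrent downloads share the same network link, so the total transfer time stays roughly the same.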
I've documented everything in https://github.com/dwyl/image-classifier/blob/main/deployment.md regarding deployment to fly.io.
I'm indeed using a volume to store the models, and last time I deployed, everything seemed to be working: the models were downloaded after deploying, on first use, and then reused in subsequent runs. In fact, because of the way the models are served with offline: true, it's programmatically enforced that the models have to be loaded locally, or else the app won't run.
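For context, that offline: true enforcement looks roughly like this (a sketch assuming Bumblebee's {:hf, repo, opts} repository tuple; the cache path here is illustrative, not necessarily the repo's actual config):

```elixir
# With offline: true, Bumblebee resolves the repository from cache_dir
# only; if the files are not cached, it returns an error instead of
# reaching out to the Hugging Face Hub, so the app fails fast at boot.
repo =
  {:hf, "Salesforce/blip-image-captioning-base",
   cache_dir: "/app/.bumblebee", offline: true}

{:ok, model_info} = Bumblebee.load_model(repo)
{:ok, featurizer} = Bumblebee.load_featurizer(repo)
```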
As you know, I've gone through this situation of persisting models quite a few times: first by changing the Dockerfile (as you are aware) and then by moving to the current solution.
I can see your activity on the logs. Here's the volume being mounted:
2024-01-21T20:11:52.780 app[28659e0b5936e8] mad [info] INFO Mounting /dev/vdb at /app/.bumblebee w/ uid: 65534, gid: 65534 and chmod 0755
As you know, when a model is downloaded, a message like [info] Downloading Salesforce/blip-image-captioning-base... appears. That appears to be the case here:
2024-01-21T17:27:02.541 app[28659e0b5936e8] mad [info] 17:27:02.539 [info] Downloading Salesforce/blip-image-captioning-base...
Unless the volumes are actively being pruned when downscaling due to inactivity, I don't understand this behaviour :(
Thank you for sharing httpstat, though; it seems like an awesome tool that I will probably start using :)
Yes indeed, it seems that volumes are pruned when the machine is killed.
Maybe we could save these 3 models into a Postgres blob field (large object)? The DB is persisted, and a db_query/copy_if_not_exists should be a faster option?
I may try this.
That seems like a plausible option (and to be quite frank, probably the only option we have, given that we want the machines to scale down with inactivity). It sucks that we have to resort to a "hacky way" to get it to work :(
But, as much as I'd love to do that, I don't think it's pertinent (at least to my/this repo's scenario). Volumes shouldn't be pruned when downscaled :( . The strategy that is documented should work fine in most cases, so I don't really feel the need to save models in a relational database; it just seems counter-intuitive and may lead beginners to think it's OK when it's not really suitable for this case.
Although I appreciate your feedback (I really, really, really do), you can try it yourself if you want. But I don't see myself hacking my way around saving models into a database and dealing with all the headache that may come along with it. I'm really excited to actually get the audio transcription PRs you've implemented and then work from there :D
I am curious, but you are wise, so this project does not need this. I may look into it one day. From the docs, Postgres seems to discourage this for "big" files, and a single value is limited to 1 GB for the bytea or text types. Note that the models used here contain ~900 MB files, but "large" models are over 1 GB. One point I don't understand, though, is why I can't do a parallel download. I'll keep this for later and seek help.
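If someone does go down that road one day, one way around the 1 GB-per-value limit is to store each model file as several bytea rows. A minimal chunking sketch (module name, table name, and sizes are made up for the demo; a real run might use 256 MB chunks):

```elixir
defmodule ChunkSketch do
  # Split a binary into pieces of at most `size` bytes.
  def chunk(<<>>, _size), do: []
  def chunk(bin, size) when byte_size(bin) <= size, do: [bin]

  def chunk(bin, size) do
    <<head::binary-size(size), rest::binary>> = bin
    [head | chunk(rest, size)]
  end
end

# A 20-byte stand-in for a model file, split into 8-byte chunks.
data = :crypto.strong_rand_bytes(20)
chunks = ChunkSketch.chunk(data, 8)

# Each chunk would become one row, e.g.:
#   INSERT INTO model_chunks (name, idx, data) VALUES ($1, $2, $3)
IO.inspect(Enum.map(chunks, &byte_size/1))
# => [8, 8, 4]
```

Reading the file back is then an ORDER BY idx query whose chunks are concatenated before writing to disk.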
@LuchoTurtle I suppose no one is using the Fly machine yet, so it must have been stopped/pruned since last time. Can you check the status of the Fly volumes now to see if they are still there?
Unfortunately, I can't fly ssh console into the volume to see its contents without initializing a VM (which would prompt the models to be re-downloaded, according to our theory). The best I can do is check the size that is being occupied.
Since the models usually total about 1 GB, I can assume the volume is being cleaned up :/
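For reference, the volumes can at least be inspected from the CLI without SSHing into a machine (the app name and volume ID below are placeholders):

```shell
# List the app's volumes: ID, region, provisioned size, and the
# machine each one is attached to.
fly volumes list -a image-classifier

# Print details for a single volume.
fly volumes show vol_xxxxxxxxxxxx
```

Note that this reports the provisioned size rather than the bytes actually used, so occupancy still has to be checked from inside a running machine or via the dashboard.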
I read about volume forks. Could a fork be permanent??
The new volume is in the same region, but on a different physical host, and is not attached to a Machine. The new volume and the source volume are independent, and changes to their contents are not synchronized.
I understand Dwyl is a "real" customer, aren't you? Any chance of using a fork as a backup? Fly may be more responsive with "real" customers? 🤔
@LuchoTurtle feel free to invite @ndrean to the org: https://fly.io/dashboard/dwyl-img-class/team to debug this.
I used httpstat to get some stats on a cold start vs. a warm start, to get an idea of the state of the current app (only the image-to-text models are loaded). The first run is the cold start shown above; the next run is the "warm" start.