allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Other

After upgrading to 1.16.0, images don't load in web UI. #248

Open mathematicalmichael opened 3 days ago

mathematicalmichael commented 3 days ago

I'm using the docker-compose stack. Basically everything to recreate my setup is here: https://github.com/ml-starter-packs/clearml-lightning except I bumped my version to 1.16.0.

After upgrading my image tags to the latest release, I noticed that the clearml-fileserver emits `Error getting token` whenever I try to load images in the Plots tab.

image

Data still works fine, and oddly enough so do Debug Samples, despite them also coming from the fileserver. Those load fine...

image

(top: manual download from web UI in artifacts tab... works fine, seems to authenticate happily) (middle: errors when I load the Plots tab, first screenshot) (bottom: opening the Debug Samples tab)

Downgrading to 1.15.1 (the version used in the repo linked above) restores all images in the Web UI.

jkhenning commented 2 days ago

Hi @mathematicalmichael,

The new server version adds built-in authentication for the fileserver. I assume that for some reason (perhaps due to the fileserver URL you're using?) the WebApp does not identify the fileserver and thus does not attach the cookie when trying to download (the SDK obviously does that).

You can try disabling this feature in the fileserver (using the `fileserver.conf` file) by setting `auth.enabled: false` (you can also do that in the docker-compose file or in the docker compose override file with an environment variable) and see if it helps.

mathematicalmichael commented 2 days ago

thanks. yes, I know the new version has auth, which is exactly what I want / need (in fact). So I do not want to disable it (though it's no better than downgrading, I know).

Could it be because the fileserver is not at the subdomain "files.."?

(unfortunately I don't have control over subdomain names)

I do however, have some cycles to try and fix it, if it's possible. I just need guidance on the cause of the issue / feasibility of solutions.

jkhenning commented 2 days ago

In general, the server is configured to place the cookie with a specific domain. I assume the cookie is simply not propagated to the fileserver since it's hosted under a different domain name. If the two services are hosted under some parent domain name (like app.my-domain.com and files.my-domain.com), it's possible to set the cookie domain to the common domain name (e.g. .my-domain.com). Can you share the pattern of the domains you're using?
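
To make the cookie-scope point concrete, here is a minimal sketch of the suffix rule browsers apply when deciding whether a cookie's domain covers a request host. It is a simplified version of the domain-match rule from RFC 6265 (it ignores IP addresses and public-suffix checks), it is not ClearML code, and `domain_match` plus the `someone-else` host are hypothetical names used only for illustration.

```python
def domain_match(request_host: str, cookie_domain: str) -> bool:
    """Simplified RFC 6265 domain-match: exact host or a dot-separated suffix."""
    host = request_host.lower()
    domain = cookie_domain.lower().lstrip(".")  # a leading dot is ignored
    return host == domain or host.endswith("." + domain)


# A cookie scoped to the shared parent domain reaches both services.
print(domain_match("app.my-domain.com", ".my-domain.com"))    # True
print(domain_match("files.my-domain.com", ".my-domain.com"))  # True

# A cookie scoped to the app host is not sent to the fileserver host.
print(domain_match("files.my-domain.com", "app.my-domain.com"))  # False

# The parent-domain scope also covers every other subdomain (hypothetical host).
print(domain_match("someone-else.my-domain.com", ".my-domain.com"))  # True
```

The last call shows the trade-off: widening the cookie domain to a common parent makes the cookie eligible for every host under that parent, not only the two services.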

mathematicalmichael commented 2 days ago

@jkhenning thank you! so it sounds like my suspicion might have been directionally correct and that the cookie's scope is missing our URLs.

The networking setup I am constrained to with this particular ClearML deployment has the following structure:

https://<port>-<hash tied to EC2 instance>.<domain>.<tld>

so my setup is

https://8080-....site.com
https://8081-....site.com
https://8008-....site.com

setting it to .site.com would be a security concern: way too broad a scope. each EC2 instance gets its own URL.

I wrote this part of the ClearML docs:

image

so I very much remember dealing with this on an earlier deployment (but one where I had control over subdomain names)

I was surprised when the deployment "just worked" with this new domain mapping (for this deployment), but I realize now that was because the fileserver was totally insecure until 1.16.0, so the domain didn't matter. We've been using these urls for six months now, so I'm not sure the aforementioned docs are "exactly correct" anymore.

mathematicalmichael commented 2 days ago

that all said... take a look at my logs again. Notice that the Debug Images load just fine from the web app, and they're served behind the same backend fileserver URL.

So... what does that tell us about that cookie's scope... When one tab in the ClearML Web UI is able to load assets from the fileserver, but the neighboring tab does not???

jkhenning commented 2 days ago

Ah, this might be a WebApp issue. Some plots (which are too complicated to be stored as a plotly object) are stored as an image, but the link is embedded in the plot object, which means the WebApp has to parse it and decide whether to attach the cookie there. I think the WebApp only knows how to automatically do that for the standard port variants and the standard subdomains. You should be able to explicitly specify the fileserver URL to the webapp by adding the following env var to the webapp service: `WEBSERVER__fileBaseUrl=https://8081-....site.com`

mathematicalmichael commented 2 days ago

ooh I'll try that env var! thank you!

but I'm not sure that explains why Debug Images work while Plotly image embeds do not. Is it because the two structure the urls differently?

(and I explicitly save some as images for better control over formatting - e.g. histograms. I send some to Debug and some to the Plots tab. Debug tab works, Plot does not. same underlying fileserver url structure, but console logs show 401 only on the latter)

is the scope of the cookie a problem given how the urls are structured? other customers (not us) using the same reverse proxy would have urls with the same domain name, and I don't want those to be valid against my instance...

mathematicalmichael commented 2 days ago

    environment:
      CLEARML_WEB_HOST: ${CLEARML_WEB_HOST:-}
      CLEARML_API_HOST: ${CLEARML_API_HOST:-}
      CLEARML_FILES_HOST: ${CLEARML_FILES_HOST:-}
      WEBSERVER__fileBaseUrl: ${CLEARML_FILES_HOST:-}

yields

Error parsing WEBSERVER__fileBaseUrl JSON value `https://8081-.....com/`: Expecting value: line 1 column 1 (char 0)

if I prepend `CLEARML_` to the front of it... it does not complain. The UI does work with that env var set on 1.15.1

and upgrading to 1.16.0 with that prepended env var does not bring back the images (had to force refresh to avoid browser cache tricking me)
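
The wording `Expecting value: line 1 column 1 (char 0)` matches the message Python's standard `json` module raises when its input does not start with a valid JSON token. Assuming the server passes the env var's raw value to such a parser (not confirmed in this thread), a minimal sketch of the difference between a bare URL and a JSON string literal looks like this. The URL is a hypothetical placeholder and `parses_as_json` is an illustrative helper, not part of ClearML.

```python
import json

# Hypothetical placeholder, not the real fileserver address.
url = "https://files.example.com/"


def parses_as_json(raw: str) -> bool:
    """Return True if `raw` is valid JSON, printing the parser error otherwise."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError as err:
        print(f"{raw!r}: {err}")
        return False


# A bare URL is not valid JSON: it starts with 'h', which begins no JSON value.
parses_as_json(url)

# json.dumps wraps the URL in double quotes, producing a JSON string literal.
parses_as_json(json.dumps(url))
```

This only shows what the parser would need to receive. Whether the quote characters actually reach the container intact depends on how the compose file and the shell handle them.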

jkhenning commented 2 days ago

I guess you should put it in quotes?

mathematicalmichael commented 2 days ago

one thing I noticed poking around the console: the requests that are getting the 401 from Plot tab do not have a cookie set in the request header. the requests that succeed from the Debug Samples tab do have a cookie set in the request header

mathematicalmichael commented 1 day ago

> I guess you should put it in quotes?

tried that, both single and double quotes still throw the same message.

I'm pretty sure the problem is that the cookie isn't set by the template that renders out the plotly images.

jkhenning commented 1 day ago

It's possible docker compose removes the quotes. Can you perhaps try: `WEBSERVER__fileBaseUrl: \"${CLEARML_FILES_HOST:-}\"`

mathematicalmichael commented 1 day ago

@jkhenning unfortunately that also throws the same `Error parsing` error.

to my comment about the browser Inspect tool showing a missing cookie (but valid artifact url) in the requests that are 401'ing... could this possibly explain the situation? (cookie not set in the first place)