ConservationMetrics / superset-deployment

Deploy Apache Superset to Azure App Service
Apache License 2.0

`FileBrowser` and Azure deployment issues: persisting links and CSP configuration #19

Closed: rudokemper closed this 8 months ago

rudokemper commented 10 months ago

Adding this issue here because we don't have a repository for FileBrowser, and it felt more logical to create a ticket here than in frizzle or any of the GC repositories, since this repo specifically has to do with deploying a third-party tool on Azure...

Currently, we are using FileBrowser to share persistent links to media folders, for use in GuardianConnector Views and elsewhere.

However, whenever a FileBrowser app service is rebooted, all session settings are reset, including any share links generated within the app.

We need to explore whether it's possible to create share links to media folders that will persist across FileBrowser sessions, or determine if another tool is better suited for our use case. Luandro from Dd filed an issue for EDT-Offline / Kakawa to explore Filestash instead.

rudokemper commented 9 months ago

There is an additional issue that has come to light with our current Azure storage account + FileBrowser solution: namely, that any HTML URLs are downloaded directly instead of rendered in the browser, and while appending ?inline=true to the URL does allow the HTML file to load, any external JS/CSS resources fail to load due to CSP violations.

This is an issue for the output.html file generated by GuardianConnector Change Detection, which uses MapLibre GL JS and the maplibre-gl-compare library to visualize the before/after scenes for a change detection alert.

Some of the errors encountered in such an output.html file (the other external resources log similar complaints):

output.html:1 Refused to load the script 'https://unpkg.com/maplibre-gl@3.3.1/dist/maplibre-gl.js' because it violates the following Content Security Policy directive: "script-src 'none'". Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.
Refused to load the stylesheet 'https://unpkg.com/maplibre-gl@3.3.1/dist/maplibre-gl.css' because it violates the following Content Security Policy directive: "style-src 'unsafe-inline'". Note that 'style-src-elem' was not explicitly set, so 'style-src' is used as a fallback.
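The fallback behavior these errors mention ('script-src-elem' was not explicitly set, so 'script-src' is used) can be sketched with a toy model. Everything below is a simplified illustration, not FileBrowser's or the browser's actual code, and it ignores host allowlists:

```javascript
// Hypothetical helper illustrating how browsers pick the effective CSP
// directive for <script> elements, in fallback order:
// script-src-elem -> script-src -> default-src.
function effectiveScriptPolicy(csp) {
  return csp['script-src-elem'] ?? csp['script-src'] ?? csp['default-src'] ?? null;
}

function isScriptAllowed(csp, scriptIsSameOrigin) {
  const policy = effectiveScriptPolicy(csp);
  if (policy === null) return true;            // no CSP header: nothing is blocked
  if (policy.includes("'none'")) return false; // 'none': all scripts blocked
  if (policy.includes("'self'")) return scriptIsSameOrigin;
  return false;                                // simplified: host allowlists ignored
}

// A FileBrowser-style restrictive policy blocks the cross-origin unpkg.com script:
console.log(isScriptAllowed({ 'default-src': ["'self'"] }, false)); // false
// No CSP header at all (plain blob storage): the same script loads:
console.log(isScriptAllowed({}, false)); // true
```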

I tried to go down an Azure rabbit hole to see if I could adjust the CSP directives in the storage account configuration, but I could not figure it out and had to move on to other work. I'm also not sure if this issue truly sits with Azure, or rather with FileBrowser, which does not provide any documentation or configuration options for CSP.

I can continue to look into this but input welcome.

IamJeffG commented 9 months ago

My intuition is that a filebrowser-esque piece of software should not allow running javascript on the pages it serves. I think of a filebrowser similarly to the GitHub website: it serves files as files, not as "apps."

I will think on this and have a look around at what else is out there.

IamJeffG commented 9 months ago

bad news about filestash: no azure support

IamJeffG commented 9 months ago

Both Google Cloud Storage and Azure Storage File Shares appear not to enforce any CSP when accessing a blob directly (direct access is enabled either by making the blob public in blob storage, or by using a shared access token).

Here's an example URL with a SAS token that I enabled for this blob only, which expires in 5 hours: https://jgplayground.file.core.windows.net/jgfileshare/output.html?sv=2023-01-03&st=2024-01-02T20%3A23%3A37Z&se=2024-01-03T02%3A23%3A00Z&sr=f&sp=r&sig=r4UtHqo6e5OmCJ94MZrq4Wbn2gqOU8zR%2FjCQJpSuvN8%3D

There are two things going on in this GH issue, and I feel like we should first tackle the harder problem of how to host JS-powered change detection alerts, and only afterward tackle the easier problem of persisting login/share links.

IamJeffG commented 9 months ago

I now understand the reason behind my previous comment's behavior:

Azure and GCP do not provide any Content-Security-Policy header. The absence of this header means the wild west: everything is allowed. FileBrowser does provide this header, and its default-src 'self' is very restrictive, limiting content to the same origin (i.e. same domain). This helps protect against loading/executing malicious content from other domains. Or heck, it also prevents some (theoretical) "phone-home tracker" on the blob from reporting file access back to some 3rd party service.

The expedient thing to serve change detection alerts is to set a permissive (or no) Content-Security-Policy, but that might not be the right thing to do (for safety reasons)!

I guess if I were creating this from scratch, we would use a webserver other than FileBrowser to host and render the change detection view (currently output.html), and this webserver (i) sets a restrictive Content-Security-Policy and (ii, optional) ideally self-hosts the 3rd party scripts, i.e. at other endpoints on the same origin. I wonder if there is a world in which the GCV app itself hosts output.html, from side files (binary images, GeoJSON) that are in blob storage.

A middle ground is that GCCD still generates and stores a static output.html, but GCV reads it from blob storage and serves it on its own route with the appropriate headers. For example, a route like:

https://bcm-views.guardianconnector.net/change_detection/alerts/2/2023/11/202311200021321/output.html

serves

const fs = require('fs');
const path = require('path');

app.get('/change_detection/*/output.html', (req, res) => {
  // Read the blob content -- either from Azure directly, or via FileBrowser
  // (here: from a volume-mounted path; ALERTS_DIR is illustrative)
  const blobContent = fs.readFileSync(
    path.join(ALERTS_DIR, req.params[0], 'output.html'), 'utf-8');

  // Set the response headers; allow the MapLibre assets from unpkg.com
  // (self-hosting them instead would permit a stricter policy)
  res.setHeader('Content-Type', 'text/html');
  res.setHeader('Content-Security-Policy',
    "default-src 'self'; script-src 'self' https://unpkg.com; style-src 'self' https://unpkg.com");

  // Send the blob content as the response
  res.send(blobContent);
});

Nothing is set in stone yet, but what do you think about something like this? Am I barking up the wrong tree? It does lock you into only accessing the alert from GCV, but OTOH GCV is what takes care of auth for you.

...acknowledging that we haven't touched the persisting of login/share links yet.

rudokemper commented 9 months ago

My intuition is that a filebrowser-esque piece of software should not allow running javascript on the pages it serves. I think of a filebrowser similarly to the GitHub website: it serves files as files, not as "apps."

I agree, and I would prefer for us to keep using FileBrowser in orthodox ways. The only reason I have nevertheless tried to find a way to get FileBrowser to serve HTML pages that can load external resources is that we had not yet discovered a different way to provide persistent links to resources like the GCCD output.html file. But I appreciate you pushing us to think about the workflow in a different way here:

I wonder if there is a world in which the GCV app itself hosts output.html, from side files (binary images, GeoJSON) that are in blob storage.

I think this is actually a very reasonable thing to consider. There is no inherent reason for GCCD to generate output.html. The only value add that I can think of is that a user may want to download output.html as a standalone HTML file, but (1) we should first validate that need with our partners, as I'm not convinced it's needed, and (2) that should not be difficult to offer as a download option in GCV.

In Vue, this could be a new component with a route /alert_maps/{alert_id} (or similar), which we can link to from within the new alerts dashboard component. I think this would be a more elegant approach than the middle ground you mentioned.

I will scope this approach and circle back, but I expect it will work!

IamJeffG commented 9 months ago

I'll put my work on this issue on hold until we have a plan.

One idea is to use the "middle ground" approach (wrapping the extant output.html) as Milestone 1, since from the outside it looks the same as our final goal but requires less effort to get there. Then follow up with Milestone 2: porting the creation of output.html from GCCD to GCV.

In any event, I'm thinking that we prefer GCV to have credentials to blob storage and read these files using the blob storage client (or, easier, a volume mount), instead of making GCV continue to go through FileBrowser for this. Would you agree that we can cut out FileBrowser from this flow?

But even if we do cut out FileBrowser from this flow, I think you still want the initial ask, which is persistence of share links, for other GCV integrations like Mapeo images, right?

rudokemper commented 9 months ago

The current process for embedding Mapeo images is the same as that for alert resources: provide a URI to a FileBrowser instance with a share hash. We could use the same flow you are proposing for embedding or linking to any and all data lake resources in GCV, since it's a way to have both persisting links and authentication for data lake files. Given that, then I don't know if the initial ask for persisting links from FileBrowser (or a similar tool) is still necessary.

One possible argument in favor of still having this flow involving FileBrowser is that a future GCV admin (from one of the partner communities / organizations, ITU-3 or NIA persona) might prefer it, as they have visibility and control over the process of generating the share hash in FileBrowser, and therefore the flow might be more legible to them. But that is also a still-distant future in which GCV has something like an admin panel for configuration, instead of being hydrated by environment variables set in Azure (something Jeff or Rudo need to do). And for that scenario, we can propose a similar-ish flow, assuming that FileBrowser can still access the same blob storage: (1) log in to FileBrowser, (2) get the name of the directory you want to provide to GCV, and (3) provide the dir name to your desired table in NUXT_ENV_VIEWS_CONFIG. The root URI to the volume mount could be a global GCV variable, so the user doesn't have to worry about it once set.
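To make that concrete, a sketch of what the environment variables might look like (the values and the config schema here are hypothetical; only the NUXT_ENV_VIEWS_CONFIG name comes from the flow above):

```
# Hypothetical values -- the real NUXT_ENV_VIEWS_CONFIG schema is defined by GCV
NUXT_ENV_VIEWS_CONFIG='{"alerts": {"mediaPath": "change_detection/alerts"}}'
# Global root of the volume mount, set once so users don't have to worry about it
NUXT_ENV_MEDIA_ROOT='/mnt/datalake'
```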

Let me know if that makes sense. If it does, then I think the initial ask is void, and your proposed solution will work to resolve this issue.

IamJeffG commented 9 months ago

Having slept on this, I think GCV should use volume mounts, not FileBrowser, if that will be possible. So (and I think this is what you are saying too) the blob storage container is mounted at a local directory name on the machine running GCV, and we tell the app that path using the environment variable.

This would also work nicely with an eventual offline deployment using docker-compose, which allows sharing volume mounts across containers. Analogous volume mounts are also supported by the major cloud providers we are considering.
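As an illustrative docker-compose sketch of the shared-mount pattern (service, image, and volume names are assumptions, not our actual deployment):

```yaml
# Hypothetical sketch: one named volume shared by GCV and FileBrowser
services:
  gcv:
    image: guardianconnector/views   # illustrative image name
    volumes:
      - datalake:/mnt/datalake
  filebrowser:
    image: filebrowser/filebrowser
    volumes:
      - datalake:/srv                # FileBrowser serves /srv by default
volumes:
  datalake:
```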

A possible downside is that GCV needs to be aware of the expected file structure and do its own exploration. But my guess is you already have GCV doing something analogous to that, just using share links instead of local folders.

How are you feeling about all this?


Even if we move forward with that plan, I can still imagine a dozen reasons it would be lovely for user accounts, settings, and sharelinks in FileBrowser to not get lost upon reboot! I think we should still fix the initial ask, but it just became less urgent.

bad news about filestash: no azure support

It turns out this doesn't matter. We already use the same volume mount pattern for FileBrowser itself, so we don't need to use the Azure Storage APIs. As such, Filestash is a contender (though it has some learning curve, I've found out!).

rudokemper commented 9 months ago

A possible downside is that GCV needs to be aware of the expected file structure and do its own exploration. But my guess is you already have GCV doing something analogous to that, just using share links instead of local folders.

That's right - the share link structure is in this format: https://{filebrowser-domain}/api/public/dl/{share_hash}, and what follows that base URI are the subdirectory paths for the folder being shared. These are handled by GCV. So for GCV, switching over would just be a matter of substituting the FileBrowser URI with a volume mount path, and possibly implementing the use of credentials as needed.
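A minimal sketch of that substitution (the helper name and config fields are hypothetical, not GCV's actual code):

```javascript
// Hypothetical sketch: resolve a media path via either a FileBrowser share
// link (current flow) or a volume mount (proposed flow).
function mediaUri(relativePath, config) {
  if (config.filebrowserDomain && config.shareHash) {
    return `https://${config.filebrowserDomain}/api/public/dl/${config.shareHash}/${relativePath}`;
  }
  return `${config.mountRoot}/${relativePath}`;
}

console.log(mediaUri('alerts/2023/11/scene.jpg', { mountRoot: '/mnt/datalake' }));
// -> /mnt/datalake/alerts/2023/11/scene.jpg
```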

I think we can proceed with this!

Even if we move forward with that plan, I can still imagine a dozen reasons it would be lovely for user accounts, settings, and sharelinks in FileBrowser to not get lost upon reboot! I think we should still fix the initial ask, but it just became less urgent.

Agreed on all counts. We can have it on our task list, but as a low-priority item. Partners have yet to actually use FileBrowser, and I'd first like to see if they find it to be useful as a tool in and of itself. (I expect they might, but I'm not sure how often they will use it, or whether they will end up leveraging settings, share links, etc.)

IamJeffG commented 8 months ago

Filestash definitely has a learning curve!

Filestash does not support Azure Blob Storage, but it does support Azure File Shares (which is what we use for the data lake) via the SMB (Samba) protocol:

(Screenshot from 2024-02-21 showing Filestash's supported storage backends.)

Currently I have it running where each user provides their own backend — not what we want. Next up I will see if I can configure the storage backend serverside (so users don't need to bring their own credentials) via config file, and then also hook up auth0 as an OpenID Connect (OIDC) provider.

That said, I am not sold that Filestash is worth it.

IamJeffG commented 8 months ago

Once you "get" how it works, it makes sense, but the documentation will not really help you get there.

Also OpenID (how we'd use auth0) is only available in the enterprise version.

At this point I'm going to take another look at FileBrowser and see if it can solve the problems mentioned in the Issue.

IamJeffG commented 8 months ago

The trick to get FileBrowser to save state between restarts was simple: per https://filebrowser.org/cli/filebrowser I set an environment variable FB_DATABASE to a location outside the docker container, i.e. on our volume-mounted file share. I've confirmed this works by creating a user, stopping/restarting the app, and logging in as that same user. Previously the user would have been deleted upon restart.
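For reference, the setting looks something like this (the path is illustrative; the FB_DATABASE variable itself is documented at the link above):

```
# Keep FileBrowser's state database outside the container, on the mounted share
FB_DATABASE=/mnt/datalake/filebrowser/filebrowser.db
```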

I am moving the question of auth0 support to its own GH issue: https://github.com/ConservationMetrics/superset-deployment/issues/21

IamJeffG commented 8 months ago

I've added the FB_DATABASE env var to all our deployments. (This will reset all their share links and user accounts, but I don't think anybody was using them anyway.)

At this point I think we can close this issue; please re-open if there are any disagreements or questions.

rudokemper commented 8 months ago

Thanks for looking into this! I can confirm that share links are persisting across restarts.

The CSP concern is no longer salient since I've already figured out a different solution to show before/after images on GCV.