eclipse / openvsx

An open-source registry for VS Code extensions
https://open-vsx.org/
Eclipse Public License 2.0

Are we managing extension files efficiently #980

Open kineticsquid opened 2 months ago

kineticsquid commented 2 months ago

The origin of this is https://github.com/EclipseFdn/open-vsx.org/issues/2317. Ultimately our objective is, for a variety of reasons, to keep the size of our DB manageable. The file_resource table is by far the largest. I poked at this a bit in a Gitpod workspace. In that sample workspace, there are 22 extensions, 40 versions, and a little over 99K entries in the file table. It looks like, in addition to the files listed in the extension package.json file, e.g. readme, download, license, icon, all the files included in the .vsix file are also listed in this table. The type is resource: https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt.

Separately, looking at a sample of the open-vsx.org access logs, I can see /file API calls requesting these files.

I can understand the access to the icon, license, readme files for all the versions of an extension. I don't understand the logic that results in access requests to these other files. Unfortunately, I don't understand the code enough to figure out where these calls are coming from, UI or server.

@amvanbaren @filiptronicek @spoenemann Any insight or background on this?

amvanbaren commented 2 months ago

Some context: https://github.com/eclipse/openvsx/issues/432

I've deployed a proof of concept to staging where resource files are extracted from the vsix package on the fly. The response to the initial request is slower, but reasonable (2 - 3 seconds). The response is cached for 30 days, so subsequent requests are faster. The upside is fewer rows in the file_resource table and fewer files in blob storage. The downside is slower response times and most likely higher bandwidth usage. This can be an acceptable trade-off if only a limited set of resource files is requested, e.g. 80% cached responses and 20% responses generated on the fly.
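A minimal sketch of the idea described above, written in Python for brevity rather than the server's Java. It is not the proof-of-concept code: `OnTheFlyResourceCache`, `load_vsix`, and the in-memory dictionary are hypothetical stand-ins, and the real deployment may well cache at a different layer (for example in front of the server).

```python
import io
import time
import zipfile

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, the cache lifetime mentioned above


class OnTheFlyResourceCache:
    """Read a file out of a .vsix on request and remember the response."""

    def __init__(self, load_vsix, ttl=CACHE_TTL_SECONDS, clock=time.time):
        # load_vsix is a hypothetical callable:
        # (namespace, name, version) -> bytes of the .vsix package
        self._load_vsix = load_vsix
        self._ttl = ttl
        self._clock = clock
        self._entries = {}    # (namespace, name, version, path) -> (expires_at, content)
        self.extractions = 0  # how many times the slow path actually ran

    def get(self, namespace, name, version, path):
        key = (namespace, name, version, path)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fast path: cached and not yet expired

        # Slow path: open the package (a zip archive) and read the one file.
        data = self._load_vsix(namespace, name, version)
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            try:
                content = archive.read(path)
            except KeyError:
                return None  # no such file in the package
        self.extractions += 1
        self._entries[key] = (now + self._ttl, content)
        return content


def build_sample_vsix():
    """Build a tiny in-memory zip that stands in for a real .vsix package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("extension/package.json", '{"name": "demo"}')
    return buffer.getvalue()
```

With this shape, only the first request for a given file pays the extraction cost; nothing is stored per file in the database.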

kineticsquid commented 2 months ago

@amvanbaren Thanks, this seems a reasonable approach. But before we go there, I'd like to understand a bit more about the use case(s).

amvanbaren commented 2 months ago

Where are these calls coming from and why is the info returned by our API insufficient?

These calls are coming from VS Code based editors. The info returned by the API was insufficient, because it was only returned for extensions with web in their tag list.

Is this specified in package.json like this?

The tags list in the extension.vsixmanifest file was used.

What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?

All files in the extension folder were recursively added to the file_resource table. So basically all files in the vsix package.

I could imagine optimizations for extension management in IDE UIs, but wouldn't that require only the files from the latest version?

This is not on the extension level. A resource is requested for a specific version of an extension.

Is there a use case for this?

Yes, it is to keep feature parity with the MS VS Code API: https://ms-python.vscode-unpkg.net/ms-python/python/2024.14.0/extension/out/client/node_modules/ https://open-vsx.org/vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/

kineticsquid commented 2 months ago

@amvanbaren Thanks for the additional info, this is helpful. A couple of follow up questions.

amvanbaren commented 2 months ago

Once an extension is installed, an editor presumably has all the files.

Yes, I think the desktop editor uses local files. It looks like this functionality is used by VS Code server deployments, like the Gitpod openvscode-server.

so are these calls made for information for extensions that are not installed in the IDE?

That could be possible too.

Does Theia make similar calls?

It can through the /api/file/... endpoints by file path, but a quick look at the Theia source code makes me think it only uses predefined file urls (download, icon, manifest, etc.) https://github.com/eclipse-theia/theia/blob/19556f4d90c1b661ba53caea9b6a035a714e112d/dev-packages/ovsx-client/src/ovsx-types.ts#L198

kineticsquid commented 2 months ago

@amvanbaren we believe we're seeing calls (not through the /file API) from Gitpod. What about VS Codium?

Ultimately I think we're going to need to sample our access logs again to get a better picture. Can you recommend a text filter to limit the entries we're looking for?

amvanbaren commented 2 months ago

File endpoints

- `/api/{namespace}/{extension}/{version}/file/**`
- `/api/{namespace}/{extension}/{targetPlatform}/{version}/file/**`

The last part of the url can be a file type (`download`, `icon`, `license`) or a file path (e.g. `extension/package.json`). Here it is pretty hard to make a distinction between calls that return a resource and calls that return another file type.

**regex:**
```
\/api\/[\w\-\+\$~]+\/[\w\-\+\$~]+(\/[\w\-\+\$~]+)?\/[\w\-\+\$\.~]+\/file\/.*
```

Resource endpoint

- `/vscode/unpkg/{namespaceName}/{extensionName}/{version}/**`

Every call to this endpoint uses the resource file type.

**regex:**
```
\/vscode\/unpkg\/.*
```

VSIX package download endpoint

- `/vscode/gallery/publishers/{namespaceName}/vsextensions/{extensionName}/{version}/vspackage`

Returns redirect to download vsix package.

**regex:**
```
\/vscode\/gallery\/publishers\/[\w\-\+\$~]+\/vsextensions\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/vspackage
```

Asset endpoint

- `/vscode/asset/{namespaceName}/{extensionName}/{version}/{assetType}/**`

Returns asset file.

**regex to get any asset:**
```
\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/Microsoft\.VisualStudio\.((Services\.((Content\.(Details|Changelog|License))|Icons\.Default|VSIXPackage|VsixManifest|VsixSignature|PublicKey))|(Code\.(Manifest|WebResources)))\/.*
```

**regex to get only resources:**
```
\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/Microsoft\.VisualStudio\.Code\.WebResources\/.*
```

other asset types:

- `Microsoft.VisualStudio.Services.Content.Details`
- `Microsoft.VisualStudio.Services.Content.Changelog`
- `Microsoft.VisualStudio.Services.Content.License`
- `Microsoft.VisualStudio.Services.Icons.Default`
- `Microsoft.VisualStudio.Services.VSIXPackage`
- `Microsoft.VisualStudio.Services.VsixManifest`
- `Microsoft.VisualStudio.Services.VsixSignature`
- `Microsoft.VisualStudio.Services.PublicKey`
- `Microsoft.VisualStudio.Code.Manifest`
- `Microsoft.VisualStudio.Code.WebResources`
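The patterns above could be applied to a list of request paths roughly as follows. This is only a sketch in Python: the regexes are copied from this comment, while the labels, the `classify` and `summarize` helpers, and the assumption that the request path has already been pulled out of each access log line are illustrative.

```python
import re
from collections import Counter

# Checked in order, so the resource-only asset pattern wins over the generic one.
PATTERNS = [
    ("unpkg resource", re.compile(r"\/vscode\/unpkg\/.*")),
    ("asset resource", re.compile(
        r"\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+"
        r"\/Microsoft\.VisualStudio\.Code\.WebResources\/.*")),
    ("asset other", re.compile(
        r"\/vscode\/asset\/[\w\-\+\$~]+\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+"
        r"\/Microsoft\.VisualStudio\.((Services\.((Content\.(Details|Changelog|License))"
        r"|Icons\.Default|VSIXPackage|VsixManifest|VsixSignature|PublicKey))"
        r"|(Code\.(Manifest|WebResources)))\/.*")),
    ("vsix download", re.compile(
        r"\/vscode\/gallery\/publishers\/[\w\-\+\$~]+\/vsextensions"
        r"\/[\w\-\+\$~]+\/[\w\-\+\$\.~]+\/vspackage")),
    ("file endpoint", re.compile(
        r"\/api\/[\w\-\+\$~]+\/[\w\-\+\$~]+(\/[\w\-\+\$~]+)?"
        r"\/[\w\-\+\$\.~]+\/file\/.*")),
]


def classify(path):
    """Return the label of the first pattern matching the start of the path."""
    for label, pattern in PATTERNS:
        if pattern.match(path):
            return label
    return None  # not one of the endpoints listed above


def summarize(paths):
    """Count requests per endpoint label, ignoring unrelated paths."""
    labels = (classify(path) for path in paths)
    return Counter(label for label in labels if label is not None)
```

Note that "file endpoint" still mixes resource requests with `download`, `icon`, and `license` requests, for the reason given above; telling them apart would need a further look at the last part of the URL.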
kineticsquid commented 2 months ago

@amvanbaren Thanks, this is really helpful. It looks like these paths are defined here: https://github.com/eclipse/openvsx/blob/master/server/src/main/java/org/eclipse/openvsx/web/WebConfig.java.

Do all of these URLs and asset types result in references to the file_resource table?

I also noticed a path /documents. I can't figure out where that's processed. What does it return and does it also hit the file_resource table?

It seems the way to get a handle on URLs (not part of the API) that cause references to the file_resource table would be to filter the access logs to /vscode and (maybe) /documents. Is that right?

amvanbaren commented 2 months ago

WebConfig is to configure extra features on top, like CORS and interceptors for mirror mode. You can find the actual endpoints defined in: VSCodeAPI and RegistryAPI

The /documents endpoint serves static content, like the publisher agreement and terms of use.

kineticsquid commented 2 months ago

@amvanbaren I think I understand. The ultimate goal is to reduce the size of the file_resource table. The next step will be to get another sample of the access log entries that cause references to this table. Based on the above, I think what we want are all references to /api/.../file and /vscode. Does that sound right?

amvanbaren commented 2 months ago

Yes, sounds right. You can further narrow down /vscode requests to /vscode/asset and /vscode/unpkg.

kineticsquid commented 2 months ago

What about these? Do they not cause a file lookup?

    "/vscode/item",
    "/vscode/gallery/publishers/**",

amvanbaren commented 2 months ago

/vscode/item redirects to the extension page in the webui: https://github.com/eclipse/openvsx/blob/20c549ebcd1f082638a4cd493413f9a527ab8f94/server/src/main/java/org/eclipse/openvsx/adapter/LocalVSCodeService.java#L354

/vscode/gallery/publishers/** returns a link to a vsix package. You could include it, but extension downloads are pretty non-negotiable. https://github.com/eclipse/openvsx/blob/20c549ebcd1f082638a4cd493413f9a527ab8f94/server/src/main/java/org/eclipse/openvsx/adapter/LocalVSCodeService.java#L380

kkistm commented 1 month ago

As discussed with @kineticsquid, I am putting my thoughts about the caching implementation here, to keep them somewhere everybody can see them.

The question is about getting rid of the file_resource table and also about the need to unpack the .vsix every time a new file is needed from it. I think we can reduce (or even avoid) the need to unpack the extension several times if files from it are requested within a short period of time. The idea is obviously to use some form of caching on the Java side. I see two viable options:

I haven't looked at scenarios that add a cache as a separate application, like Redis or memcached, because that might make the setup unnecessarily complex. I also don't know if Elastic could be used as key/value storage. A separate topic is how to populate the cache(s), but some ExecutorService instance could help there.
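To make the general idea concrete, here is a minimal sketch of an in-process cache that unpacks a package once and then serves further files from memory. It is written in Python for brevity rather than Java, and everything in it is an assumption for illustration: the `UnpackedVsixCache` name, the `load_vsix` callable, the least-recently-used eviction, and the limit of 16 archives. It is also not thread-safe as written.

```python
import io
import zipfile
from collections import OrderedDict


class UnpackedVsixCache:
    """Keep the contents of recently used .vsix packages in memory (LRU)."""

    def __init__(self, load_vsix, max_archives=16):
        # load_vsix is a hypothetical callable: key -> bytes of the .vsix package
        self._load_vsix = load_vsix
        self._max_archives = max_archives
        self._archives = OrderedDict()  # key -> {path inside package: content}
        self.unpack_count = 0           # how many times a package was unpacked

    def get_file(self, key, path):
        files = self._archives.get(key)
        if files is None:
            # First request for this package: unpack it once.
            files = self._unpack(self._load_vsix(key))
            self._archives[key] = files
            if len(self._archives) > self._max_archives:
                self._archives.popitem(last=False)  # evict least recently used
        else:
            self._archives.move_to_end(key)  # mark as most recently used
        return files.get(path)  # None if the package has no such file

    def _unpack(self, data):
        self.unpack_count += 1
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return {
                info.filename: archive.read(info)
                for info in archive.infolist()
                if not info.is_dir()
            }
```

Several requests for files from the same package within a short period then cost a single unpack, at the price of holding whole packages in memory, which is why the number of cached archives is bounded.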