eclipse / openvsx

An open-source registry for VS Code extensions
https://open-vsx.org/
Eclipse Public License 2.0

Are we managing extension files efficiently #980

Open kineticsquid opened 1 week ago

kineticsquid commented 1 week ago

This issue originates from https://github.com/EclipseFdn/open-vsx.org/issues/2317. Ultimately our objective is, for a variety of reasons, to keep the size of our DB manageable. The file_resource table is by far the largest. I poked at this a bit in a Gitpod workspace. In that sample workspace, there are 22 extensions, 40 versions, and a little over 99K entries in the file table. It looks like, in addition to the files listed in the extension's package.json file (e.g. readme, download, license, icon), all the files included in the .vsix file are also listed in this table, with the type resource: https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt.

Separately, looking at a sample of the open-vsx.org access logs, I can see /file API calls requesting these files.

I can understand the access to the icon, license, and readme files for all the versions of an extension. I don't understand the logic that results in access requests to these other files. Unfortunately, I don't understand the code well enough to figure out where these calls are coming from, UI or server.

@amvanbaren @filiptronicek @spoenemann Any insight or background on this?

amvanbaren commented 1 week ago

Some context: https://github.com/eclipse/openvsx/issues/432

I've deployed a proof of concept to staging where resource files are extracted from the vsix package on the fly. The response to the initial request is slower, but reasonable (2 - 3 seconds). The response is cached for 30 days, so subsequent requests are faster. The upside is fewer rows in the file_resource table and fewer files in blob storage. The downside is slower response times and most likely higher bandwidth usage. This can be an acceptable trade-off if only a limited set of resource files is requested, e.g. 80% cached responses and 20% responses generated on the fly.
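
To make the trade-off concrete, here is a minimal sketch of on-the-fly extraction with a time-limited cache. It is not the openvsx implementation (the server is written in Java). The class name `ResourceCache`, the in-memory dictionary, and the `package_id` key are all hypothetical. Only the 30-day lifetime and the idea of reading a single entry from the .vsix (a ZIP archive) come from the description above.

```python
import io
import time
import zipfile

CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days, as in the proof of concept


class ResourceCache:
    """Hypothetical in-memory cache keyed by (package id, path inside the package)."""

    def __init__(self, ttl=CACHE_TTL_SECONDS, clock=time.time):
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (expiry timestamp, file bytes)
        self.hits = 0
        self.misses = 0

    def get(self, package_id, vsix_bytes, path):
        key = (package_id, path)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Cache miss: open the archive and read only the requested entry.
        self.misses += 1
        with zipfile.ZipFile(io.BytesIO(vsix_bytes)) as archive:
            data = archive.read(path)  # raises KeyError if the path is absent
        self._entries[key] = (now + self._ttl, data)
        return data
```

The `hits` and `misses` counters give the cached versus on-the-fly ratio directly, which is the number that decides whether the trade-off is acceptable.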

kineticsquid commented 1 week ago

@amvanbaren Thanks, this seems a reasonable approach. But before we go there, I'd like to understand a bit more about the use case(s).

amvanbaren commented 1 week ago

Where are these calls coming from and why is the info returned by our API insufficient?

These calls are coming from VS Code based editors. The info returned by the API was insufficient, because it was only returned for extensions with web in their tag list.

Is this specified in package.json like this?

The tags list in the extension.vsixmanifest file was used.

What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?

All files in the extension folder were recursively added to the file_resource table. So basically all files in the vsix package.
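
For a rough sense of how many rows this produces per version, the following sketch lists every file under the extension folder of a .vsix package. It assumes the folder is named `extension/` inside the archive, as the resource URLs in this thread suggest. The function name `list_resource_paths` is hypothetical and is not part of the openvsx code base.

```python
import io
import zipfile


def list_resource_paths(vsix_bytes, prefix="extension/"):
    """Return every file path under `prefix` in a .vsix (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(vsix_bytes)) as archive:
        return sorted(
            info.filename
            for info in archive.infolist()
            # Skip directory entries and files outside the extension folder.
            if info.filename.startswith(prefix) and not info.is_dir()
        )
```

Calling `len(list_resource_paths(data))` for each published version gives an estimate of the resource rows that version adds to the table.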

I could imagine optimizations for extension management in IDE UIs, but wouldn't that require only the files from the latest version?

This is not on the extension level. A resource is requested for a specific version of an extension.

Is there a use case for this?

Yes, it is to keep feature parity with the MS VS Code API:
https://ms-python.vscode-unpkg.net/ms-python/python/2024.14.0/extension/out/client/node_modules/
https://open-vsx.org/vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/
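
The Open VSX URL above appears to follow the pattern /vscode/unpkg/{namespace}/{extension}/{version}/{path}. That pattern is inferred from this one example only and is not taken from the server's route definitions. Under that assumption, a request can be split into its parts as sketched below. `ResourceRequest` and `parse_unpkg_url` are hypothetical names.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote, urlsplit

UNPKG_PREFIX = "/vscode/unpkg/"  # taken from the example URL; assumed stable


class ResourceRequest(NamedTuple):
    namespace: str
    extension: str
    version: str
    path: str


def parse_unpkg_url(url: str) -> Optional[ResourceRequest]:
    """Split an unpkg-style URL into namespace, extension, version and file path."""
    path = unquote(urlsplit(url).path)
    if not path.startswith(UNPKG_PREFIX):
        return None
    parts = path[len(UNPKG_PREFIX):].split("/", 3)
    if len(parts) < 3 or not all(parts[:3]):
        return None
    resource = parts[3] if len(parts) == 4 else ""
    return ResourceRequest(parts[0], parts[1], parts[2], resource)
```

The version is part of the key, which matches the point that a resource is requested for a specific version rather than at the extension level.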

kineticsquid commented 1 week ago

@amvanbaren Thanks for the additional info, this is helpful. A couple of follow-up questions.

amvanbaren commented 6 days ago

Once an extension is installed, an editor presumably has all the files.

Yes, I think the desktop editor uses local files. It looks like this functionality is used by VS Code server deployments, like the Gitpod openvscode-server.

so are these calls made for information for extensions that are not installed in the IDE?

That could be possible too.

Does Theia make similar calls?

It can, through the /api/file/... endpoints by file path, but a quick look at the Theia source code makes me think it only uses predefined file URLs (download, icon, manifest, etc.) https://github.com/eclipse-theia/theia/blob/19556f4d90c1b661ba53caea9b6a035a714e112d/dev-packages/ovsx-client/src/ovsx-types.ts#L198

kineticsquid commented 11 hours ago

@amvanbaren we believe we're seeing calls (not through the /file API) from Gitpod. What about VSCodium?

Ultimately I think we're going to need to sample our access logs again to get a better picture. Can you recommend a text filter to limit the entries we're looking for?
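
As one possible starting point for such a filter, the sketch below tallies requests whose path contains /vscode/unpkg/, the prefix seen in the resource URL quoted earlier in this thread. The access log format is not shown here, so the code only assumes that each line contains the request path somewhere. The regular expression and the function name `tally_resource_requests` are illustrative, not a confirmed recommendation.

```python
import re
from collections import Counter

# Matches /vscode/unpkg/{namespace}/{extension}/{version}/{path} anywhere in a line.
UNPKG_RE = re.compile(r"/vscode/unpkg/([^/\s]+)/([^/\s]+)/([^/\s]+)/(\S*)")


def tally_resource_requests(log_lines):
    """Count unpkg-style resource requests per (namespace, extension, version)."""
    counts = Counter()
    for line in log_lines:
        match = UNPKG_RE.search(line)
        if match:
            namespace, extension, version, _path = match.groups()
            counts[(namespace, extension, version)] += 1
    return counts
```

Sorting the result with `counts.most_common()` shows which extension versions account for most of the resource traffic.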