Open kineticsquid opened 2 months ago
Some context: https://github.com/eclipse/openvsx/issues/432
I've deployed a proof of concept to staging where resource files are extracted from the vsix package on the fly. The response to the initial request is slower, but reasonable (2 - 3 seconds). The response is cached for 30 days, so subsequent requests are faster.
The upside is fewer rows in the file_resource table and fewer files in blob storage. The downside is slower response times and most likely higher bandwidth usage. This can be an acceptable trade-off if only a limited set of resource files is requested, e.g. 80% cached responses and 20% responses generated on the fly.
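To make the "extract on the fly" idea concrete, here is a minimal sketch of reading a single file out of a vsix package (a zip archive) without storing every entry up front. The class name VsixEntryReader, the method names, and the in-memory sample archive are hypothetical and only for illustration; this is not the code deployed to staging.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper, not part of the openvsx code base. */
final class VsixEntryReader {

    /** Returns the bytes of one entry of a vsix (zip) archive, or empty if it is absent. */
    static Optional<byte[]> readEntry(byte[] vsixBytes, String entryPath) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(vsixBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory() && entry.getName().equals(entryPath)) {
                    // readAllBytes() stops at the end of the current entry
                    return Optional.of(zip.readAllBytes());
                }
            }
            return Optional.empty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a tiny in-memory archive so the example runs without a real vsix file. */
    static byte[] sampleVsix() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("extension/package.json"));
            zip.write("{\"name\":\"demo\"}".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("extension/README.md"));
            zip.write("# Demo".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String readme = readEntry(sampleVsix(), "extension/README.md")
                .map(b -> new String(b, StandardCharsets.UTF_8))
                .orElse("<missing>");
        System.out.println(readme);
    }
}
```

The cost of this approach is a scan through the archive on every uncached request, which is the slower first response described above.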
@amvanbaren Thanks, this seems a reasonable approach. But before we go there, I'd like to understand a bit more about the use case(s).
\open-vsx.org\*\asset\... and \open-vsx.org\*\gallery\... Where are these calls coming from and why is the info returned by our API insufficient?
Is this specified in package.json like this?
"extensionKind": [
  "workspace",
  "web"
],
What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?
file_resource table in https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt, I can see things like Javascript dependencies. Is there a use case for this, or is it a byproduct of how we're processing the extensions?
Where are these calls coming from and why is the info returned by our API insufficient?
These calls are coming from VS Code based editors. The info returned by the API was insufficient, because it was only returned for extensions with web in their tag list.
Is this specified in package.json like this?
The tags list in the extension.vsixmanifest file was used.
What web resource files are extracted (presumably and entered in the file_resource table) and not extracted for other types of extensions?
All files in the extension folder were recursively added to the file_resource table. So basically all files in the vsix package.
I could imagine optimizations for extensions management in IDE UIs, but wouldn't that require only the files from the latest version?
This is not on the extension level. A resource is requested for a specific version of an extension.
Is there a use case for this?
Yes, it is to keep feature parity with the MS VS Code API: https://ms-python.vscode-unpkg.net/ms-python/python/2024.14.0/extension/out/client/node_modules/ https://open-vsx.org/vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/
@amvanbaren Thanks for the additional info, this is helpful. A couple of follow up questions.
Once an extension is installed, an editor presumably has all the files.
Yes, I think the desktop editor uses local files. It looks like this functionality is used by VS Code server deployments, like the Gitpod openvscode-server.
so are these calls made for information for extensions that are not installed in the IDE?
That could be possible too.
Does Theia make similar calls?
It can, through the /api/file/... endpoints by file path, but a quick look at the Theia source code makes me think it only uses predefined file urls (download, icon, manifest, etc.) https://github.com/eclipse-theia/theia/blob/19556f4d90c1b661ba53caea9b6a035a714e112d/dev-packages/ovsx-client/src/ovsx-types.ts#L198
@amvanbaren we believe we're seeing calls (not through the /file API) from Gitpod. What about VS Codium?
Ultimately I think we're going to need to sample our access logs again to get a better picture. Can you recommend a text filter to limit the entries we're looking for?
@amvanbaren Thanks, this is really helpful. It looks like these paths are defined here: https://github.com/eclipse/openvsx/blob/master/server/src/main/java/org/eclipse/openvsx/web/WebConfig.java.
Do all of these URLs and asset types result in references to the file_resource table?
I also noticed a path \documents. I can't figure out where that's processed. What does it return and does it also hit the file_resource table?
It seems like the way to get a handle on URLs (not part of the API) that cause references to the file_resource table would be to filter the access logs to \vscode and (maybe) \documents. That right?
WebConfig is for configuring extra features on top, like CORS and interceptors for mirror mode.
You can find the actual endpoints defined in: VSCodeAPI and RegistryAPI
The /documents endpoint serves static content, like the publisher agreement and terms of use.
@amvanbaren I think I understand. The ultimate goal is to reduce the size of the file_resource table. The next step will be to get another sample of the access logs that cause references to this table. Based on the above, I think what we want are all references to \api\...file and \vscode. That sound right?
Yes, sounds right. You can further narrow down /vscode requests to /vscode/asset and /vscode/unpkg.
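As a rough sketch of such a text filter, the following keeps only request paths that look like resource lookups. The exact path shapes (a /file/ segment somewhere under /api/, and anything under /vscode/asset/ or /vscode/unpkg/) are assumptions based on this thread, and the class ResourcePathFilter is hypothetical. It also assumes the request path has already been pulled out of each access log line.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical filter for request paths taken from the access logs. */
final class ResourcePathFilter {

    // Assumed path shapes: /vscode/asset/..., /vscode/unpkg/..., /api/.../file/...
    private static final Pattern RESOURCE_PATH =
            Pattern.compile("^/(vscode/(asset|unpkg)/.*|api/.*/file/.*)$");

    /** True if the path looks like a request that may hit the file_resource table. */
    static boolean isResourceRequest(String path) {
        return RESOURCE_PATH.matcher(path).matches();
    }

    /** Keeps only the paths that look like resource requests. */
    static List<String> filter(List<String> paths) {
        return paths.stream()
                .filter(ResourcePathFilter::isResourceRequest)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "/vscode/unpkg/ms-python/python/2024.14.0/extension/out/client/node_modules/",
                "/vscode/item",
                "/documents/terms-of-use.html",
                "/api/some-namespace/some-extension/1.0.0/file/README.md");
        filter(sample).forEach(System.out::println);
    }
}
```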
What about these? Do they not cause a file lookup?
"/vscode/item",
"/vscode/gallery/publishers/**",
/vscode/item redirects to the extension page in the webui: https://github.com/eclipse/openvsx/blob/20c549ebcd1f082638a4cd493413f9a527ab8f94/server/src/main/java/org/eclipse/openvsx/adapter/LocalVSCodeService.java#L354
/vscode/gallery/publishers/** returns a link to a vsix package. You could include it, but extension downloads are pretty non-negotiable. https://github.com/eclipse/openvsx/blob/20c549ebcd1f082638a4cd493413f9a527ab8f94/server/src/main/java/org/eclipse/openvsx/adapter/LocalVSCodeService.java#L380
As discussed with @kineticsquid, I am putting my thoughts about the caching implementation here, just to save them in a place where everybody can see them.
The question is about getting rid of the file_resource table and also the need to unpack the .vsix every time a new file is needed from it. I think we can reduce (or even avoid) the need to unpack the extension several times if files from it are requested within a short period of time. The idea is obviously to use some form of caching on the Java side. I see two viable options:
GuavaCache: https://github.com/google/guava/wiki/CachesExplained. It makes it easy to specify eviction policies to keep the cache small. The cache will be fast and fully under our control. The only drawback that I see is the need for a cache per Java application instance, so a .vsix could potentially be unpacked several times.
UNLOGGED tables (https://www.crunchydata.com/blog/postgresl-unlogged-tables). In this case the cache will be shared among the instances. Eviction could be implemented with a stored procedure run by pg_cron, for example. The drawback is a certain speed penalty, but it looks like the performance of UNLOGGED tables is quite good.
I haven't looked at scenarios that add the cache as a separate application, like Redis or memcached, because that might make the setup unnecessarily complex. I also don't know if Elastic could be used as key/value storage.
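To illustrate the first option, here is a minimal sketch of a per-instance, size-bounded cache that unpacks a package only on a cache miss. It deliberately uses only the JDK (a LinkedHashMap in access order) as a stand-in and does not use the Guava API; the class UnpackedVsixCache and the loader function are hypothetical. A Guava cache would add time-based eviction and finer-grained concurrency on top of the same idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache; a JDK-only stand-in for a Guava cache. */
final class UnpackedVsixCache<K, V> {

    private final Function<K, V> loader;
    private final Map<K, V> entries;

    UnpackedVsixCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true makes iteration order least-recently-used first
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the bound is exceeded
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, calling the loader (e.g. unpacking a .vsix) only on a miss. */
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        UnpackedVsixCache<String, String> cache =
                new UnpackedVsixCache<>(2, key -> "unpacked:" + key);
        System.out.println(cache.get("ns.ext-1.0.0"));
        System.out.println(cache.get("ns.ext-1.0.0")); // served from the cache
    }
}
```

Because each Java instance holds its own map, the same .vsix can still be unpacked once per instance, which is the drawback noted for this option.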
A separate topic is how to populate the cache(s), but an ExecutorService instance could help there.
The origin of this is https://github.com/EclipseFdn/open-vsx.org/issues/2317. Ultimately our objective is, for a variety of reasons, to keep the size of our DB manageable. The file_resource table is by far the largest. I poked at this a bit in a Gitpod workspace. In that sample workspace, there are 22 extensions, 40 versions, and a little over 99K entries in the file table. It looks like, in addition to the files listed in the extension package.json file, e.g. readme, download, license, icon, all the files included in the .vsix file are also listed in this table. The type is resource: https://raw.githubusercontent.com/kineticsquid/openvsx/master/output.txt.
Separately, looking at a sample of the open-vsx.org access logs, I can see /file API calls requesting these files.
I can understand the access to the icon, license, readme files for all the versions of an extension. I don't understand the logic that results in access requests to these other files. Unfortunately, I don't understand the code enough to figure out where these calls are coming from, UI or server.
@amvanbaren @filiptronicek @spoenemann Any insight or background on this?