Refactor document factory to support async parsing

msujew commented 10 months ago

Closes #273

This change allows adopters to create asynchronous parsers to prevent blocking the event loop when parsing large files. Note that the default implementation is still synchronous. Adopters have to supply the necessary worker threads/webworkers themselves.

Most things in this change are pretty backwards compatible. The only really breaking change is that getOrCreateDocument is now async and returns a promise. In addition to that method, the LangiumDocuments now expose getDocument(URI): LangiumDocument | undefined and createDocument (both sync and async).

sailingKieler commented 10 months ago

@msujew Could you give a brief summary on what you intended to change for which reason? That would simplify the review of the particular changes...

msujew commented 10 months ago

@sailingKieler The intended changes are:

Allow for async parsing in the document factory. While the parsing process isn't async in itself, we can improve performance of Langium by moving the parsing process over to worker threads/web workers. By doing this, parsing becomes async. Additionally, this gives us the option to interrupt parsing if we need to. This is very important for large files that might take multiple seconds to parse.

This means that all factory methods that somehow produce an AST from a string can now be async as well, assuming they use the new AsyncParser service. I've kept the sync versions in there for now.

As a consequence, we don't need the sync file loading anymore. If parsing is async, file loading can be async as well. It has always irked me that we had sync file loading in the framework in the first place. See also #273. It prevented parallel file loading during workspace startup and is generally viewed as an anti-pattern in modern JS development. Other file systems (such as loading a workspace via HTTP), also were async only, so you would always get an error if you tried to load a document through getOrCreateDocument as it relied on the sync file loading mechanism.

sailingKieler commented 10 months ago

@sailingKieler The intended changes are:

Allow for async parsing in the document factory. While the parsing process isn't async in itself, we can improve performance of Langium by moving the parsing process over to worker threads/web workers. By doing this, parsing becomes async. Additionally, this gives us the option to interrupt parsing if we need to. This is very important for large files that might take multiple seconds to parse.

This means that all factory methods that somehow produce an AST from a string can now be async as well, assuming they use the new AsyncParser service. I've kept the sync versions in there for now.

As a consequence, we don't need the sync file loading anymore. If parsing is async, file loading can be async as well. It has always irked me that we had sync file loading in the framework in the first place. See also #273. It prevented parallel file loading during workspace startup and is generally viewed as an anti-pattern in modern JS development. Other file systems (such as loading a workspace via HTTP), also were async only, so you would always get an error if you tried to load a document through getOrCreateDocument as it relied on the sync file loading mechanism.

@msujew Thank's a lot for these explanations. 🙏 What I was also interested in was where you basically put on the saw (probably in the document builder) and what the resulting consequences are.

msujew commented 10 months ago

@sailingKieler Aside from the changes in the document factory, there is very little change. The most major one is the change to getOrCreateDocument in the LangiumDocuments service, which is now just async. All sync callers can use the new getDocument(): LangiumDocument | undefined instead and call createDocument() explicitly if they need to.

msujew commented 9 months ago

Shall we add an implementation of AsyncParser with Node.js worker threads in src/node to make the benefit of this change more apparent?

@spoenemann I'd love to, but I don't have time for this before the holidays/the 3.0 release. I'd postpone this implementation until 3.1.0. This won't be a breaking change, as it only adds a new implementation.

Otherwise, I've incorporated all your changes 👍

spoenemann commented 9 months ago

Shall we add an implementation of AsyncParser with Node.js worker threads in src/node to make the benefit of this change more apparent?

@spoenemann I'd love to, but I don't have time for this before the holidays/the 3.0 release. I'd postpone this implementation until 3.1.0. This won't be a breaking change, as it only adds a new implementation.

The only issue with this is: are we sure that we won't run into blockers when we implement a parser worker pool? For example, how will we solve the data transfer of AST and CST?

msujew commented 9 months ago

@spoenemann Since we have postponed 3.0 to after the holidays, I'll probably integrate the implementation before the release. However, I would prefer to do this in a separate PR. This PR is large enough as it is.

sailingKieler commented 9 months ago

I'm fine with the changes except for the distributed file system accesses. I would appreciate if you could consolidate them, ideally within the DocumentFactory.

msujew commented 9 months ago

@sailingKieler I've added a new LangiumDocumentFactory#fromUri method that consolidates all the disk access into one place.

cdietrich commented 7 months ago

i wonder if we can have a readFileSync readDirSync back in the FileSystem provider this would at least allow to mimic the old getOrCreateDocument more easily

spoenemann commented 7 months ago

We could, but those operations are harder to implement in the frontend or other context where async is the default.

cdietrich commented 7 months ago

my problem is that exchaning the FileSystemProvider with a own subclass is pita. or should we subclass

EmptyFileSystemProvider, NodeFileSystemProvider,

create respective copies of NodeFileSystem, EmptyFileSystem.

and at usageplace just downcast to our subinterface?

cdietrich commented 7 months ago

another question: why is there no userdata in index entires. our main usecase to resolve during scoping is cause we can not filter pure on what is in AstNodeDescription

msujew commented 7 months ago

why is there no userdata in index entires. our main usecase to resolve during scoping is cause we can not filter pure on what is in AstNodeDescription

We had no need for it yet, and since the AstNodeDescriptions are just plain JS objects, you can always extend/cast them to add new fields. I'd be fine with adding a data: any field to all AstNodeDescriptions.

i wonder if we can have a readFileSync readDirSync back in the FileSystem provider

I'd be open for it. I'm wondering though in which case you need a sync read, when the document builder itself is async? Can you explain that a bit?

cdietrich commented 7 months ago

no in the scope provider we do:

for astdescription in index, call getOrCreateDocument, but the ast nodes into a stream. filter based on our critera, create scope based on that.

msujew commented 7 months ago

Isn't getDocument enough for that case? Do you have AstDescription objects in your index for unloaded documents?

cdietrich commented 7 months ago

Get document might not find it. At least in some cases. Did not look into that in detail yet. Will do

cdietrich commented 7 months ago

looks like so far it was a set of test data error anyway (that somehow worked in 2x). works as expected if that one is fixed need to do more tests though

eclipse-langium / langium

Refactor document factory to support async parsing #1306