load in-memory source text for multiple files

mlin commented 5 years ago

WDL.load() accepts a source_text string as an alternative to reading from the filename/URI, which is useful for Language Server Protocol to analyze code not necessarily written to disk. But if the WDL imports other files, right now they still need to be written to disk.

We'd like there to be a way to pass in multiple in-memory source files which may import amongst themselves. This might take the form of a Dict[str,str] from filenames to source texts; or some kind of abstract SourceLoader interface, which could envelop path and the current import_uri callback. Make it asynchronous while we're at it??

Subtleties:

filename resolution -- the imports might use absolute or relative filenames, or depend on a directory path to search. How to ensure we always get the right source text for a given import path, without relying on the filesystem?
still need to be able to read imported files from disk if they don't happen to be present among the in-memory source texts (e.g. open files in IDE)
can we avoid repeated processing of imported in-memory files? (LSP doesn't necessarily designate one file as the 'focal' point from which to start processing)

dinvlad commented 5 years ago

w.r.t. LSP, we’ll likely use workspace/didChangeWatchedFiles capability, where the client periodically sends the list of “watched” files to the server. Then we set up each client to watch for all .wdl files in the workspace, and this will inform the server which file paths to take into account.

dinvlad commented 5 years ago

Happy to report our initial implementation of in-memory parsing for LSP: https://github.com/broadinstitute/wdl-ide/blob/bf064a0c8f093e4d1c984e561f0bd1eea1009aa0/wdl-language-support/server/server.py#L135-L169

The server:

1) Listens to the initial textDocument/didOpen notification from the client, and caches the received contents of the document into a Dict[uri, List[line]].

2) Listens to textDocument/didChange notifications from the client, which send incremental updates for the document. The server adjusts dict[uri] text accordingly on every update. Then, it runs WDL.load(doc.uri, source_text=doc.text) asynchronously with a debounce Timer, so that the new content gets re-validated only when there were no more changes sent over a given timeout (1 sec currently).

3) Listens to the final textDocument/didClose notification from the client, and removes the cached content from memory then.

dinvlad commented 5 years ago

It turns out that pygls already maintains a registry of open documents on the server, which we can access easily with ls.workspace.get_document(uri).source

dinvlad commented 5 years ago

I've now also enabled auto-resolution of WDL paths [1]:

1) Initially, the server takes a WDL URI and finds "workspace root" folders that are prefixes of this URI. It then enumerates all folders containing *.wdl files within those roots. Finally, it caches this set of folders for each root.

2) Upon this and subsequent invocations, the server passes the relevant list of folders from (1) as path argument to WDL.load(). This results in successful resolution of all imports.

3) The client watches for created/deleted *.wdl files in all workspaces, and sends such changes to the server. The server then updates the relevant cache of folders from (1).

As a result, we've achieved efficient and automatic resolution of WDL imports. In the Optimus pipeline, initial path resolution takes < 20 ms. In a bigger project with tons of node_modules folders, it takes ~ 0.8 sec. Cached resolution takes << 1 ms in both cases.

@rexwangcc @mohawktrail

mlin commented 5 years ago

@dinvlad I'm thinking about making a new version of WDL.load() into which one could pass an async function with this signature:

async def read_source(uri: str, path: List[str], importer_uri: Optional[str]) -> str:
    ...

where

path is the user-provided search directories if any
importer_uri is the filename/URI of the document from which the requested document is being imported if any
async just keeps it maximally flexible for the future

The default implementation of course will just treat uri as a filename, open and read it. But you could override it to intercept requests for files with unsaved edits in the IDE.

Naturally I gravitated to this because of how it punts the aforementioned complications to you :sweat_smile:. Do you think it would work?

dinvlad commented 5 years ago

Thanks - so if I understand it correctly, this will allow us to do something like:

WDL.load(
  wdl_uri,
  path = paths,
  read_source = lambda uri: ls.workspace.get_document(uri).source,
)

In this particular example, we just ignore path and importer_uri args to read_source(), because our resolution function only depends on the uri.

We also assume that read_source() will be applied to all URIs, including wdl_uri in the example (so we no longer have to pass source_text).

If the above holds, then this seems like a good match for our needs!

chanzuckerberg / miniwdl

load in-memory source text for multiple files #177