asciidoctor / asciidoctor.js

:scroll: A JavaScript port of Asciidoctor, a modern implementation of AsciiDoc
https://asciidoctor.org
MIT License
719 stars 131 forks source link

fetch includes asynchronously (in-browser / client-side) #630

Open tigregalis opened 5 years ago

tigregalis commented 5 years ago

I'd like to fetch documents and includes asynchronously (in-browser), using fetch.

By default, it uses synchronous XMLHttpRequest, which locks the browser and also give us a warning:

Synchronous XMLHttpRequest on the main thread is deprecated because of its detrimental effects to the end user's experience. For more help http://xhr.spec.whatwg.org/    asciidoctor.js:23436:8 

That's why I'd like to suggest a loadAsync / convertAsync API for Asciidoctor.js.

Perhaps a related suggestion is a way for the user to specify their own http client, e.g. if someone wishes to use fetch, axios, or node-fetch.

As a really convoluted work-around / proof-of-concept, I've managed to use two include processor extensions and Promises to progressively "load" the document. Right now this will only go down to one level of includes. There are a number of other issues of course, but primarily is that it's loading/parsing the document more than once.

const asciidoctor = Asciidoctor();

const root = document.getElementById('root');

function promiseObject(object) {

  let promisedProperties = [];

  const objectKeys = Object.keys(object);

  objectKeys.forEach((key) => promisedProperties.push(object[key]));

  return Promise.all(promisedProperties)
    .then((resolvedValues) => {
      return resolvedValues.reduce((resolvedObject, property, index) => {
        resolvedObject[objectKeys[index]] = property;
        return resolvedObject;
      }, object);
    });

}

fetch('index.adoc').then(res => res.text()).then(content => {

  const includesMap = {};

  const registry1 = asciidoctor.Extensions.create();
  registry1.includeProcessor(function () {
    var self = this;
    self.handles(_ => true);
    self.process(function (doc, reader, target, attrs) {
      if (!includesMap[target]) includesMap[target] = fetch(target).then(res => res.text());
      return reader.pushInclude('pending...', target, target, 1, attrs);
    });
  });

  asciidoctor.load(content, {'safe': 'safe', extension_registry: registry1});

  promiseObject(includesMap).then(includes => {

    const registry2 = asciidoctor.Extensions.create();
    registry2.includeProcessor(function () {
      var self = this;
      self.handles(_ => true);
      self.process(function (doc, reader, target, attrs) {
        return reader.pushInclude(includes[target], target, target, 1, attrs);
      });
    });

    const doc = asciidoctor.load(content, {'safe': 'safe', extension_registry: registry2});
    const html = doc.convert();
    root.innerHTML = html;

  });

});
ggrossetie commented 5 years ago

Hello @tigregalis

I had this discussion with @mojavelinux but it's not an easy task :sweat_smile:

As you saw the includes need to be resolved sequentially. In fact Asciidoctor parses the document from top to bottom and will resolve includes along the way. An AsciiDoc document can contain attributes and attributes can change how the document is processed. So if we don't process includes sequentially the output will be different.

For instance let's take this simple example:

main.adoc

include::attrs.adoc[]
include::installation.adoc[]

attrs.adoc

:program-version: 1.2.3

installation.adoc

Install program version {program-version}

In this example attrs.adoc must be processed before installation.adoc otherwise {program-version} won't be resolved. It's a simple example but of course you can have more complex situation where the attribute is defined in an include of include of include... or where we use condition to include or not a file.

However I think we can safely resolve includes asynchronously when the included file is not an AsciiDoc document:

[source,ruby]
----
include::app.rb[]
----

Or (embedded) images when :data-uri: is defined:

image::sunset.jpg[Sunset,300,200]

We could hack something in Asciidoctor.js but I think it would be better if Asciidoctor core knows that the function can be async...

mojavelinux commented 5 years ago

If we think outside of the include processor (so to speak), the solution to this is a lot easier than we realize. What we want to do is start processing the document once all the includes have been resolved. In order to do that, we flip the include processing inside-out. First, we grab all the include URLs, then toss them into a Promise.all to build a map of URL to content. When that promise resolves, we invoke the processor. The include processor can reach back into this resolved map to grab the content for a URL when it comes across an include. (In fact, you don't have to limit it to URLs since file IO is also async).

This is actually the right way to do async includes. If we designed the processor to understand an async include processor, all we would end up doing is turning it into a sync operation because we can't advance the line until we have the content.

The benefit of the approach I'm proposing is that it opens the door to managing the cache of the content you downloaded to avoid having to download it every time. And if you're processing a lot of documents, you could even avoid repeat visits to the same URL.

In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks). A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.

There might be some sharp edges we'll find, but I think we can get most of the way there, then think about how to take it the rest of the way after that.

mojavelinux commented 5 years ago

Btw, in order to support URL includes in Antora, I'd really like to do it this way instead of invoking a sync subprocess. (Though we could to the latter first, then improve on it).

Related issue in Antora: https://gitlab.com/antora/antora/issues/246

tigregalis commented 5 years ago

After reading @Mogztter's comment I was about to suggest, but @mojavelinux already beat me to it: Can you defer the resolution of attributes until all includes have been retrieved?

In the case of recursion, recursion is already an issue as it is:

recursion-1.adoc:

== this is from recursion 1

hello

include::recursion-2.adoc[]

recursion-2.adoc:

== this is from recursion 2

world

include::recursion-1.adoc[]

renders:

this is from recursion 1

hello

this is from recursion 2

world

this is from recursion 1

hello

this is from recursion 2

world

...

this is from recursion 1

hello

include::recursion-2.adoc[]

To deal with this, it would be appropriate to generate a tree of includes, and at each include, check along the current path from root-to-branch.

Given the following documents:

And the following includes, in this order:

The tree generated would be:

ggrossetie commented 5 years ago

The benefit of the approach I'm proposing is that it opens the door to managing the cache of the content you downloaded to avoid having to download it every time. And if you're processing a lot of documents, you could even avoid repeat visits to the same URL.

:+1:

In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks).

It might be an edge case but what about:

attrs.adoc

:include-all:

main.adoc

include::attrs.adoc[]

ifdef::include-all[]
include::a.adoc[]
include::b.adoc[]
include::c.adoc[]
include::d.adoc[]
include::e.adoc[]
include::f.adoc[]
// ...
endif::[]

If we took the naive approach we will fetch a.adoc, b.adoc, c.adoc, d.adoc, e.adoc, f.adoc... for nothing :disappointed: The other approach is to wait for attrs.adoc to resolve but then it's just a sequential asynchronous include resolver. We could of course try to do things in parallel if the include target does not contains any attribute nor include directive. It means that this API should be able to resolve attributes and evaluate conditions.

A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.

What do you mean by conditionals ? We should execute all the registered preprocessor but I don't get the conditionals ? Maybe you are referring to ifdef ?

In the case of recursion, recursion is already an issue as it is:

If there's a infinite loop, we should use the max-include-depth attribute. That's what Asciidoctor core Ruby is using to "abort" and avoid an endless recursive loop.

The tree generated would be:

Indeed if we have the complete tree we could do the resolution and detect loop early :+1:

tigregalis commented 5 years ago

A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.

What do you mean by conditionals ? We should execute all the registered preprocessor but I don't get the conditionals ? Maybe you are referring to ifdef ?

Attributes can also be in the target itself, e.g, https://docs.antora.org/antora/1.0/asciidoc/include-content/#include-partial

include::{partialsdir}/log-definition.adoc[]

In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks).

It might be an edge case but what about:

attrs.adoc

:include-all:

main.adoc

include::attrs.adoc[]

ifdef::include-all[]
include::a.adoc[]
include::b.adoc[]
include::c.adoc[]
include::d.adoc[]
include::e.adoc[]
include::f.adoc[]
// ...
endif::[]

If we took the naive approach we will fetch a.adoc, b.adoc, c.adoc, d.adoc, e.adoc, f.adoc... for nothing 😞 The other approach is to wait for attrs.adoc to resolve but then it's just a sequential asynchronous include resolver. We could of course try to do things in parallel if the include target does not contains any attribute nor include directive. It means that this API should be able to resolve attributes and evaluate conditions.

You could do it greedily, but iteratively.

As far as I'm aware, there are four types of includes:

The flow would be:

If all includes were unconditional, this is the best case scenario, and the lowest time spent waiting for files.

If all includes were conditional, this is the worst case scenario, and a lot of time would be spent waiting for files.

Example

Retrieve the root document

Parse the root document to find unconditional includes and conditional includes, and start building up the tree one-level deep, leaving a stub for each.

Retrieve all unconditional stubs. Don't retrieve the conditional stubs.

Parse the retrieved documents to find unconditional includes and conditional includes.

At this point, there are two conditions which are true: there are no unconditional stubs, and there are conditional stubs. So resolve the attributes up to the first conditional.

Even if the include-all attribute was changed within c.adoc, it was true at the start of the ifdef block. "Upgrade" the conditional stubs to unconditional:

Retrieve all unconditional stubs. Don't retrieve the conditional stubs.

Parse the retrieved documents to find unconditional includes and conditional includes.

At this point, there are no unconditional stubs, and there are no conditional stubs. So start processing the overall document.

ggrossetie commented 5 years ago

Thanks @tigregalis for the detailed example. I gave it more thoughts and it's basically what I had in mind 👍

@mojavelinux I'm wondering how we should proceed ? Maybe we could implement something in Asciidoctor.js and introduce an experimental API so we can test it against various documents and fine tune the solution (before committing to it in Asciidoctor core) ?

Speaking of attributes and resolution, it's reminding me of another issue: https://github.com/asciidoctor/asciidoctor/issues/1911#issuecomment-266967930 Maybe we could also address this issue since we will need to parse the included document. Just an idea to keep in mind as it might be better to stay focus on the initial goal.

ggrossetie commented 5 years ago

@tigregalis Are you working on it ? I want to give it a try but I don't want to step on your toes :wink:

tigregalis commented 5 years ago

@Mogztter Sorry, I've been on site, far from civilisation.

I've seen that you've started making some changes in https://github.com/asciidoctor/asciidoctor.js/pull/633 and read a bit of the discussion there.

I don't understand the codebase well enough to contribute, and the syntax has a lot of quirks (discussed further below), but I have been playing around with building a wrapper/extension as a proof-of-concept, with my plan as follows:

Other than asciidoctor.js itself processing the file(s) at the end, this should be completely async and should be quite efficient.

I've been going through the asciidoctor user manual and testing a number of scenarios, and I'm finding there's a lot of flexibility, a lot of exceptions and a lot of strange behaviour that is hard to account for. Asynchronously fetching includes is relatively easy, even doing nested includes: as long as it's not conditional. Doing conditional resolution of includes though... a lot of these edge-cases can be pretty nightmarish.

I guess it stems from the fact that includes are just points in the target file at which you can substitute the include text with the contents of the referenced file, i.e. partials. There are no syntactical rules for when an include is valid or not. All includes are valid. Some of this unfortunate flexibility is also found with the ifs and attributes.

This is an aside, but are syntactical changes within the scope of Asciidoctor 2.x? If so, there are a few ideas I have to make it more logical and tree-like:

I'd also suggest building it async-first.

A JSON AST as an output format would be excellent as well: I've dabbled with writing an extension to do this, but haven't gotten very far yet.

ggrossetie commented 5 years ago

@tigregalis No worries! Again thanks for taking the time to write your thoughts.

I think we should iterate on this feature and take one step at a time. While I agree with a lot of things you said I think we should leave attributes for now and focus on unconditional includes that don't rely on attributes. I think it set an achievable goal in a near future and we will still learn a lot in the process:

Even with this limited scope we have a few things to decide. For instance, should we use an include extension to resolve files from the "virtual file system" or should we add a "native" support in Asciidoctor core ? If we add a native support, Asciidoctor core could also ask the VFS to resolve images when using the data-uri attribute. We could create an API to manipulate the VFS (could be useful if the user wants to contribute to it):

vfs.addFile({path: '/path/to/file.adoc', contents: Buffer()})

or create/provide one:

asciidoctor.convert(input, {vfs: vfs})

At some point we will also need to decide if we should replace the include directive with the file content when it's "safe". Might be a premature optimization but we don't need to parse an include twice with this simple example:

main.adoc

include::a.adoc[]

a.adoc

a

In this case we could resolve the include and call Asciidoctor with the following input:

main.adoc

a

Anyway first thing first, let's work on the virtual file system and on the process to fetch files efficiently.

tigregalis commented 5 years ago

Haven't been keeping up with this for a while. Has this moved at all?

If I want to contribute to the codebase, where would be the best place to start (reading)?

ggrossetie commented 5 years ago

Haven't been keeping up with this for a while. Has this moved at all?

It's still a work in progress: https://github.com/asciidoctor/asciidoctor.js/pull/644 But given the fact that more and more JavaScript libraries are "async first" and that browsers might disallow synchronous XMLHttpRequest in a mid-term future I think it's important to find a solution.

If I want to contribute to the codebase, where would be the best place to start (reading)?

If you want to understand how Asciidoctor resolves the include directive then you should read the Asciidoctor Ruby code base. In fact, the code generated from Opal is harder to read.

The Asciidoctor.js code base is now in the packages/core directory. Please note that most of the code is generated from the Asciidoctor Ruby library.

If you want to override a method from Asciidoctor Ruby you should create a Ruby file. For instance we override the Document#fill_datetime_attributes method in lib/asciidoctor/js/asciidoctor_ext/document.rb (and the file is required in asciidoctor_ext.rb).

mojavelinux commented 8 months ago

@mojavelinux I'm wondering how we should proceed ?

To be honest, I just don't know. The language was designed in such a way that includes are synchronous and trying to decouple that in any way has side effects that may cause the parser to produce a different result. (Yes, I said what I said above, but in reality it's just more complex).

One possible solution is to do something like what Asciidoctor Reducer does, which is to expand all the includes before parsing (i.e., preprocess the document from top to bottom). The benefit there is that you can leverage the existing parser without having to resort to using what amounts to an alternate parser. Of course, that means Asciidoctor Reducer would need to be able to work with async operations itself...so changes may still be needed, but it's perhaps less invasive.

In the AsciiDoc Language, what we really need to do is be able to walk a document to identify preprocessor directives without parsing...which may even end up being a separate step in the parsing of AsciiDoc. It's a major problem for defining a grammar as well.

mojavelinux commented 8 months ago

browsers might disallow synchronous XMLHttpRequest

IMHO, this is just narrow minded of them.