Open tigregalis opened 5 years ago
Hello @tigregalis
I had this discussion with @mojavelinux but it's not an easy task :sweat_smile:
As you saw the includes need to be resolved sequentially. In fact Asciidoctor parses the document from top to bottom and will resolve includes along the way. An AsciiDoc document can contain attributes and attributes can change how the document is processed. So if we don't process includes sequentially the output will be different.
For instance let's take this simple example:
main.adoc
include::attrs.adoc[]
include::installation.adoc[]
attrs.adoc
:program-version: 1.2.3
installation.adoc
Install program version {program-version}
In this example attrs.adoc
must be processed before installation.adoc
otherwise {program-version}
won't be resolved.
It's a simple example but of course you can have more complex situation where the attribute is defined in an include of include of include... or where we use condition to include or not a file.
However I think we can safely resolve includes asynchronously when the included file is not an AsciiDoc document:
[source,ruby]
----
include::app.rb[]
----
Or (embedded) images when :data-uri:
is defined:
image::sunset.jpg[Sunset,300,200]
We could hack something in Asciidoctor.js but I think it would be better if Asciidoctor core knows that the function can be async...
If we think outside of the include processor (so to speak), the solution to this is a lot easier than we realize. What we want to do is start processing the document once all the includes have been resolved. In order to do that, we flip the include processing inside-out. First, we grab all the include URLs, then toss them into a Promise.all to build a map of URL to content. When that promise resolves, we invoke the processor. The include processor can reach back into this resolved map to grab the content for a URL when it comes across an include. (In fact, you don't have to limit it to URLs since file IO is also async).
This is actually the right way to do async includes. If we designed the processor to understand an async include processor, all we would end up doing is turning it into a sync operation because we can't advance the line until we have the content.
The benefit of the approach I'm proposing is that it opens the door to managing the cache of the content you downloaded to avoid having to download it every time. And if you're processing a lot of documents, you could even avoid repeat visits to the same URL.
In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks). A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.
There might be some sharp edges we'll find, but I think we can get most of the way there, then think about how to take it the rest of the way after that.
Btw, in order to support URL includes in Antora, I'd really like to do it this way instead of invoking a sync subprocess. (Though we could to the latter first, then improve on it).
Related issue in Antora: https://gitlab.com/antora/antora/issues/246
After reading @Mogztter's comment I was about to suggest, but @mojavelinux already beat me to it: Can you defer the resolution of attributes until all includes have been retrieved?
In the case of recursion, recursion is already an issue as it is:
recursion-1.adoc
:
== this is from recursion 1
hello
include::recursion-2.adoc[]
recursion-2.adoc
:
== this is from recursion 2
world
include::recursion-1.adoc[]
renders:
this is from recursion 1
hello
this is from recursion 2
world
this is from recursion 1
hello
this is from recursion 2
world
...
this is from recursion 1
hello
include::recursion-2.adoc[]
To deal with this, it would be appropriate to generate a tree of includes, and at each include, check along the current path from root-to-branch.
Given the following documents:
And the following includes, in this order:
The tree generated would be:
The benefit of the approach I'm proposing is that it opens the door to managing the cache of the content you downloaded to avoid having to download it every time. And if you're processing a lot of documents, you could even avoid repeat visits to the same URL.
:+1:
In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks).
It might be an edge case but what about:
attrs.adoc
:include-all:
main.adoc
include::attrs.adoc[]
ifdef::include-all[]
include::a.adoc[]
include::b.adoc[]
include::c.adoc[]
include::d.adoc[]
include::e.adoc[]
include::f.adoc[]
// ...
endif::[]
If we took the naive approach we will fetch a.adoc
, b.adoc
, c.adoc
, d.adoc
, e.adoc
, f.adoc
... for nothing :disappointed:
The other approach is to wait for attrs.adoc
to resolve but then it's just a sequential asynchronous include resolver. We could of course try to do things in parallel if the include target does not contains any attribute nor include directive.
It means that this API should be able to resolve attributes and evaluate conditions.
A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.
What do you mean by conditionals ? We should execute all the registered preprocessor but I don't get the conditionals ? Maybe you are referring to ifdef
?
In the case of recursion, recursion is already an issue as it is:
If there's a infinite loop, we should use the max-include-depth
attribute.
That's what Asciidoctor core Ruby is using to "abort" and avoid an endless recursive loop.
The tree generated would be:
Indeed if we have the complete tree we could do the resolution and detect loop early :+1:
A very naive approach would just be to look for any lines that match an include directive. A slightly better approach is to handle all preprocessor conditionals too.
What do you mean by conditionals ? We should execute all the registered preprocessor but I don't get the conditionals ? Maybe you are referring to
ifdef
?
Attributes can also be in the target itself, e.g, https://docs.antora.org/antora/1.0/asciidoc/include-content/#include-partial
include::{partialsdir}/log-definition.adoc[]
In order to make this work, we'd need an API for extracting the include targets. (Keep in mind this could be recursive). It's actually pretty simple since includes aren't context sensitive (they don't see blocks).
It might be an edge case but what about:
attrs.adoc
:include-all:
main.adoc
include::attrs.adoc[] ifdef::include-all[] include::a.adoc[] include::b.adoc[] include::c.adoc[] include::d.adoc[] include::e.adoc[] include::f.adoc[] // ... endif::[]
If we took the naive approach we will fetch
a.adoc
,b.adoc
,c.adoc
,d.adoc
,e.adoc
,f.adoc
... for nothing 😞 The other approach is to wait forattrs.adoc
to resolve but then it's just a sequential asynchronous include resolver. We could of course try to do things in parallel if the include target does not contains any attribute nor include directive. It means that this API should be able to resolve attributes and evaluate conditions.
You could do it greedily, but iteratively.
As far as I'm aware, there are four types of includes:
ifdef
block and no text-based attribute(s) in the target,ifdef
block and no text-based attribute(s) in the target,ifdef
block and with text-based attribute(s) in the target.The flow would be:
1 Retrieve document if document has not yet been retrieved and does not already appear in the path from root-to-branch.
2 Parse document to find unconditional includes and conditional includes, leaving a stub for each.
3 For each unconditional stub (or conditional stub where the precedent attribute(s) is locked),
4 When there are no more unconditional stubs AND there are conditional stubs, resolve the attributes in all documents up to the first conditional.
5 If attribute is part of an ifdef
and attribute is defined and true, upgrade all dependent stubs to unconditional. If within a target and attribute is defined, resolve the attribute in the target.
3 For each unconditional stub (or conditional stub where the precedent attribute(s) is locked),
4 When there are no more unconditional stubs AND there are conditional stubs, resolve the attributes in all documents up to the first conditional.
5 If attribute is part of an ifdef
and attribute is defined and true, upgrade all dependent stubs to unconditional. If within a target and attribute is defined, resolve the attribute in the target.
3 For each unconditional stub (or conditional stub where the precedent attribute(s) is locked),
...
6 When there are no more unconditional stubs AND there are no more conditionals, begin inserting document content and preparing the document for processing.
If all includes were unconditional, this is the best case scenario, and the lowest time spent waiting for files.
If all includes were conditional, this is the worst case scenario, and a lot of time would be spent waiting for files.
Retrieve the root document
main.adoc
["main.adoc"] RETRIEVEDParse the root document to find unconditional includes and conditional includes, and start building up the tree one-level deep, leaving a stub for each.
main.adoc
["main.adoc"] RETRIEVED PARSED
attrs.adoc
["main.adoc", "attrs.adoc"] STUBinclude-all
) PENDINGa.adoc
["main.adoc", "a.adoc"] STUBb.adoc
["main.adoc", "b.adoc"] STUBc.adoc
["main.adoc", "c.adoc"] STUBd.adoc
["main.adoc", "d.adoc"] STUBe.adoc
["main.adoc", "e.adoc"] STUBf.adoc
["main.adoc", "f.adoc"] STUBRetrieve all unconditional stubs. Don't retrieve the conditional stubs.
main.adoc
["main.adoc"] RETRIEVED PARSED
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVEDinclude-all
) PENDINGa.adoc
["main.adoc", "a.adoc"] STUBb.adoc
["main.adoc", "b.adoc"] STUBc.adoc
["main.adoc", "c.adoc"] STUBd.adoc
["main.adoc", "d.adoc"] STUBe.adoc
["main.adoc", "e.adoc"] STUBf.adoc
["main.adoc", "f.adoc"] STUBParse the retrieved documents to find unconditional includes and conditional includes.
main.adoc
["main.adoc"] RETRIEVED PARSED
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVED PARSED (no includes)include-all
) PENDINGa.adoc
["main.adoc", "a.adoc"] STUBb.adoc
["main.adoc", "b.adoc"] STUBc.adoc
["main.adoc", "c.adoc"] STUBd.adoc
["main.adoc", "d.adoc"] STUBe.adoc
["main.adoc", "e.adoc"] STUBf.adoc
["main.adoc", "f.adoc"] STUBAt this point, there are two conditions which are true: there are no unconditional stubs, and there are conditional stubs. So resolve the attributes up to the first conditional.
main.adoc
["main.adoc"] RETRIEVED PARSED RESOLVED (no attributes)
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVED PARSED (no includes) RESOLVED - (include-all
: true)include-all
) TRUEa.adoc
["main.adoc", "a.adoc"] STUBb.adoc
["main.adoc", "b.adoc"] STUBc.adoc
["main.adoc", "c.adoc"] STUBd.adoc
["main.adoc", "d.adoc"] STUBe.adoc
["main.adoc", "e.adoc"] STUBf.adoc
["main.adoc", "f.adoc"] STUBEven if the include-all
attribute was changed within c.adoc
, it was true at the start of the ifdef
block. "Upgrade" the conditional stubs to unconditional:
main.adoc
["main.adoc"] RETRIEVED PARSED RESOLVED (no attributes)
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVED PARSED (no includes) RESOLVED (include-all
: true)a.adoc
["main.adoc", "a.adoc"] STUBb.adoc
["main.adoc", "b.adoc"] STUBc.adoc
["main.adoc", "c.adoc"] STUBd.adoc
["main.adoc", "d.adoc"] STUBe.adoc
["main.adoc", "e.adoc"] STUBf.adoc
["main.adoc", "f.adoc"] STUBRetrieve all unconditional stubs. Don't retrieve the conditional stubs.
main.adoc
["main.adoc"] RETRIEVED PARSED RESOLVED (no attributes)
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVED PARSED (no includes) RESOLVED (include-all
: true)a.adoc
["main.adoc", "a.adoc"] RETRIEVEDb.adoc
["main.adoc", "b.adoc"] RETRIEVEDc.adoc
["main.adoc", "c.adoc"] RETRIEVEDd.adoc
["main.adoc", "d.adoc"] RETRIEVEDe.adoc
["main.adoc", "e.adoc"] RETRIEVEDf.adoc
["main.adoc", "f.adoc"] RETRIEVEDParse the retrieved documents to find unconditional includes and conditional includes.
main.adoc
["main.adoc"] RETRIEVED PARSED RESOLVED (no attributes)
attrs.adoc
["main.adoc", "attrs.adoc"] RETRIEVED PARSED (no includes) RESOLVED (include-all
: true)a.adoc
["main.adoc", "a.adoc"] RETRIEVED PARSED (no includes)b.adoc
["main.adoc", "b.adoc"] RETRIEVED PARSED (no includes)c.adoc
["main.adoc", "c.adoc"] RETRIEVED PARSED (no includes)d.adoc
["main.adoc", "d.adoc"] RETRIEVED PARSED (no includes)e.adoc
["main.adoc", "e.adoc"] RETRIEVED PARSED (no includes)f.adoc
["main.adoc", "f.adoc"] RETRIEVED PARSED (no includes)At this point, there are no unconditional stubs, and there are no conditional stubs. So start processing the overall document.
Thanks @tigregalis for the detailed example. I gave it more thoughts and it's basically what I had in mind 👍
@mojavelinux I'm wondering how we should proceed ? Maybe we could implement something in Asciidoctor.js and introduce an experimental API so we can test it against various documents and fine tune the solution (before committing to it in Asciidoctor core) ?
Speaking of attributes and resolution, it's reminding me of another issue: https://github.com/asciidoctor/asciidoctor/issues/1911#issuecomment-266967930 Maybe we could also address this issue since we will need to parse the included document. Just an idea to keep in mind as it might be better to stay focus on the initial goal.
@tigregalis Are you working on it ? I want to give it a try but I don't want to step on your toes :wink:
@Mogztter Sorry, I've been on site, far from civilisation.
I've seen that you've started making some changes in https://github.com/asciidoctor/asciidoctor.js/pull/633 and read a bit of the discussion there.
I don't understand the codebase well enough to contribute, and the syntax has a lot of quirks (discussed further below), but I have been playing around with building a wrapper/extension as a proof-of-concept, with my plan as follows:
if
passages, code listings (code listings don't resolve attributes), and include
s in the fetched file.include
s have no dependencies, building up a "dependency tree".include
tree" (might be the same as the "dependency tree") so as to determine "root-to-branch" and prevent recursion entirely (I can't think of a single use-case for recursive text inclusion, especially without exit conditions).if
passages, code listings (code listings don't resolve attributes), and include
s in the fetched files.include
with a dependency.ifn?def
, determine whether defined, for ifeval
and {attribute-use}
, retrieve the value from the "attribute store"). If the file is to be included, it can effectively be treated as a new root file.include
processor that when encountering an include
, will intercept it and grab the files from the "virtual file system".Other than asciidoctor.js itself processing the file(s) at the end, this should be completely async and should be quite efficient.
I've been going through the asciidoctor user manual and testing a number of scenarios, and I'm finding there's a lot of flexibility, a lot of exceptions and a lot of strange behaviour that is hard to account for. Asynchronously fetching includes is relatively easy, even doing nested includes: as long as it's not conditional. Doing conditional resolution of includes though... a lot of these edge-cases can be pretty nightmarish.
if
: ifdef::should-include-partial[include::{partial-name}.adoc[\]]
. This one also has an attribute in the square brackets.if
s, but how these work is unclear.if
(e.g. ifdef::some-attribute[]
followed later by endif::a-different-attribute[]
), I can't follow the logic at all.if
partials):
I guess it stems from the fact that includes are just points in the target file at which you can substitute the include text with the contents of the referenced file, i.e. partials. There are no syntactical rules for when an include is valid or not. All includes are valid. Some of this unfortunate flexibility is also found with the if
s and attributes.
This is an aside, but are syntactical changes within the scope of Asciidoctor 2.x? If so, there are a few ideas I have to make it more logical and tree-like:
[literal,indent=0]
before any literal paragraph.importattrs::attrs.adoc[attribute1,attribute2]
(also opens it up to importing JSON, TOML and other formats), and when processing an included file use options e.g. include::author.adoc[attrs=firstname,surname,picture]
. These would be breaking changes of course. I'd also suggest building it async-first.
A JSON AST as an output format would be excellent as well: I've dabbled with writing an extension to do this, but haven't gotten very far yet.
@tigregalis No worries! Again thanks for taking the time to write your thoughts.
I think we should iterate on this feature and take one step at a time. While I agree with a lot of things you said I think we should leave attributes for now and focus on unconditional includes that don't rely on attributes. I think it set an achievable goal in a near future and we will still learn a lot in the process:
file://
, http://
, https://
)Even with this limited scope we have a few things to decide.
For instance, should we use an include extension to resolve files from the "virtual file system" or should we add a "native" support in Asciidoctor core ? If we add a native support, Asciidoctor core could also ask the VFS to resolve images when using the data-uri
attribute.
We could create an API to manipulate the VFS (could be useful if the user wants to contribute to it):
vfs.addFile({path: '/path/to/file.adoc', contents: Buffer()})
or create/provide one:
asciidoctor.convert(input, {vfs: vfs})
At some point we will also need to decide if we should replace the include directive with the file content when it's "safe". Might be a premature optimization but we don't need to parse an include twice with this simple example:
main.adoc
include::a.adoc[]
a.adoc
a
In this case we could resolve the include and call Asciidoctor with the following input:
main.adoc
a
Anyway first thing first, let's work on the virtual file system and on the process to fetch files efficiently.
Haven't been keeping up with this for a while. Has this moved at all?
If I want to contribute to the codebase, where would be the best place to start (reading)?
Haven't been keeping up with this for a while. Has this moved at all?
It's still a work in progress: https://github.com/asciidoctor/asciidoctor.js/pull/644 But given the fact that more and more JavaScript libraries are "async first" and that browsers might disallow synchronous XMLHttpRequest in a mid-term future I think it's important to find a solution.
If I want to contribute to the codebase, where would be the best place to start (reading)?
If you want to understand how Asciidoctor resolves the include directive then you should read the Asciidoctor Ruby code base. In fact, the code generated from Opal is harder to read.
The Asciidoctor.js code base is now in the packages/core directory. Please note that most of the code is generated from the Asciidoctor Ruby library.
If you want to override a method from Asciidoctor Ruby you should create a Ruby file. For instance we override the Document#fill_datetime_attributes
method in lib/asciidoctor/js/asciidoctor_ext/document.rb (and the file is required in asciidoctor_ext.rb).
@mojavelinux I'm wondering how we should proceed ?
To be honest, I just don't know. The language was designed in such a way that includes are synchronous and trying to decouple that in any way has side effects that may cause the parser to produce a different result. (Yes, I said what I said above, but in reality it's just more complex).
One possible solution is to do something like what Asciidoctor Reducer does, which is to expand all the includes before parsing (i.e., preprocess the document from top to bottom). The benefit there is that you can leverage the existing parser without having to resort to using what amounts to an alternate parser. Of course, that means Asciidoctor Reducer would need to be able to work with async operations itself...so changes may still be needed, but it's perhaps less invasive.
In the AsciiDoc Language, what we really need to do is be able to walk a document to identify preprocessor directives without parsing...which may even end up being a separate step in the parsing of AsciiDoc. It's a major problem for defining a grammar as well.
browsers might disallow synchronous XMLHttpRequest
IMHO, this is just narrow minded of them.
I'd like to fetch documents and includes asynchronously (in-browser), using
fetch
.By default, it uses synchronous XMLHttpRequest, which locks the browser and also give us a warning:
That's why I'd like to suggest a loadAsync / convertAsync API for Asciidoctor.js.
Perhaps a related suggestion is a way for the user to specify their own http client, e.g. if someone wishes to use fetch, axios, or node-fetch.
As a really convoluted work-around / proof-of-concept, I've managed to use two include processor extensions and Promises to progressively "load" the document. Right now this will only go down to one level of includes. There are a number of other issues of course, but primarily is that it's loading/parsing the document more than once.