Should Mr. Docs be XML-only?

vinniefalco commented 1 month ago

If we make Mr. Docs XML-only, we get a number of benefits:

- No need for bitcode
- No need for plugin extensions
- Extraction can be optimized for XML
- No need for built-in handlebars, duktape, or Lua
- XML to markdown converters can be purpose-built
  - flexibility in choice of tooling
  - industry standard components like Jinja or handlebars.js
- Coverage and unit tests become comprehensive

However there could be downsides:

- The XML output needs to be refactored to allow incremental and parallel processing
- Languages like Python or JavaScript could be slower than C++ (no Boost.Unordered)

Some of my rambling:

Maybe our XML output should be close to flat, instead of having nested scopes.

The advantage of making Mr. Docs XML-only is that we can focus on doing one thing and doing it well. We can optimize every step of the extraction for XML, without worrying about bitcode or other representations. And we wouldn't need plugins. This is a tradeoff though, because we have 1. the problem of emitting one huge XML file, and 2. performance questions about converting XML to markdown using a sandboxed language.

And we get to use industry standard components like Jinja or handlebars. The author of the XML converter has complete freedom to use any tools they want.

I'm ok with emitting a single XML file but I think we need to be smart about the format. We should consider flattening the output so that a child comes after the parent rather than being nested in the parent. We can use node IDs or whatever, the "id" field, to refer to the parent. If we order the entries in the XML output from the top scope down to the most nested scope, then the XML converter can ingest parent scopes incrementally and then decide if it wants to launch additional threads to process the children.
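
A minimal sketch of how a converter might ingest such a flat file incrementally. The `<symbol>` element and its `id`, `parent`, `kind`, and `name` attributes are assumptions made for illustration, not the current MrDocs XML schema. Because every parent precedes its children, each record can be resolved the moment it is read:

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical flat layout: parents always appear before their children.
FLAT_XML = b"""<symbols>
  <symbol id="1" kind="namespace" name="boost"/>
  <symbol id="2" parent="1" kind="namespace" name="urls"/>
  <symbol id="3" parent="2" kind="class" name="url_view"/>
  <symbol id="4" parent="3" kind="function" name="size"/>
</symbols>"""

def ingest(stream):
    """Stream symbols one at a time, resolving each parent as it is seen."""
    qualified = {}                 # id -> fully qualified name
    children = defaultdict(list)   # parent id -> child ids
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "symbol":
            continue
        sym_id = elem.get("id")
        parent = elem.get("parent")
        # The ordering guarantees the parent has already been ingested.
        prefix = qualified[parent] + "::" if parent else ""
        qualified[sym_id] = prefix + elem.get("name")
        if parent:
            children[parent].append(sym_id)
        elem.clear()  # drop the contents of the element we just processed
    return qualified, children

qualified, children = ingest(io.BytesIO(FLAT_XML))
print(qualified["4"])  # boost::urls::url_view::size
```

Once a scope has been read, a converter could hand that scope's `children` entries to worker threads while it keeps reading the rest of the file.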

vinniefalco commented 1 month ago

The XML format could be "breadth-first." That is, once we have extracted all the symbols into a tree, the XML file is produced by performing a breadth-first visitation on the root. There should be no nested symbols. That is, a namespace will not have children. Instead the children will refer to the namespace by id, and the namespace will be seen first.
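
A rough sketch of the breadth-first emission described above, written in Python for brevity rather than as MrDocs C++ code. The `Symbol` class and the `<symbol>` element with its `parent` attribute are hypothetical names chosen for the example:

```python
import xml.etree.ElementTree as ET
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Symbol:
    id: str
    kind: str
    name: str
    children: list = field(default_factory=list)

def flatten_breadth_first(root):
    """Emit every symbol as a sibling element, visiting the tree breadth-first."""
    out = ET.Element("symbols")
    queue = deque([(root, None)])  # pairs of (symbol, parent id)
    while queue:
        sym, parent_id = queue.popleft()
        attrs = {"id": sym.id, "kind": sym.kind, "name": sym.name}
        if parent_id is not None:
            attrs["parent"] = parent_id  # refer to the parent by id, no nesting
        ET.SubElement(out, "symbol", attrs)
        queue.extend((child, sym.id) for child in sym.children)
    return out

tree = Symbol("1", "namespace", "boost", [
    Symbol("2", "namespace", "urls", [Symbol("4", "class", "url_view")]),
    Symbol("3", "namespace", "json"),
])
flat = flatten_breadth_first(tree)
order = [e.get("id") for e in flat]
print(order)  # ['1', '2', '3', '4']
```

Both namespaces are written before `url_view`, and no `<symbol>` element has child elements, so a reader always sees a scope before anything declared inside it.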

alandefreitas commented 1 month ago

This matches my intuition of what MrDocs should have been when it started. It would do one thing well, at least in the MVP, and we would have additional utilities over time. I also have stated that I'm in favor of removing some significant things, although for different reasons, like:

- The HTML generator, permanently, because the path from Asciidoc or XML to HTML is too obvious.
- duktape, because the version of JavaScript it supports is so old that it would only confuse people. Most code people would try to write would include something duktape doesn't support.
- Lua, at least for now, because the integration is not implemented and the bundled files are just sitting there.

Each of these things could be reevaluated in the future, but I don't think these dependencies are helping us a lot right now.

However, I'm not too fond of removing the Asciidoc generator now. We depend a lot on Asciidoc for other projects, and removing Asciidoc support now would complicate things a lot. Although this could have been a good idea at the beginning of the project, I don't think it's a good idea now:

In terms of doing one thing well, we would be removing precisely the one thing MrDocs does well because we invested a lot in Asciidoc, which is now in a very stable state. We only have to update the templates whenever there's a new feature, which is not that often. It's also very performant and is better than any template tool users could use in any post-processing step.
Doxygen and all other tools have two generators: one that generates pages and one for post-processing. And the ones that don't only have the pages. MrDocs would be losing to Doxygen on this front. This decision is not arbitrary on their part. Most users just want the pages and don't want to (or don't have time to) write another application just to convert the XML to pages. The XML exposes way more information than the typical user wants to customize. And writing this secondary application always takes a lot of time, as we've seen with Docca. It also fragments the ecosystem because there's no standard system to do that even for Doxygen even after decades.
This would block users from moving from Doxygen. One of the annoyances of Doxygen is having to write this secondary application to customize anything. It's good to have a system that allows people to customize everything, but it's terrible that this is the only solution in Doxygen to customize anything. Once users invest all the work in Doxygen, they are unlikely to write and maintain another secondary application for MrDocs. For instance, we've seen new users lately. I'm sure 100% of them would give up if they learned they had to write this secondary application to get started with MrDocs.
There's no system to convert MrDocs XML to Asciidoc or HTML pages that has been implemented yet. We could say that, in principle, we would be offloading the work to users. And I don't think we are offloading this work to users because I'm 100% sure no user wants to do that. This work would be offloaded to us as users (Boost.URL, etc). We would end up providing an example converter with MrDocs. So, instead of removing work, we would just be reimplementing most of the work we already have in MrDocs (and the work it already does well) in another language that we know is less efficient. If some Python generator is provided in the same project as MrDocs, this would be even more explicit, and it would only complicate testing.
It would break other projects depending on the Asciidoc generator, particularly Boost.URL reference (which is already in the Boost release), all Antora extensions on which it depends, and the adaptations to the Boost release process. I would have to halt work on MrDocs once again to reimplement all of these solutions. It would also make these solutions more complex, with new dependencies and steps for post-processing and the secondary application for post-processing.
We would be reallocating a lot of people and time to new projects, which would slow development in MrDocs even more. I would be back in Antora extensions for months, and someone else would need to come here to reimplement everything we have for Asciidoc in Python. Two people are spent here.
The complete workflow would become much less efficient, and that might go over a threshold we don't want to experiment with. The full workflow that generates Asciidoc for Boost.URL takes 3 seconds. That could become many minutes if we have a Python post-processing step. This could sound good enough if we assume it's just something we're offloading to a hypothetical user. But that doesn't sound good when the user is us. For instance, let's assume many other Boost libraries start using MrDocs with the same workflow. Now, the time to build the Boost release could go from 40 minutes to many hours, and we'll have to worry about many other problems we don't have right now, like paying for more self-hosted runners because the public runners are timing out.
The generators and templates are a great selling point of MrDocs. Customizing things in Doxygen is complex and a practical blocker for most people. Boost authors go much further than others to make their library documentation nice, but others never go as far, even for large and successful projects. New tools need to be 10x better for people to move. By providing only XML, our target audience is reduced to the minority of Doxygen users who take the XML route. Our value proposition for them becomes that they'll have to reimplement their secondary application for our XML format (we can't reuse doxygen's format because we extract more information) for the benefit of having some C++ concepts exposed directly in the XML instead of them having to infer it from the Doxygen XML (Doxygen XML already exposes things like concepts).

vinniefalco commented 1 month ago

Most of the stuff you mentioned are not really relevant. The one thing which is:

The complete workflow would become much less efficient

That's a problem.

vinniefalco commented 1 month ago

After some analysis, it looks like XML-only is dead on arrival for now. We should simply continue to develop the Asciidoc generator we have now, improve the extraction via clang, and we can revisit the XML-only idea later. The problem areas which have to be reimplemented in Python are:

- SafeNames
- Concurrent processing
- Calculation of links

And having an immutable data structure which is shared between multiple Python threads of execution.
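
For illustration, one way the shared immutable structure might look in Python, assuming a hypothetical `Symbol` record and a read-only view over the symbol table. This is a sketch of the idea, not a measured or recommended design:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from types import MappingProxyType
from typing import Optional

@dataclass(frozen=True)  # frozen: attributes cannot be reassigned
class Symbol:
    id: str
    name: str
    parent: Optional[str] = None

def build_corpus(symbols):
    """Wrap the symbol table in a read-only mapping view."""
    return MappingProxyType({s.id: s for s in symbols})

def render(corpus, sym_id):
    """Render one symbol; this only reads from the shared corpus."""
    parts = []
    cur = corpus[sym_id]
    while cur is not None:
        parts.append(cur.name)
        cur = corpus[cur.parent] if cur.parent else None
    return "::".join(reversed(parts))

corpus = build_corpus([
    Symbol("1", "boost"),
    Symbol("2", "urls", "1"),
    Symbol("3", "url_view", "2"),
])

# Every worker reads the same corpus object; nothing is copied or locked.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(lambda i: render(corpus, i), corpus))
print(pages)  # ['boost', 'boost::urls', 'boost::urls::url_view']
```

In CPython builds that have the global interpreter lock, threads like these do not execute CPU-bound rendering in parallel, which is part of why the performance of a Python converter is an open question.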

alandefreitas commented 1 month ago

Most of the stuff you mentioned are not really relevant

Yes. That's true if we find an easy way to integrate the post-processing step.

it looks like XML-only is dead on arrival for now

We'll only know about efficiency with some experiments.

alandefreitas commented 1 month ago

Pasting some comments by René:

Not knowing how things currently work.. Couldn't you just provide format agnostic templating? Then people only have to deal with learning the template language+system. And they can get whatever format they wish out of it.

One that gets invoked for all aspects of the object model. Instead of a combination of in-built output text plus templates. It's common to only invoke templates in some systems for "fragments". And one that provides the entire object model when invoked.

PS. Entire object model appropriately scoped/contextualized though.

-- That kind of sounds like the XML-only MrDocs proposed today. MrDocs would only generate XML and the generators would be defined as a folder with a file that describes what to do with that XML.

That can work.. But it can also fail by creating a disconnect between the internal object model and the XML object model. It also introduces a reparse step as people must consume the XML. Which, depending on the language, the XML implementation, and the size of the XML data, can introduce serious overheads.

But on the flip side.. Having the templating engine built-in does limit you to that engine. Which may or may not be an issue.

I guess the ultimate solution would be use only a built-in templating system and include an XML output template written in that templating system. Then people have the choice of writing directly with the template engine. Or parsing the XML if that's more convenient.

vinniefalco commented 1 month ago

@grafikrobot in the current implementation, authors can create additional Generator extensions implemented as dynamically loaded DLL or shared object files. So Mr. Docs today is not in theory limited to the Handlebars templates that we have created. The nice thing is that developing for our extensions API does not require a local installation of clang/LLVM source code.

There are overheads with XML but I believe we can mitigate them completely, or close to it, by restructuring the XML. Making it flat instead of hierarchical will allow parallel processing. And we can build a separate index of file offsets to enable random access. I am confident we can make this work, but it isn't clear that Python will deliver us the performance we expect.
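
A small sketch of the offset index idea, assuming a hypothetical flat file that holds one `<symbol/>` record per line. Here the index is built by scanning the file; a producer could just as well write it out as a separate file while emitting the XML:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical flat file with one <symbol/> record per line.
FLAT = (
    b'<symbols>\n'
    b'<symbol id="1" kind="namespace" name="boost"/>\n'
    b'<symbol id="2" parent="1" kind="class" name="url"/>\n'
    b'<symbol id="3" parent="2" kind="function" name="size"/>\n'
    b'</symbols>\n'
)

def build_offset_index(stream):
    """Map each symbol id to the (byte offset, length) of its record."""
    index = {}
    offset = 0
    for line in stream:
        if line.startswith(b"<symbol "):
            sym_id = ET.fromstring(line).get("id")
            index[sym_id] = (offset, len(line))
        offset += len(line)
    return index

def load_symbol(stream, index, sym_id):
    """Seek straight to one record and parse only that record."""
    offset, length = index[sym_id]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))

stream = io.BytesIO(FLAT)
index = build_offset_index(stream)
sym = load_symbol(stream, index, "3")
print(sym.get("name"))  # size
```

With such an index, each worker can open the file independently and parse only the records it was assigned, instead of parsing the whole document.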

Of course I would love your "markdown-agnostic" solution, but the annoying implementation details get in the way. For example, computation of the "SafeName" for a C++ symbol very much depends on the target Markdown language, because some special characters are valid in some markdowns and not others. In fact the SafeName also has to kind of care about the target filesystem, because it can't use characters which are illegal or special for that filesystem. Our current implementation limits the set of characters in a SafeName to only the subset of characters which are not special on Linux or Windows partitions.
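
To make the constraint concrete, here is a hypothetical sketch of a safe-name function. It is not the MrDocs SafeNames algorithm; the allow-list, the `-` scope separator, and the digest suffix are all assumptions chosen for the example:

```python
import hashlib
import re

# Assumption: a conservative allow-list of characters that are not special
# on common Linux or Windows filesystems.
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")

def safe_name(qualified_name, signature=""):
    """Derive a filesystem-friendly name from a C++ symbol name."""
    # Sanitize each scope separately, then join the scopes with '-'.
    parts = [_UNSAFE.sub("_", p) for p in qualified_name.split("::")]
    base = "-".join(parts)
    # Sanitizing can make distinct symbols collide (operator== and operator!=
    # both become operator__), and names that differ only by letter case
    # collide on case-insensitive filesystems, so append a short digest of
    # the original spelling.
    digest = hashlib.sha1((qualified_name + signature).encode()).hexdigest()[:8]
    return f"{base}-{digest}"

eq = safe_name("boost::urls::url_view::operator==")
ne = safe_name("boost::urls::url_view::operator!=")
print(eq)
print(ne)
```

A different target format or filesystem would need a different allow-list, which is the dependency on the output target described above.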

Furthermore, forming a link to a symbol depends on the target markdown language. It is kind of ugly and for performance reasons best implemented natively. Having a partial / template produce the link would be a mess. The formed link also depends on whether you are doing single-page or multi-page.
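
As a sketch of why link formation depends on both the target format and the page layout, consider a hypothetical `make_link` helper. The function name, its parameters, and the file naming are assumptions; only the Asciidoc and HTML link syntax is standard:

```python
import html

def make_link(safe, text, fmt, multipage):
    """Form a reference to a symbol for a given output format and layout.

    Escaping of Asciidoc special characters in `text` is omitted for brevity.
    """
    if fmt == "adoc":
        if multipage:
            return f"xref:{safe}.adoc[{text}]"   # link to another document
        return f"<<{safe},{text}>>"              # anchor within the same page
    if fmt == "html":
        href = f"{safe}.html" if multipage else f"#{safe}"
        return f'<a href="{href}">{html.escape(text)}</a>'
    raise ValueError(f"unsupported format: {fmt}")

print(make_link("boost-urls-url_view", "url_view", "adoc", True))
# xref:boost-urls-url_view.adoc[url_view]
print(make_link("boost-urls-url_view", "url_view", "adoc", False))
# <<boost-urls-url_view,url_view>>
print(make_link("boost-urls-url_view", "url_view", "html", True))
# <a href="boost-urls-url_view.html">url_view</a>
```

The same symbol yields three different strings, and each branch also relies on the safe name being valid for that target.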

In other words there are a small handful of markdown-specific algorithms which are best expressed in C++, having access to the in-memory representation of the program's metadata. I appreciate that we explored the XML-only solution, and I think for now we should just finish what we have so that we have uncovered all the unknowns. Get this into the hands of users and gather some months or years of field experience. And then, after all the rough edges are exposed and smoothed out, we can revisit this XML-only idea.

cppalliance / mrdocs

Should Mr. Docs be XML-only? #678