Open vinniefalco opened 1 month ago
The XML format could be "breadth-first." That is, once we have extracted all the symbols into a tree, the XML file is produced by performing a breadth-first visitation on the root. There should be no nested symbols. That is, a namespace will not have children. Instead the children will refer to the namespace by id, and the namespace will be seen first.
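A minimal sketch of the breadth-first idea, assuming a hypothetical in-memory symbol tree of `(id, kind, name, children)` tuples. The element name `symbol` and the `parent` attribute are illustrative choices, not the actual MrDocs schema:

```python
from collections import deque
import xml.etree.ElementTree as ET


def flatten_breadth_first(root):
    """Return a flat <symbols> element; children refer to their parent by id."""
    out = ET.Element("symbols")
    queue = deque([(root, None)])  # pairs of (symbol, parent id)
    while queue:
        (sym_id, kind, name, children), parent_id = queue.popleft()
        attrs = {"id": sym_id, "kind": kind, "name": name}
        if parent_id is not None:
            attrs["parent"] = parent_id
        # Every symbol is a direct child of the root: no nesting.
        ET.SubElement(out, "symbol", attrs)
        for child in children:
            queue.append((child, sym_id))
    return out


# Hypothetical symbol tree used only for illustration.
tree = ("1", "namespace", "ns", [
    ("2", "namespace", "detail", [
        ("4", "function", "parse", []),
    ]),
    ("3", "class", "widget", []),
])
flat = flatten_breadth_first(tree)
xml_text = ET.tostring(flat, encoding="unicode")
```

Because the queue is first-in first-out, a namespace is always written before any symbol that names it as `parent`.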
This matches my intuition of what MrDocs should have been when it started. It would do one thing well, at least in the MVP, and we would have additional utilities over time. I also have stated that I'm in favor of removing some significant things, although for different reasons, like:
Each of these things could be reevaluated in the future, but I don't think these dependencies are helping us much right now.
However, I'm not too fond of removing the Asciidoc generator now. We depend a lot on Asciidoc for other projects, and removing Asciidoc support now would complicate things a lot. Although this could have been a good idea at the beginning of the project, I don't think it's a good idea now:
Most of the stuff you mentioned are not really relevant. The one thing which is:
The complete workflow would become much less efficient
That's a problem.
After some analysis, it looks like XML-only is dead on arrival for now. We should simply continue to develop the Asciidoctor generator we have now, improve the extraction via clang, and we can revisit the XML-only idea later. The problem areas which have to be reimplemented in Python are:
And having an immutable data structure which is shared between multiple Python threads of execution.
Most of the stuff you mentioned are not really relevant
Yes. That's true if we find an easy way to integrate the post-processing step.
it looks like XML-only is dead on arrival for now
We'll only know about efficiency with some experiments.
Pasting some comments by René:
Not knowing how things currently work.. Couldn't you just provide format-agnostic templating? Then people only have to deal with learning the template language+system. And they can get whatever format they wish out of it.
One that gets invoked for all aspects of the object model. Instead of a combination of in-built output text plus templates. It's common to only invoke templates in some systems for "fragments". And one that provides the entire object model when invoked.
PS. Entire object model appropriately scoped/contextualized though.
-- That kind of sounds like the XML-only MrDocs proposed today. MrDocs would only generate XML and the generators would be defined as a folder with a file that describes what to do with that XML.
That can work.. But it can also fail by creating a disconnect between the internal object model and the XML object model. It also introduces a reparse step as people must consume the XML. Which, depending on the language, the XML implementation, and the size of the XML data, can introduce serious overheads.
But on the flip side.. Having the templating engine built-in does limit you to that engine. Which may or may not be an issue.
I guess the ultimate solution would be to use only a built-in templating system and include an XML output template written in that templating system. Then people have the choice of writing directly with the template engine. Or parsing the XML if that's more convenient.
@grafikrobot in the current implementation, authors can create additional Generator extensions implemented as dynamically loaded DLL or shared object files. So Mr. Docs today is not in theory limited to the Handlebars templates that we have created. The nice thing is that developing for our extensions API does not require a local installation of clang/LLVM source code.
There are overheads with XML but I believe we can mitigate them completely, or close to it, by restructuring the XML. Making it flat instead of hierarchical will allow parallel processing. And we can build a separate index of file offsets to enable random access. I am confident we can make this work, but it isn't clear that Python will deliver us the performance we expect.
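To make the offset-index idea concrete, here is a rough sketch. It assumes one `<symbol/>` record per line and a plain dictionary as the index. Both are assumptions for illustration, and `write_flat` and `read_symbol` are hypothetical helpers:

```python
import io
import xml.etree.ElementTree as ET


def write_flat(records, stream):
    """Write one <symbol/> per record to a binary stream.

    Returns an index mapping each id to its (byte offset, byte length).
    """
    index = {}
    stream.write(b"<symbols>\n")
    for attrs in records:
        data = ET.tostring(ET.Element("symbol", attrs)) + b"\n"
        index[attrs["id"]] = (stream.tell(), len(data))
        stream.write(data)
    stream.write(b"</symbols>\n")
    return index


def read_symbol(stream, index, sym_id):
    """Seek straight to one record and parse only that record."""
    offset, length = index[sym_id]
    stream.seek(offset)
    return ET.fromstring(stream.read(length))


# Hypothetical records used only for illustration.
records = [
    {"id": "1", "kind": "namespace", "name": "ns"},
    {"id": "2", "kind": "class", "name": "widget", "parent": "1"},
    {"id": "3", "kind": "function", "name": "draw", "parent": "2"},
]
buf = io.BytesIO()
index = write_flat(records, buf)
sym = read_symbol(buf, index, "3")
```

Since each record is self-contained, separate workers could each open the file and parse disjoint sets of offsets. Whether that recovers enough performance in Python is the open question raised above.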
Of course I would love your "markdown-agnostic" solution, but the annoying implementation details get in the way. For example, computation of the "SafeName" for a C++ symbol very much depends on the target Markdown language, because some special characters are valid in some markdowns and not others. In fact the SafeName also has to kind of care about the target filesystem, because it can't use characters which are illegal or special for that filesystem. Our current implementation limits the set of characters in a SafeName to only the subset of characters which are not special on Linux or Windows partitions.
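As a rough illustration of the kind of algorithm involved (this is not the MrDocs implementation; the allow-list and the `_XX` escape scheme are assumptions made for the example), a SafeName could be computed like this:

```python
import re

# Assumed conservative allow-list: ASCII letters, digits and '-'.
# The actual character set used by MrDocs is not given in this thread.
_UNSAFE = re.compile(r"[^A-Za-z0-9-]")


def safe_name(symbol_name: str) -> str:
    """Replace every character outside the allow-list with '_' plus its hex code."""
    return _UNSAFE.sub(lambda m: "_%02X" % ord(m.group()), symbol_name)


example = safe_name("operator<<")
```

The underscore is escaped too, so an escape sequence can't be confused with a literal underscore in the original name. A different target markup or filesystem would need a different allow-list, which is the point made above.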
Furthermore, forming a link to a symbol depends on the target markdown language. It is kind of ugly and for performance reasons best implemented natively. Having a partial / template produce the link would be a mess. The formed link also depends on whether you are doing single-page or multi-page.
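A small sketch of why link formation is target-specific. `make_link` is a hypothetical helper, and the file-naming convention (one file per SafeName) is an assumption. The AsciiDoc forms used are the standard `xref:` macro and the `<<id,text>>` cross reference:

```python
def make_link(text: str, safe_name: str, markup: str, multipage: bool) -> str:
    """Form a link to a symbol for the given markup and page layout.

    Multi-page output links to a per-symbol file; single-page output
    links to an anchor inside the same document.
    """
    if markup == "asciidoc":
        if multipage:
            return f"xref:{safe_name}.adoc[{text}]"
        return f"<<{safe_name},{text}>>"
    if markup == "html":
        href = f"{safe_name}.html" if multipage else f"#{safe_name}"
        # Escaping of `text` is omitted to keep the sketch short.
        return f'<a href="{href}">{text}</a>'
    raise ValueError(f"unsupported markup: {markup}")


single = make_link("parse", "parse", "asciidoc", multipage=False)
multi = make_link("parse", "parse", "asciidoc", multipage=True)
```

Even in this reduced form there are four distinct outputs for two markups, which gives a sense of how quickly the logic would sprawl inside a template partial.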
In other words there are a small handful of markdown-specific algorithms which are best expressed in C++, having access to the in-memory representation of the program's metadata. I appreciate that we explored the XML-only solution, and I think for now we should just finish what we have so that we have uncovered all the unknowns. Get this into the hands of users and some months or years of field experience. And then, after all the rough edges are exposed and smoothed out we can revisit this XML-only idea.
If we make Mr. Docs XML-only, we get a number of benefits:
However there could be downsides:
Some of my rambling:
Maybe our XML output should be close to flat, instead of having nested scopes.
The advantage of making Mr. Docs XML-only is that we can focus on doing one thing and doing it well. We can optimize every step of the extraction for XML, without worrying about bitcode or other representations. And we wouldn't need plugins. This is a tradeoff though, because we have 1. the problem of emitting one huge XML file, and 2. performance questions about converting XML to markdown using a sandboxed language.
And we get to use industry standard components like Jinja or Handlebars. The author of the XML converter has complete freedom to use any tools they want.
I'm ok with emitting a single XML file but I think we need to be smart about the format. We should consider flattening the output so that a child comes after the parent rather than being nested in the parent. We can use node IDs or whatever, the "id" field, to refer to the parent. If we order the entries in the XML output from the top scope down to the most nested scope, then the XML converter can ingest parent scopes incrementally and then decide if it wants to launch additional threads to process the children.
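The converter side of this could look roughly like the sketch below, which streams the flat file with `xml.etree.ElementTree.iterparse` and resolves each `parent` reference against scopes it has already seen. The `symbol` element and its attributes are assumed names, not an agreed format:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical flat output, ordered from the top scope down.
FLAT = b"""<symbols>
  <symbol id="1" kind="namespace" name="ns"/>
  <symbol id="2" kind="class" name="widget" parent="1"/>
  <symbol id="3" kind="function" name="draw" parent="2"/>
</symbols>"""


def qualified_names(stream):
    """Map each symbol id to its qualified name, one record at a time."""
    names = {}
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "symbol":
            continue
        parent = elem.get("parent")
        # Parents are guaranteed to appear earlier, so this lookup succeeds.
        prefix = names[parent] + "::" if parent is not None else ""
        names[elem.get("id")] = prefix + elem.get("name")
        elem.clear()  # release the record once it has been consumed
    return names


result = qualified_names(io.BytesIO(FLAT))
```

Because a parent always precedes its children, the converter never needs a second pass or the whole document in memory. Handing child records to worker threads is left out of the sketch.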