Integrate Docile.jl and Markdown.jl into Base

quinnj commented 9 years ago

Some good discussion started here.

This is to more formally track integrating the necessary parts into Base since it seems some good consensus is building.

@one-more-minute @MichaelHatherly

ViralBShah commented 9 years ago

Also pinging @shashi @dcjones

MichaelHatherly commented 9 years ago

I'd be happy to get going on this. Also pinging @stevengj since he's been the source of some great input here. I'll put together a PR over the next few days.

ViralBShah commented 9 years ago

Also pinging @johnmyleswhite who originally suggested Docile.jl to me as a good starting point. Also @dmbates has been looking for this.

MikeInnes commented 9 years ago

@MichaelHatherly I recommend waiting until I've worked through overhauling Markdown.jl before finalising anything / setting up PRs. There will probably be some technical changes to work out within Docile/Markdown to get the string macros etc. working smoothly.

MichaelHatherly commented 9 years ago

Yes, just saw your break-everything branch. I'll wait on your changes.

JeffBezanson commented 9 years ago

Docile does look quite good.

I'm wondering about

1. Maybe we should add some special syntax to avoid the ->
2. How much can we cut down the dependencies?

For point (2), I feel the system should be as lazy as possible, just populating metadata with strings until interaction and display happen.

MichaelHatherly commented 9 years ago

Special syntax would be nice. The -> is providing the LineNumberNode that I'm using to get file and line numbers for metadata. Would be great to retain that info.

Docile doesn't really have any dependencies, it's just harvesting strings and metadata. Lexicon's providing the presentation layer.

I agree about the laziness -- I don't have any hard numbers, but when I was parsing docstrings during @doc, package loading was quite a bit slower.

jakebolewski commented 9 years ago

Another thing that needs to be hashed out is what non-standard form of markdown we wish to support. Inline latex, tables, and cross references seem necessary.

MichaelHatherly commented 9 years ago

@jakebolewski CommonMark [1, 2] looks reasonably promising. Inline math is a must-have feature -- not sure whether that would be part of the spec though.

[1] https://github.com/MichaelHatherly/Docile.jl/issues/33 [2] http://jgm.github.io/stmd/spec.html

IainNZ commented 9 years ago

HttpServer now uses Docile (thanks to @astrieanna), which could be another interesting case study: https://github.com/JuliaWeb/HttpServer.jl/blob/master/src/HttpServer.jl

MichaelHatherly commented 9 years ago

That's cool, thanks @astrieanna. Guess I can't make breaking changes now!

IainNZ commented 9 years ago

Hah, doesn't stop anyone else ;)

johnmyleswhite commented 9 years ago

I think this is the right way to go. I agree with Jeff's point that special syntax would make Docile nicer to work with.

stevengj commented 9 years ago

CommonMark doesn't have any standard for embedded equations; see this discussion. Pandoc's $...$ + heuristic (opening $ can't be followed by whitespace, closing $ can't be followed by a digit or preceded by whitespace) seems like the most widely used at this point, and is what is used in Jupyter/IJulia.
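As a rough illustration of the heuristic described above, the three rules can be approximated with a single regular expression. This is only a sketch: `find_inline_math` is a hypothetical helper name, and the pattern ignores escaped dollar signs and other cases a real Markdown parser would have to handle.

```julia
# Sketch of the Pandoc-style `$...$` heuristic; `find_inline_math` is a
# hypothetical helper, not part of Markdown.jl or any other package.
function find_inline_math(text::AbstractString)
    # Opening `$` must not be followed by whitespace; closing `$` must not be
    # preceded by whitespace or followed by a digit.
    pattern = r"\$(?!\s)([^$]+?)(?<!\s)\$(?!\d)"
    return [String(m.captures[1]) for m in eachmatch(pattern, text)]
end

math_spans = find_inline_math("Let \$x^2\$ be small")
price_spans = find_inline_math("It costs \$5 and \$10")
```

The second call shows why the heuristic exists: plain dollar amounts are not picked up as math, because the candidate closing `$` is preceded by whitespace and followed by a digit.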

stevengj commented 9 years ago

My understanding in #3988 was always that there would eventually be a special syntax for this; macros are only for prototyping.

MichaelHatherly commented 9 years ago

Was syntax ever agreed upon for this? Something along the lines of:

doc """
...
"""
function foo(x)

end

Where doc is a new keyword whose ending keyword is function, type, immutable etc.

Or just without the doc at all and any unassigned string above a documentable block of code is taken to be a docstring?

StefanKarpinski commented 9 years ago

Jeff and I just talked about this today and a bare string literal in void context followed by a definition seems like the way to go. This should be lowered by the parser something like this:

"`frob(x)` frobs the heck out of `x`."

function frob(x)
  # commence frobbing
end

becomes the moral equivalent of this:

let doc = "`frob(x)` frobs the heck out of `x`."
  if haskey(__DOC__, :frob)
    __DOC__[:frob] *= doc
  else
    __DOC__[:frob] = doc
  end
end

function frob(x)
  # commence frobbing
end

Important points about this approach:

parsing has no side-effects – the construction of the documentation structure still occurs when the code is actually evaluated, not when it is parsed.
each module has its own const __DOC__ = Dict{Symbol,UTF8String} dictionary; this is important for reloading modules.
This ends up just appending all the docs for a given name, including separate doc strings for a single generic function.
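A minimal sketch of the appending behaviour listed above, written for current Julia (so `String` stands in for `UTF8String`). The helper `append_doc!` is hypothetical; it only plays the role of the lowered `let` block and is not what the parser would actually emit.

```julia
# Stand-in for the per-module documentation table from the proposal.
const __DOC__ = Dict{Symbol,String}()

# Hypothetical helper mirroring the lowered `let` block: append if the name
# already has documentation, otherwise create the entry.
function append_doc!(docs::Dict{Symbol,String}, name::Symbol, doc::AbstractString)
    if haskey(docs, name)
        docs[name] *= doc
    else
        docs[name] = doc
    end
    return docs
end

append_doc!(__DOC__, :frob, "`frob(x)` frobs the heck out of `x`.")
frob(x) = x

append_doc!(__DOC__, :frob, "`frob(x, y)` frobs both arguments.")
frob(x, y) = (x, y)
```

After both definitions, `__DOC__[:frob]` holds the two strings joined end to end, which is the "just appending" behaviour of the third point.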

An open issue is how to handle adding methods to functions from other modules. Does the definition go into the current module's __DOC__ dict? What symbol is used for the doc key then?

timholy commented 9 years ago

Just to comment that it's super-exciting to see momentum on this. Looking forward to seeing what emerges.

MichaelHatherly commented 9 years ago

Is using Symbol as the key type a necessary requirement? Does doing this not restrict the kind of things that can be documented -- namely individual Methods of a Function?

If

"`frob(x)` frobs the heck out of `x`."

function frob(x)
  # commence frobbing
end

is instead translated to

function frob(x)
  # commence frobbing
end

let doc = "`frob(x)` frobs the heck out of `x`."
  if haskey(__DOC__, frob)
    __DOC__[frob] *= doc
  else
    __DOC__[frob] = doc
  end
end

then you could use the Function/Method etc as the key instead of a Symbol -- some adjustments to the let-block not shown. Is this approach feasible?

For adding docs to methods that are being extended from those in a different module, I'd be in favour of adding them to the current module's __DOC__. I'd find it a bit odd if the docs I write for a method end up in a different module.

MikeInnes commented 9 years ago

Stefan's proposal looks good to me, but +1 for either being aware of methods properly or being limited to one docstring per function (as opposed to concatenating each successive docstring regardless). Another way to do this might be something like

 __DOC__[:frob][(Int, String...)] = "`frob(x)` frobs the heck out of `x`."

function frob(x::Int, ys::String...)
# ...

i.e. indexing doc strings by type as well as name. Key points in this approach:

1. The redefinition problem is handled at the function level rather than the module level, which means that
   i. Redefining functions/methods works in a sane way, as opposed to endlessly concatenating onto the existing doc string
   ii. This will automatically make reloading modules do the expected thing too, so a module-local __DOC__ isn't necessary to solve that problem (though it might be useful for other reasons)
2. It removes the dependency on the order of definitions. So you could do fancy things like making doc strings for more general methods appear first.

(1.i) is my main concern – redefining functions messing up their own docs is something we could probably live with / work around / ignore, but if we can solve this early it will make for a much better interactive experience, I think.
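To make the redefinition point concrete, here is a small sketch in current Julia of a table indexed by name and then by signature. The names `SIG_DOCS` and `set_doc!` are hypothetical, and a tuple type is used as the signature key because the `(Int, String...)` form is not valid as a dictionary key today.

```julia
# Hypothetical two-level table: function name => (signature type => doc string).
const SIG_DOCS = Dict{Symbol,Dict{Type,String}}()

function set_doc!(name::Symbol, sig::Type, doc::AbstractString)
    per_function = get!(() -> Dict{Type,String}(), SIG_DOCS, name)
    per_function[sig] = doc   # assignment replaces; nothing is concatenated
    return doc
end

frob_sig = Tuple{Int,Vararg{String}}

set_doc!(:frob, frob_sig, "`frob(x)` frobs the heck out of `x`.")
frob(x::Int, ys::String...) = (x, ys)

# Evaluating the same documented definition again overwrites the entry.
set_doc!(:frob, frob_sig, "`frob(x)` frobs the heck out of `x`. (revised)")
```

Because the second call stores into the same key, the table still has a single entry for that signature, which is the behaviour wanted for interactive redefinition.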

stevengj commented 9 years ago

Key problems with this approach:

As discussed elsewhere (e.g. MichaelHatherly/Docile.jl#29), there is no need for documentation objects to be a string; they can be any Julia object with the appropriate writemime methods. e.g. imagine a documentation object like docfromfile("foo.tex"). A doc keyword allows more generality here.
Docile currently allows additional metadata to be stored in the documentation, e.g. doc "frob(x) ..." { :section => "Frobnicators" }
Docile currently allows doc* vs. doc in order to distinguish documentation for a Function in general vs. documentation for a Method.

One possibility would be to make the doc keyword optional for string literals (including string macros like md"..."), but to allow it for more complicated documentation.

shashi commented 9 years ago

Documentation specific to argument signature is definitely better than concatenation, +1 for documentation being anything with a writemime method. A transformation like:

 __DOC__[:frob][(Int, String...)] = () -> "`frob(x)` frobs the heck out of `x`."

function frob(x::Int, ys::String...)
# ...

would also let us evaluate documentation objects only when they are needed. e.g. help(frob) could call the closure and cache the result.
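A sketch of the "call the closure and cache the result" idea, assuming a hypothetical `LazyDoc` wrapper and `render!` function (neither exists in Docile or Base):

```julia
# Hypothetical lazy doc entry: stores a zero-argument closure and caches
# whatever it returns the first time it is needed.
mutable struct LazyDoc
    thunk::Function
    cache::Union{Nothing,String}
end
LazyDoc(thunk::Function) = LazyDoc(thunk, nothing)

function render!(d::LazyDoc)
    if d.cache === nothing
        d.cache = d.thunk()   # evaluated at most once
    end
    return d.cache
end

calls = Ref(0)   # counts how often the closure actually runs
entry = LazyDoc(() -> (calls[] += 1; "`frob(x)` frobs the heck out of `x`."))

first_text = render!(entry)
second_text = render!(entry)
```

Loading the file only builds the closure; the string is produced on the first `render!` and reused afterwards.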

jakebolewski commented 9 years ago

I know that this is probably not a popular opinion but I really think we should consider using Restructured Text at least for the default markup in Base. It supports everything we will want (inline math / code, cross-links, tables, etc.), supports extensions in the standard for functionality we would want to add, and would allow us to reuse all the tooling developed in the Python world (Sphinx, ReadTheDocs, etc.) which imo is the best out there.

Otherwise I see us developing yet another superset of Markdown to support our needs which may or may not be consumable by other tools. I guess if we pick a superset with better tooling support (such as Pandoc markdown with all the extensions) we might be able to mitigate this problem.

JeffBezanson commented 9 years ago

These are good points. Having an API for this is key, as that will allow even more flexibility than a keyword. For fancy documentation needs, use the API instead of the special syntax.

It's probably also true that we'll want to associate docs with particular type signatures.

I think associating arbitrary metadata with every docstring is overengineering at this point. Where we are, we can't even ask for help for a simple function in a package.

StefanKarpinski commented 9 years ago

ReStructured Text is awful. I wrote most of the original manual and writing it in Markdown was a pleasure. Writing documentation has been a painful chore ever since we switched from Markdown to RST. Having complicated formatting types for documentation is overkill and something that we can consider, if at all, only if there's strong evidence of a real need in practice. I don't think there will be any such need. There should be essentially no choice about documentation – the worst possible situation is one where everyone writes docs in their personal favorite format and there are a dozen of them. There should be one reasonable way to write docs that works well and that everyone is familiar with. What we generate during parsing should be simple and easy for the parser to construct – i.e. just strings – and these strings should look decent if you just show them as is. Markdown fits the bill perfectly – it is already (by design) how people intuitively markup plain text content.

stevengj commented 9 years ago

@shashi, as I've discussed in the abovementioned Docile issue, the plan for typical documentation objects (e.g. Markdown text) is to store only the unparsed string when the file is loaded. Parsing of the AST, generation of HTML, etcetera, is only performed "lazily" when the help is requested in some format.

@jakebolewski, the choice of format is orthogonal to this feature if my suggestion is adopted. Markdown documentation would be md"..." (creating a MarkdownString object), Restructured Text would be rst"..." (creating a RestructuredText object), etc. Each would have appropriate writemime methods to generate text/html, text/latex, or whatever. We can argue about what format should be used in Base elsewhere.
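The format-agnostic design above can be sketched in current Julia, where `writemime` has since become `show` with a `MIME` argument. Everything here is an assumption for illustration: the types `MarkdownDoc` and `RstDoc`, and the string macros `mdtext"..."` and `rsttext"..."`, are hypothetical stand-ins for the proposed `md"..."` and `rst"..."`, and the `show` methods only wrap the raw text instead of parsing it.

```julia
# Hypothetical documentation types; only the unparsed source string is stored.
struct MarkdownDoc
    source::String
end

struct RstDoc
    source::String
end

# Hypothetical string macros standing in for the proposed md"..." and rst"...".
macro mdtext_str(s)
    return MarkdownDoc(s)
end

macro rsttext_str(s)
    return RstDoc(s)
end

# Conversion happens only when a given output format is requested.
# A real implementation would parse the markup here; this sketch just wraps it.
Base.show(io::IO, ::MIME"text/html", d::MarkdownDoc) =
    print(io, "<div class=\"markdown\">", d.source, "</div>")
Base.show(io::IO, ::MIME"text/html", d::RstDoc) =
    print(io, "<div class=\"rst\">", d.source, "</div>")

doc = mdtext"A *Markdown* comment."
html = repr("text/html", doc)
```

The presentation layer only needs to know that the object has a `show` method for the requested MIME type, so the choice of markup stays independent of how documentation is attached.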

stevengj commented 9 years ago

@JeffBezanson, we absolutely have to have some kind of metadata if you want to have any possibility of generating offline documentation, because you can't just have a long list of 3000 functions in Base, sorted alphabetically. At the very least, you have to be able to mark what section and subsection of the manual they should appear in.

StefanKarpinski commented 9 years ago

Let's cross that bridge when we get there.

stevengj commented 9 years ago

@StefanKarpinski, the antecedent of "that" in your comment is unclear.

StefanKarpinski commented 9 years ago

I meant the issue of generating offline documentation. Alphabetical listings of documentation with a hierarchy implied by modules is what Java uses, and while that's not amazing, it does work. Indicating how to structure the presentation of docstrings seems like something that could easily be done by providing an external outline that references the objects to be documented in the desired organization.

ViralBShah commented 9 years ago

We already have our functions well separated in modules and if this forces us to do some more refactoring, that is not necessarily bad.

stevengj commented 9 years ago

@StefanKarpinski, we are "there" if we want to use this in Base, replacing our current RST documentation, because we need an offline manual. It would be shortsighted to implement a feature that doesn't satisfy our own immediate needs!

Hierarchies implied only by modules seem unacceptable to me, because most Julia modules have more functions than you would just want to list alphabetically ... our methods aren't nicely sorted into big class hierarchies like in Java, so the Java experience isn't a good analogue here.

@ViralBShah, do you really want to separate Base into zillions of submodules? Anything more than a dozen or so methods I would want to start grouping into subsections in a decent manual, and that would correspond to 100+ modules.

jakebolewski commented 9 years ago

@StefanKarpinski if it is so awful why did you switch? I'm assuming you wanted to use the tooling which is kind of case in point.

stevengj commented 9 years ago

Regarding redefining functions & concatenating strings, the way Docile does it (and the way I originally proposed) is that the documentation (DOC[x] or whatever) is keyed by the Function for the generic documentation, and by Method for method-specific documentation, and in each case stores arbitrary objects. Asking for help(myfunction) would normally give you the generic documentation followed by a list of method-specific docs for methods(myfunction).

This way, redefining functions doesn't concatenate documentation.
Presentation systems (e.g. help(f) or offline docs) have more flexibility in how they order things. e.g. they don't have to sort methods by the order they happened to be defined in, but can instead sort them by their type signatures or whatever. And they can put different methods into a bulleted list or whatever format is desired.
String concatenation is not appropriate anyway if documentation is an arbitrary object, e.g. rst"..." or md"...".
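A sketch of this keying scheme in current Julia, assuming a hypothetical table `DOC` and helper `help_entries`; it is not how Docile actually stores things, only an illustration of using the `Function` and its `Method` objects as keys.

```julia
# Hypothetical table keyed by object identity: a Function for generic docs,
# a Method for method-specific docs. Values can be arbitrary objects.
const DOC = IdDict{Any,Any}()

frob(x::Int) = x + 1
frob(x::String) = x * "!"

DOC[frob] = "`frob` frobs its argument."
DOC[which(frob, Tuple{Int})] = "`frob(x::Int)` adds one."

# Generic documentation first, then any method-specific docs that exist.
function help_entries(f)
    entries = Any[]
    haskey(DOC, f) && push!(entries, DOC[f])
    for m in methods(f)
        haskey(DOC, m) && push!(entries, DOC[m])
    end
    return entries
end

entries = help_entries(frob)
```

Since the entries are collected at display time, a presentation layer is free to reorder or reformat them rather than being tied to definition order.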

JeffBezanson commented 9 years ago

ReST was not really our choice; I believe @nolta just did the work and it was a very solid improvement at the time. I would be much happier if we could fix the infamous "extra newline" bug. But I agree there is something to be said for a format that's already designed for this very purpose.

Can anybody comment on how python or other languages deal with metadata for docstrings? I'm not 100% opposed to it, but I think it is necessarily an extension of a simpler feature. For example, we are likely to support

"doc string"
f(x) = x

anyway; metadata involves further decorations of that syntax that can be optional.

StefanKarpinski commented 9 years ago

@jakebolewski: Honestly, I didn't want to switch, but someone (@nolta, iirc) had already done the work and it was an improvement, so we just went with it. Read the docs is nice, but I still hate RST and I'm not that thrilled with the rest of the tooling around RST (random newlines anyone?). If we choose RST as a format, we're stuck with it. If we choose a format we like, we can build all the tooling we need.

johnmyleswhite commented 9 years ago

I'm confused about how the void string idea is going to work in the REPL. If I type a string into the REPL, then hit enter, what happens?

StefanKarpinski commented 9 years ago

@johnmyleswhite: Nothing – it has to be a string in void context followed by something that it can be attached to in the same input.

johnmyleswhite commented 9 years ago

So it will matter whether I execute string + code at two prompts or string at prompt, then code at prompt?

dcjones commented 9 years ago

Isn't metadata something that can be handled within the docstring? Pandoc and Jekyll for example both support YAML front matter in markdown docs to attach arbitrary metadata. We also already have pure Julia Markdown and YAML parsers.
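For illustration, separating a front-matter block from the rest of a docstring needs very little code. The sketch below assumes the block is delimited by `---` lines and returns the metadata as raw text; `split_front_matter` is a hypothetical helper, and parsing the YAML itself is left to a YAML package.

```julia
# Hypothetical helper: split a docstring into (front matter, body).
# Returns an empty front matter when the docstring does not start with `---`.
function split_front_matter(doc::AbstractString)
    lines = split(doc, '\n')
    if strip(lines[1]) == "---"
        close_idx = findnext(l -> strip(l) == "---", lines, 2)
        if close_idx !== nothing
            meta = join(lines[2:close_idx-1], '\n')
            body = join(lines[close_idx+1:end], '\n')
            return meta, body
        end
    end
    return "", String(doc)
end

docstring = "---\nsection: Math\nsubsection: Special functions\n---\nDocs for `besselj`."
meta, body = split_front_matter(docstring)
```

Note that even this cheap split has to read each docstring before the metadata is available.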

stevengj commented 9 years ago

The lack of structured information in Python has been a longstanding problem, as I understand it.

Syntax-wise, I would suggest something like the following:

"a text/plain comment"
f(x) = x

md"A *Markdown* comment."
g(x) = x

doc md"A *Markdown* comment with metadata" { :section => "Math", :subsection => "Special functions" }
besselj(m,x) = ...

const specfuns = { :section => "Math", :subsection => "Special functions" }
doc md"Another *Markdown* comment with predefined metadata." specfuns
bessely(m,x) = ...

That is, you would use the doc keyword for anything more complex than a string literal (or string macro). e.g. for generic documentation objects (not strings), or to provide metadata (a Dict of some kind) which could be a variable as in the last example (to share metadata for several related functions).

StefanKarpinski commented 9 years ago

+1 to metadata inside the doc string.

StefanKarpinski commented 9 years ago

So it will matter whether I execute string + code at two prompts or string at prompt, then code at prompt?

Do you anticipate entering a lot of doc strings at the prompt?

jakebolewski commented 9 years ago

Metadata inside the docstring implies that construction cannot happen lazily as you would have to parse all docstrings (or eval if they are objects) before you could properly organize them.

stevengj commented 9 years ago

Problems with putting the metadata inside the docstring:

It requires us to have "Julia-flavored" markdown (or whatever) with our own magic metadata markers.
It requires us to specify a docstring format, rather than separating format from metadata, and relying on writemime to convert arbitrary objects to output formats.
It makes it hard to share metadata ... e.g. you will often have several methods with the same metadata (e.g. they are all in the section "Mathematical functions" and the subsection "Special functions" as in my example above), and it would be a lot nicer to not have to retype this in each docstring.

StefanKarpinski commented 9 years ago

Package loading is already a performance problem – constructing lots of dicts during parsing is going to make it way worse.

jakebolewski commented 9 years ago

Metadata could be a list of pairs => and then the overhead would be smaller.

StefanKarpinski commented 9 years ago

I have zero problem with there being a Julia-flavored markdown format. I suspect it is inevitable. We should try to make sure that it matches the IJulia-flavored markdown as much as possible.

dcjones commented 9 years ago

It requires us to have "Julia-flavored" markdown (or whatever) with our own magic metadata markers.

YAML has standard document begin/end markers.

It requires us to specify a docstring format, rather than separating format from metadata.

I don't see a standardized docstring format as a bad thing.

It makes it hard to share metadata

YAML supports references. So you could write the full metadata once, tag it, then reference it in other docstrings.

StefanKarpinski commented 9 years ago

I think that having the organization of how doc strings are presented be external to the code and docs is a better approach. I.e. you have an outline where you refer to entities that have doc strings and a tool that weaves this together into HTML pages that can go online. Trying to cram all of the information needed to weave individual doc strings into a coherent final document just isn't going to work well. The doc strings provide bits of content that can be either consumed individually from the REPL or reused when putting together complete documentation. Putting all the organization and metadata into the doc strings is like a worse version of literate programming, which itself hasn't panned out that well.
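A small sketch of the external-outline idea, under the assumption that doc strings live in a table keyed by name and the outline is a plain list of section titles paired with the names they cover. The names `docs`, `outline`, and `weave` are hypothetical, and the output is Markdown-style text rather than HTML to keep the example short.

```julia
# Doc strings as they might be collected from code (hypothetical content).
docs = Dict(
    :besselj => "Docs for `besselj`.",
    :bessely => "Docs for `bessely`.",
)

# The outline lives outside the code and only refers to documented names.
outline = [
    "Math" => [:besselj, :bessely],
    "Frobnicators" => [:frob],
]

# Weave the outline and the doc strings into one document.
function weave(outline, docs)
    io = IOBuffer()
    for (section, names) in outline
        println(io, "# ", section)
        for name in names
            println(io, "## ", name)
            println(io, get(docs, name, "(undocumented)"))
        end
    end
    return String(take!(io))
end

manual = weave(outline, docs)
```

The doc strings themselves carry no section information here; reorganizing the manual means editing only the outline.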

JuliaLang / julia

Integrate Docile.jl and Markdown.jl into Base #8514