executablebooks / MyST-Parser

An extended commonmark compliant parser, with bridges to docutils/sphinx
https://myst-parser.readthedocs.io
MIT License

`## Heading 2` produces `<h1>Heading 2</h1>` #901

Open paugier opened 3 months ago

paugier commented 3 months ago

The level of the first heading is taken as <h1> so that

### Heading 3
## Heading 2
#### Heading 4

produces (with myst_suppress_warnings: ['myst.header'])

<section id="heading-3">
<h1>Heading 3</h1>
</section>
<section id="heading-2">
<h1>Heading 2</h1>
<section id="heading-4">
<h2>Heading 4</h2>
</section>
</section>

For the Pelican plugin https://github.com/ashwinvis/myst-reader, we would need ## -> <h2>.

Is it already possible to obtain this behavior with an option?

paugier commented 3 months ago

Note that this strange behavior is not reproduced in the sandbox provided here https://mystmd.org/, for which we get the more reasonable output

<h3>Heading 3</h3>
<h2>Heading 2</h2>
<h4>Heading 4</h4>

Therefore I think it can be considered as a real bug of MyST-Parser...

chrisjsewell commented 3 months ago

Hey @paugier, this is not trivially possible without upstream modifications to docutils/sphinx.

docutils stores documentation in a nested AST:

# H1
## H2
### H3

creates

<section>
    H1
    <section>
        H2
        <section>
            H3

No information on the # number is actually stored or used by the docutils/sphinx HTML writers (since restructuredtext does not have this concept)

If you wanted to start at greater than #, like:

## H2
### H3
#### H4

Then it will simply get stored as:

<section>
    H2
    <section>
        H3
        <section>
            H4
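To make the point concrete, here is a minimal sketch in plain Python of nesting by relative level. It is a simplified model, not the actual myst-parser or docutils code, and the names `nest_headings` and `render` are made up for illustration. Only the nesting depth survives, so the writer can only number headings by depth.

```python
def nest_headings(headings):
    """Return (title, depth) pairs, where depth is the section nesting depth.

    `headings` is a list of (hash_count, title) tuples in document order.
    This is a simplified model (an assumption), not the real parser logic.
    """
    stack = []   # hash counts of the currently open sections
    result = []
    for hashes, title in headings:
        # Close every open section that cannot be an ancestor of this heading.
        while stack and stack[-1] >= hashes:
            stack.pop()
        stack.append(hashes)
        # The hash count is dropped here; only the depth is kept.
        result.append((title, len(stack)))
    return result


def render(headings):
    """Render headings the way a depth-based HTML writer would."""
    return [f"<h{depth}>{title}</h{depth}>" for title, depth in nest_headings(headings)]


print(render([(2, "H2"), (3, "H3"), (4, "H4")]))
print(render([(3, "Heading 3"), (2, "Heading 2"), (4, "Heading 4")]))
```

In this model the second call reproduces the output from the original report: both `### Heading 3` and `## Heading 2` end up at depth 1.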

With docutils, the HTML writer has options for this: rst2html5 --no-doc-title --initial-header-level=2, which you could then use to "retrieve" the original heading levels.
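As a rough illustration of what an initial header level does, the sketch below renders an already nested tree with a configurable starting level. It models the idea only and does not call docutils; `render_tree` and the tuple-based tree are hypothetical, and capping at `<h6>` is an assumption made here to keep the output valid HTML.

```python
def render_tree(sections, initial_header_level=1, depth=0):
    """Render nested (title, children) tuples as a flat list of heading tags.

    The heading number is the nesting depth plus a fixed offset, which is
    the only information a depth-based writer has available.
    """
    out = []
    for title, children in sections:
        level = min(initial_header_level + depth, 6)  # assumption: cap at <h6>
        out.append(f"<h{level}>{title}</h{level}>")
        out.extend(render_tree(children, initial_header_level, depth + 1))
    return out


tree = [("H2", [("H3", [("H4", [])])])]
print(render_tree(tree))
print(render_tree(tree, initial_header_level=2))
```

A fixed offset only recovers the source levels when the document starts at `##` and never skips a level. For the tree from the original report, `[("Heading 3", []), ("Heading 2", [("Heading 4", [])])]`, an offset of 2 would still number the first two headings identically.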

Alternatively, you could modify myst-parser, to capture the # count, e.g.

<section level=2>
    H2
    <section level=3>
        H3
        <section level=4>
            H4

but then you would also need to modify the docutils/sphinx writer(s) to utilise this attribute.
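A sketch of that idea, assuming a hypothetical `level` attribute on each section and a writer that reads it instead of counting depth. Neither exists in docutils or Sphinx today as far as this thread indicates, and `Section` and `render_with_level` are illustrative names only.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int  # number of '#' in the source heading (hypothetical attribute)
    children: list = field(default_factory=list)


def render_with_level(sections):
    """Render headings from the stored level rather than the nesting depth."""
    out = []
    for section in sections:
        out.append(f"<h{section.level}>{section.title}</h{section.level}>")
        out.extend(render_with_level(section.children))
    return out


doc = [Section("H2", 2, [Section("H3", 3, [Section("H4", 4)])])]
print(render_with_level(doc))
```

The parser side is the easy half in this model; the cost is that every writer that emits headings would have to honour the attribute.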

For your case though, it is even more problematic: what you may need is for myst-parser to create "phantom" sections, to allow for the correct nesting:

### Heading 3
## Heading 2
#### Heading 4

<section phantom>
    <section phantom>
        <section>
            Heading 3
    <section>
        Heading 2
        <section phantom>
            <section>
                Heading 4

You would then need to modify docutils/sphinx, to handle these phantom sections, e.g. skipping them in the HTML creation
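The phantom-section idea can be sketched as follows. This is a toy model under stated assumptions: `Node`, `build_with_phantoms` and `render_skipping_phantoms` are hypothetical names, and nothing here reflects existing myst-parser, docutils or Sphinx code. Skipped levels are padded with phantom nodes so that depth equals the `#` count, and the renderer emits nothing for phantoms.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str = ""
    phantom: bool = False
    children: list = field(default_factory=list)


def build_with_phantoms(headings):
    """Build a tree in which a heading with N hashes always sits at depth N."""
    root = Node(phantom=True)  # stands in for the document itself
    path = [root]              # path[i] is the open node at depth i
    for hashes, title in headings:
        del path[hashes:]      # close sections at this depth or deeper
        while len(path) < hashes:
            # Pad each skipped level with a phantom section.
            filler = Node(phantom=True)
            path[-1].children.append(filler)
            path.append(filler)
        node = Node(title=title)
        path[-1].children.append(node)
        path.append(node)
    return root


def render_skipping_phantoms(node, depth=0):
    """Emit headings by depth, producing no output for phantom nodes."""
    out = []
    if not node.phantom:
        out.append(f"<h{depth}>{node.title}</h{depth}>")
    for child in node.children:
        out.extend(render_skipping_phantoms(child, depth + 1))
    return out


tree = build_with_phantoms([(3, "Heading 3"), (2, "Heading 2"), (4, "Heading 4")])
print(render_skipping_phantoms(tree))
```

With the three headings from the report, this model yields `<h3>`, `<h2>`, `<h4>`, at the price of extra nodes that every consumer of the tree would have to know how to ignore.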

> is not reproduced in the sandbox provided here [...] Therefore I think it can be considered as a real bug of MyST-Parser...

mystmd is a separate entity to myst-parser 🙃

chrisjsewell commented 3 months ago

See also https://github.com/jgm/djot/issues/294, I was interested in what they have to say on it.

paugier commented 3 months ago

Thanks @chrisjsewell for your explanation. So it is not as simple as I thought. For the Pelican plugin using myst (myst-reader), the most important issue is that most of the time the first heading uses ## (since the title is in the frontmatter).

I think we already use --initial-header-level=2 for docutils but we clearly have an issue with Sphinx.

Then, the question is which package/tool should be modified to fix our issue. I guess the simplest quick-and-dirty solution for myst-reader is to produce the HTML with Sphinx and do a few replacements when we detect it's necessary.
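Such a replacement step could look roughly like the sketch below, which shifts every heading tag in an HTML string by a fixed offset using a regular expression. It is only an illustration: `shift_headings` is a made-up helper, not part of myst-reader, and a regex is a blunt tool that would also rewrite heading-like text inside code listings or comments.

```python
import re

# Matches the start of an opening or closing <h1>..<h6> tag.
_HEADING_TAG = re.compile(r"<(/?)h([1-6])\b")


def shift_headings(html, offset=1):
    """Return `html` with every heading level increased by `offset` (max h6)."""
    def bump(match):
        level = min(int(match.group(2)) + offset, 6)
        return f"<{match.group(1)}h{level}"
    return _HEADING_TAG.sub(bump, html)


html = (
    '<section id="heading-2"><h1>Heading 2</h1>'
    '<section id="heading-4"><h2>Heading 4</h2></section></section>'
)
print(shift_headings(html, offset=1))
```

A uniform shift covers the common case where the first heading is `##`, but it cannot restore documents whose headings skip or reverse levels, since that information is already gone from the HTML.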

But we could also ask Sphinx to implement --initial-header-level=2, even if they don't need it for their standard usage. Alternatively, <section level=2> would be the cleanest solution, but I guess it requires quite a lot of changes.

I would be interested to know what you think, @chrisjsewell.

Side question: do you plan to have a myst -> html converter independent of docutils and sphinx, but that would support a few important (for myst) sphinx extensions? It would be great!