Abbreviation processing order alternative

nbanyan commented 1 month ago

Currently the abbr extension uses the last definition of an abbreviation. In MkDocs with the pymdownx.snippets extension, a glossary can be auto-appended to Markdown pages.

The problem is that this prevents pages from overriding abbreviations from the glossary. To circumvent this, I've been manually adding the glossary snippet to the top of all pages. But this requires some page specific abbreviations to be declared before the snippet and some after. As the lists grow, it becomes easy to get the ordering wrong, which leaves some abbreviations incorrect.

Example of the currently required order:

In glossary.md:

*[process]: process definition one
*[ABC]: Abbreviation definition one
*[XYZ]: Abbreviation definition two (newly added)

In page.md:

*[specialized process]: process definition specialization
--8<-- "glossary.md"
*[XYZ-A]: Unrealized Abbreviation due to change in glossary.md
*[ABC]: Abbreviation definition one redefinition

'specialized process' must be defined before the snippet so that the entire phrase is matched before 'process' is applied.
'ABC' must be defined after the snippet; otherwise it will be overridden by the glossary definition.
'XYZ-A' is no longer applied, since the snippet now defines and applies 'XYZ' first.

Proposals:

Option 1: An option to have the abbr extension only keep/process the first instance of an abbreviation instead of the last.
Option 2: An option to sort the abbreviation list to process 'superset' abbreviations first to transparently eliminate the issues with terms like the 'specialized process' and 'XYZ-A' in the example.

Both options would allow effective use of auto-append and eliminate the need to analyze how abbreviations are processed every time one is written, and to reassess and adjust the ordering in pages whenever a new term is added to the glossary.

The second option would require more processing time, but would prevent issues (from Option 1) with a page declaring an abbreviation that breaks an abbreviation in an auto-appended glossary.
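
As a rough illustration of Option 2 (not the abbr extension's actual code), the sketch below wraps abbreviations using a single regular-expression alternation. The names `wrap_abbreviations` and `longest_first` are hypothetical. It relies on one observation: any 'superset' term is longer than the term it contains, so sorting by length, longest first, is enough to put supersets ahead.

```python
import html
import re


def wrap_abbreviations(text, abbrs, longest_first=False):
    """Wrap known abbreviations in <abbr> tags.

    `abbrs` maps abbreviation -> definition. Alternatives are tried in
    list order, so the order of `keys` decides which term wins when one
    term is a prefix of another.
    """
    if not abbrs:
        return text
    keys = list(abbrs)  # definition order by default
    if longest_first:
        keys.sort(key=len, reverse=True)  # supersets are always longer
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")

    def replace(match):
        term = match.group(1)
        title = html.escape(abbrs[term], quote=True)
        return '<abbr title="{}">{}</abbr>'.format(title, term)

    return pattern.sub(replace, text)


# 'XYZ' arrives first (from the glossary), 'XYZ-A' later (from the page).
abbrs = {
    "XYZ": "Abbreviation definition two",
    "XYZ-A": "Page-specific variant",
}
text = "Use XYZ-A here."

default_order = wrap_abbreviations(text, abbrs)
sorted_order = wrap_abbreviations(text, abbrs, longest_first=True)
```

In definition order only the `XYZ` part of `XYZ-A` is wrapped; with `longest_first=True` the whole term is wrapped. The extra cost is one sort per page.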

waylan commented 1 month ago

this requires some page specific abbreviations to be declared before the snippet and some after.

Personally, this is exactly how I would expect to solve this issue, which means we don't need to make any changes. It is not clear to me how this is resulting in incorrect abbreviations. Perhaps an example would help.

Or are you saying that sometimes you get the order wrong and that results in the errors? If that is the case, then I don't see how your suggested changes would help. You would still need to ensure correct order, even if that order would be different.

In any event, previously, Option 1 of your proposal would have been impractical. However, with the recent refactor of the extension in #1461 (not yet released), it would be relatively easy to implement as there is now a global collection of abbreviations stored as a dict (just check if the entry already exists and skip writing if it does). However, Option 2 would be impossible as a dict cannot have two entries with the same key (the abbreviation is used as the key in the collection).
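
The "check if the entry already exists and skip writing" idea can be sketched in a few lines. This is only an illustration under assumptions: `register_abbr` and its `use_first` flag are hypothetical names, and a plain dict stands in for the extension's collection, whose real attribute name is not given in this thread.

```python
def register_abbr(abbrs, abbr, title, use_first=False):
    """Store a definition in `abbrs` (abbreviation -> definition).

    With use_first=False a later definition replaces an earlier one.
    With use_first=True an existing entry is kept and the new one is
    skipped. Returns True if the definition was stored.
    """
    if use_first and abbr in abbrs:
        return False  # keep the earlier definition
    abbrs[abbr] = title
    return True


# A page defines ABC, then an auto-appended glossary defines it again.
definitions = [
    ("ABC", "Abbreviation definition one redefinition"),  # page
    ("ABC", "Abbreviation definition one"),               # glossary
]

last_wins = {}
first_wins = {}
for abbr, title in definitions:
    register_abbr(last_wins, abbr, title)
    register_abbr(first_wins, abbr, title, use_first=True)
```

With `use_first=True` the page's definition survives the appended glossary; with the default the glossary overrides it.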

Personally, I would think that it would make more sense for the proposals to apply to abbreviations being defined programmatically (in Python) rather than by including snippets from other files within the Markdown source. In fact, the recent refactor was implemented with that in mind. It would be relatively simple to add a config option which accepted a dict of predefined abbreviations.

Regardless, personally I have no need for any of this and am not inclined to spend any time working on it. However, if someone else wants to, I would be willing to review a PR.

nbanyan commented 1 month ago

Here's an MkDocs example I've put together discussing a use case for use_first_abbr and options for how to define a glossary for abbr. In my opinion, if abbr can use a global glossary then use_first_abbr would be redundant since its purpose is to allow extensions like pymdownx.snippet to provide a glossary while still letting pages define their own abbreviations.

abbr_mkdocs.zip

waylan commented 1 month ago

I'm sorry, but I don't download files from random people. You need to lay out a simple example (much less complex than anything which requires MkDocs) here in code blocks. You need to make a case for why using the first definition solves a problem that cannot be solved using the last definition. And you need to provide the explanation. Don't assume I will look at your example and suddenly get it.

nbanyan commented 1 month ago

Unfortunately, the use case is applying a glossary in MkDocs.

use_first_abbr

PyMdown Extensions has a Snippet extension for inserting sections of files into Markdown. It provides an auto_append option which can paste a glossary file to the end of every page.

It is for this scenario that I proposed the use_first_abbr option so that a page with an auto-appended glossary would still be able to use its own definitions.

markdown_extensions:
  - abbr:
      use_first_abbr: True
  - pymdownx.snippets:
      base_path:
        - docs
      auto_append:
        - glossary.md
      check_paths: true

glossary

I coded the glossary option to be a direct replacement for the auto_append+use_first_abbr combination. I left both options in the PR because I implemented use_first_abbr first, and as a simple alternative if glossary is to be put in the backlog for later. Using the Markdown syntax for the glossary file seemed natural to me from the MkDocs environment.

The only way I can think of to provide a raw object to abbr through MkDocs is by having the glossary be a JSON dictionary in mkdocs.yml, which looks and feels very cumbersome. Especially when the list grows past a couple hundred abbreviations.

markdown_extensions:
  - abbr:
      glossary:
        {
          "CNO" : "Chief of Naval Operations",
          "TCP" : "Telemetry & Command Program",
          "HTML" : "Hyper Text Markup Language",
          "W3C" : "World Wide Web Consortium",
          "DOD" : "Department of Defense"
        }
  - pymdownx.snippets:
      base_path:
        - docs
      check_paths: true
  - pymdownx.superfences

A possible remediation would be to have mkdocs.yml inherit from a .yml dedicated to the abbr extension, but that requires changing markdown_extensions to be a dictionary instead of a list.

In abbr.yml:

markdown_extensions:
  abbr:
    glossary:
      {
        "CNO" : "Chief of Naval Operations",
        "TCP" : "Telemetry & Command Program",
        "HTML" : "Hyper Text Markup Language",
        "W3C" : "World Wide Web Consortium",
        "DOD" : "Department of Defense"
      }

In mkdocs.yml:

INHERIT: ../abbr.yml
site_name: ...
...
markdown_extensions:
  pymdownx.snippets:
    base_path:
      - docs
    check_paths: true
  pymdownx.superfences: {}

Perhaps adding a json_glossary option to accept a dictionary directly and expanding the glossary option to also accept a file object would streamline this from both the Python and MkDocs perspectives?
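
One way to picture an option that accepts either form is a small normalizing helper. The sketch below is hypothetical: `load_glossary` is not part of any library, and it assumes the file object contains a JSON object, which is only one of the formats discussed here.

```python
import io
import json
from collections.abc import Mapping


def load_glossary(source):
    """Return a plain dict of abbreviation -> definition.

    `source` may be a mapping (used as-is) or a readable file object
    whose content is a JSON object (an assumption of this sketch).
    """
    if isinstance(source, Mapping):
        data = dict(source)
    else:
        data = json.load(source)  # works with any object that has .read()
    if not isinstance(data, dict):
        raise TypeError("glossary must be a mapping of abbreviation to definition")
    return {str(abbr): str(title) for abbr, title in data.items()}


from_dict = load_glossary({"CNO": "Chief of Naval Operations"})
from_file = load_glossary(io.StringIO('{"W3C": "World Wide Web Consortium"}'))
```

Either call yields the same kind of dict, so the code consuming the glossary would not need to know where it came from.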

waylan commented 1 month ago

PyMdown Extensions has a Snippet extension for inserting sections of files into Markdown. It provides an auto_append option which can paste a glossary file to the end of every page.

This was the detail I was missing. You are using an extension which only ever includes an external file by appending it to the end. My assumption was that you were using one of the "include" extensions, which would require you to explicitly "include" the file at a specific location within the document. In that case, you could define some abbreviations before the "include" and others after. Of course, that would require explicitly defining the include in every document, whereas auto_append allows you to define the include once for all documents.

That said, why does that extension need to only support "append"? Could it maybe also provide an auto_prepend option which would insert the content at the top of a page? I'm not saying that this is the only solution, but it is a possible solution that might be worth exploring. @facelessuser, the Snippet extension is yours; do you have any thoughts on this?

There is one other point I want to make before moving on. All of the issues you have raised could be applied to direct use of the Markdown library without MkDocs. The PyMdown Extensions are Markdown extensions and work with this library directly. There is no need to involve MkDocs. True, if we remove MkDocs from the mix, using a more traditional explicit "include" on a single page works well enough. It is only when we have a large collection of pages that MkDocs is a factor. That said, we also have users who have built their own custom system which uses our library and operates on a large collection of pages. Therefore, removing MkDocs from the discussion does not weaken your point. But it does help to focus the discussion.

I coded the glossary option to be a direct replacement for the auto_append+use_first_abbr combination. I left both options in the PR because I implemented use_first_abbr first, and as a simple alternative if glossary is to be put in the backlog for later.

Ah, that clears up a few things for me. I was confused why you needed both. Apparently you don't.

The only way I can think of to provide a raw object to abbr through MkDocs is by having the glossary be a JSON dictionary in mkdocs.yml, ...

This is exactly the sort of thing I had in mind. Although, I'm thinking about it from the perspective of the Python API. When calling markdown from Python, how should the collection of abbreviations be passed in? Translating that to MkDocs is something that comes later.

...which looks and feels very cumbersome. Especially when the list grows past a couple hundred abbreviations.

And here is the problem with focusing on MkDocs. A "couple hundred abbreviations" is an unusual need. I don't think that that is a need that we should expect to address in the standard library. I would expect that you would need to write some custom code to retrieve them and format them into something that can be passed to Markdown. What that looks like for MkDocs is likely going to be different than what that looks like for some other environment, and is beyond the scope of this project.

If you notice, all file reading code in our library is contained in one or two methods. And the documentation is clear that those methods only cover a few common cases. If they don't meet a user's needs, then that user is expected to write their own file reading code. The same principle should apply to acquiring a collection of abbreviations. We can provide a Python API which accepts a Python object. That covers the common case. For the more advanced case (such as yours), the user will need to provide their own solution to retrieve their collection and convert it into the appropriate format accepted by the API.

waylan commented 1 month ago

We can provide a Python API which accepts a Python object. That covers the common case. For the more advanced case (such as yours), the user will need to provide their own solution to retrieve their collection and convert it into the appropriate format accepted by the API.

I realize that that presents a problem for you. As you note, there is no easy way to do this from the MkDocs config directly. I think this is probably a good candidate for a third-party extension to address. I can imagine a few ways to address this. You could use some include extension which gives you the flexibility to include at the location you want. You could create a custom extension whose only purpose is to read a glossary file and then update the dict of abbreviations on the abbr extension. Or you could fork the abbr extension and maintain your own fork which does whatever you want.
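
For the custom-extension route, most of the work is parsing the glossary file. The sketch below covers only that part and assumes the glossary uses the `*[ABBR]: definition` syntax shown earlier in this thread. `parse_glossary` and `read_glossary` are hypothetical helpers; hooking the result into the abbr extension's dict is left out because the attribute name is not given here.

```python
import re
from pathlib import Path

# One definition per line, e.g. "*[ABC]: Abbreviation definition one".
ABBR_DEF = re.compile(
    r"^\*\[(?P<abbr>[^\]]+)\]:[ \t]*(?P<title>.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_glossary(text):
    """Return abbreviation -> definition for every definition line in `text`.

    Lines that are not definitions are ignored. If an abbreviation is
    defined twice, the later line replaces the earlier one.
    """
    return {
        match.group("abbr"): match.group("title").strip()
        for match in ABBR_DEF.finditer(text)
    }


def read_glossary(path, encoding="utf-8"):
    """Read a glossary file from disk and parse it."""
    return parse_glossary(Path(path).read_text(encoding=encoding))


sample = (
    "*[process]: process definition one\n"
    "*[ABC]: Abbreviation definition one\n"
    "This line is not a definition.\n"
)
parsed = parse_glossary(sample)
```

The resulting dict could then be merged into whatever collection the extension exposes, for example with `dict.update`.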

facelessuser commented 1 month ago

Snippets has an auto-append feature, but the default requires you to insert at a specific location. So it does not require append; this just seems to be the user's preference.

nbanyan commented 1 month ago

I did propose an auto-prepend for Snippets, but how to do that without disturbing frontmatter would be an issue.
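
To make the front matter concern concrete, here is a minimal sketch of a prepend that skips a leading YAML front matter block. It is not how Snippets works; `prepend_after_front_matter` is a hypothetical helper, and the detection is deliberately naive (a page that opens with a `---` horizontal rule would be misread as front matter).

```python
import re

# A leading block delimited by '---' lines (closing '---' or '...').
FRONT_MATTER = re.compile(
    r"\A---[ \t]*\r?\n.*?\r?\n(?:---|\.\.\.)[ \t]*(?:\r?\n|\Z)",
    re.DOTALL,
)


def prepend_after_front_matter(page, snippet):
    """Insert `snippet` at the top of `page`, after any front matter."""
    match = FRONT_MATTER.match(page)
    head = match.group(0) if match else ""
    body = page[len(head):].lstrip("\n")
    block = snippet.strip("\n") + "\n\n"
    if head:
        if not head.endswith("\n"):
            head += "\n"
        return head + "\n" + block + body
    return block + body


snippet = "*[ABC]: Abbreviation definition one\n"
with_meta = prepend_after_front_matter(
    "---\ntitle: Demo\n---\n\nBody text with ABC.\n", snippet
)
without_meta = prepend_after_front_matter("Body text with ABC.\n", snippet)
```

The front matter stays at the very top in the first case, and the glossary is simply placed first in the second.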

waylan commented 1 month ago

Amusingly, MkDocs removes the frontmatter before passing the page content to Markdown, so this would not be an issue in MkDocs. However, it certainly could be a blocker in many other contexts.

Python-Markdown / markdown

Abbreviation processing order alternative #1465
