Bug: document structure (sidebar) does not show certain items.

sebastiaanfranken commented 7 months ago

The document structure (sidebar, F2) does not show an item if it is written like this (for example):

\section{The \textit{quick settings} menu}

If you remove the nested LaTeX command, the item shows up again in the document structure sidebar.

cvfosammmm commented 7 months ago

Thanks for reporting this. I don't see this as a bug but rather a missing feature. Maybe it would be best if we could extract the plain text in such cases, which might not be easy and might require radical changes to the latex parser.

sebastiaanfranken commented 7 months ago

Just the plain text would be fine. I don't think you'd want markup in the sidebar when looking at the document structure. Other tools (LibreOffice, MS Word, etc.) don't show it either. Having markup in there would be distracting, I think.

Does Setzer or the latex parser not have a strip function that only returns the plain text? Maybe that could be used?
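For illustration, a strip function of the kind asked about here could be sketched as below. This is not existing Setzer code: `strip_latex` and `_INNER_COMMAND` are hypothetical names, and the sketch only handles commands of the form `\name{...}` or `\name*{...}`. Accent commands such as `\"{i}`, optional arguments, and escaped braces are left untouched.

```python
import re

# Hypothetical helper, not part of Setzer.
# Matches a command whose argument contains no further braces,
# i.e. the innermost \name{...} groups.
_INNER_COMMAND = re.compile(r"\\[A-Za-z]+\*?\{([^{}]*)\}")


def strip_latex(text):
    """Unwrap \\name{...} groups innermost-first until nothing changes."""
    previous = None
    while previous != text:
        previous = text
        # Replace each innermost command with its argument text.
        text = _INNER_COMMAND.sub(r"\1", text)
    return text


print(strip_latex(r"The \textit{quick settings} menu"))
# prints: The quick settings menu
```

Because each pass only unwraps groups with no braces inside, nested commands are resolved from the inside out over several passes.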

cvfosammmm commented 7 months ago

No, it doesn't, but we would need something like that. As a first step we have to correctly parse the section command though. To do that we have to detect that the first closing bracket isn't related to the section tag.
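The problem with the first closing bracket can be reproduced with a deliberately naive pattern. The regex below is only an assumption made for the demonstration, not the expression actually used in parser_latex.py.

```python
import re

line = r"\section{The \textit{quick settings} menu}"

# Naive approach (illustrative only): the title ends at the first "}".
naive = re.search(r"\\section\*?\{([^}]*)\}", line)

# The first "}" belongs to \textit, so the title is cut off early.
print(naive.group(1))
# prints: The \textit{quick settings
```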

sebastiaanfranken commented 7 months ago

How does the current implementation work, then? It shows the correct info if there's just text (and no LaTeX commands). Wouldn't it be possible to extend that code so that any LaTeX commands found in the remaining string are stripped out, or is that the wrong idea?

sebastiaanfranken commented 7 months ago

Also, I seem to have posted the same bug as #307

cvfosammmm commented 7 months ago

The way it's currently done is more of a quick hack. You can find it in parser_latex.py. It can't be easily extended in this way, I think.

I think the way to go here is to use a parser that builds a real AST (abstract syntax tree). Of course it has to do it incrementally (update on every keystroke). Maybe there's a good external library that does it, preferably written in Python. Any suggestions?

cvfosammmm commented 7 months ago

This looks promising: https://github.com/tree-sitter/py-tree-sitter . There might even be a good off-the-shelf LaTeX grammar for it that we can use (haven't looked).

cvfosammmm commented 7 months ago

It's not in Debian though, and seems to be slow to load (see https://github.com/latex-lsp/tree-sitter-latex/issues/97). So I think we need something else.

I think we might actually be able to extend the current approach. A first step would be to have a regex that really matches the closing bracket of a command. Not sure though if that's possible with the standard Python re module.

sebastiaanfranken commented 7 months ago

> It's not in Debian though, and seems to be slow to load (see latex-lsp/tree-sitter-latex#97). So I think we need something else.
>
> I think we might actually be able to extend the current approach. A first step would be to have a regex that really matches the closing bracket of a command. Not sure though if that's possible with the standard Python re module.

In my very simple testing this works:

re.findall(r"\\section[*]?{(.*)}", inputtext)

This produces a nice list of the section(s) I have, but only the inner text, complete with any LaTeX commands inside (like \textit{...}).

Granted, this is an extremely simple test, but I'd wager the re module has the required stuff.

sebastiaanfranken commented 7 months ago

If anyone is interested, here is my complete bare-bones test code:

import re

# Read the whole document; the context manager closes the file for us.
with open("input document.tex", "r", encoding="utf8") as inputtext:
    # Greedy match: captures everything up to the last "}" on each line.
    sections = re.findall(r"\\section[*]?{(.*)}", inputtext.read())

print("The LaTeX document has %d section(s):" % len(sections))

for section in sections:
    print(" - %s" % section)

cvfosammmm commented 7 months ago

This does not work in general. Just add another command with a closing bracket and try again. Does it match the last bracket?

sebastiaanfranken commented 7 months ago

For me this works. I get the inner text of \section{} and \section*{} items, with and without inner commands. Granted, I get the raw LaTeX commands as well, but this is a first step to see if the re module has the options required, which it does.

Edit: For me this is what I see / get returned:

The text has 19 section(s):
 - Woordenlijst en begrippen
 - De \textit{Activiteiten} knop
 - Datum en tijd
 - Het systeemmenu
 - De zoekbalk
 - Het bureaublad
 - Het dash
 - Je thuismap
 - Werken met tabbladen
 - Werken met sjablonen
 - Indeling
 - Zoeken naar een instelling
 - Bronnen
 - Zoeken in detail
 - Indeling
 - Verkennen
 - Ge\"{i}nstalleerd
 - Updates
 - Softwarebronnen

cvfosammmm commented 7 months ago

Actually it doesn't. You can't do bracket matching with plain regexes. But there are extensions that allow it. re does not have them. The reason it works in your case is because of the "greedy" nature of (.*) (see https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy).

That said there might still be ways to match with re most of what occurs in practice.
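To illustrate the greedy behaviour described above: the same pattern from the test script gives the expected title when the section's closing brace is the last `}` on the line, and a wrong one as soon as another braced command follows. The two input lines are made-up examples.

```python
import re

GREEDY = r"\\section[*]?{(.*)}"

# Works: the section's own "}" happens to be the last one on the line.
ok = re.findall(GREEDY, r"\section{The \textit{quick settings} menu}")
print(ok[0])
# prints: The \textit{quick settings} menu

# Fails: (.*) runs on to the last "}" on the line, which belongs to \label.
bad = re.findall(GREEDY, r"\section{Introduction} \label{sec:intro}")
print(bad[0])
# prints: Introduction} \label{sec:intro
```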

sebastiaanfranken commented 7 months ago

Yes, I know. My goal was not to do "bracket matching", but just to get the text between the { and }, regardless of what's actually inside those braces. The parsing (removing?) of LaTeX can come later, or at least that was my train of thought.

Judging by your comments, my train of thought was off, I see.

cvfosammmm commented 7 months ago

You have to know which } is the right one though, which requires bracket matching.

A thought that came to my mind just now: maybe we can match either a { ... } pair with no opening { inside, or a { ... } pair that contains exactly one inner { ... } pair. That might work in your case without regressions.
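A minimal sketch of that idea with the standard re module follows. The pattern is an assumption, not something from Setzer, and it is slightly more permissive than described: it accepts any number of inner { ... } groups as long as none of them is nested further.

```python
import re

# Title = characters that are not braces, or brace groups without
# further nesting, up to the matching "}" of \section.
SECTION = re.compile(r"\\section\*?\{((?:[^{}]|\{[^{}]*\})*)\}")

one_level = SECTION.findall(r"\section{The \textit{quick settings} menu}")
print(one_level[0])
# prints: The \textit{quick settings} menu

trailing = SECTION.findall(r"\section{Introduction} \label{sec:intro}")
print(trailing[0])
# prints: Introduction

# Two levels of nesting are beyond this pattern: no match at all.
two_levels = SECTION.findall(r"\section{a \textbf{b \textit{c}}}")
print(two_levels)
# prints: []
```

With deeper nesting the heading is skipped rather than truncated, so the failure mode is the same as the one originally reported.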

sebastiaanfranken commented 7 months ago

If you get the complete contents from the outer { and } inwards, you can parse it again and again, no? That was my train of thought, like peeling an onion. Just do it "layer by layer".

cvfosammmm commented 7 months ago

You could, if you know which ones the outer { and } are, and standard regexes can't decide that, not in general.
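For arbitrary nesting depth, one option outside of regexes is a small scanner that counts braces. The function below is a hypothetical sketch, not Setzer's parser: it rescans the whole text rather than updating incrementally, and it ignores escaped braces (`\{`, `\}`), comments, and optional arguments.

```python
def find_section_titles(source):
    """Return the raw title of every \\section{...} or \\section*{...}."""
    titles = []
    marker = "\\section"
    pos = 0
    while True:
        start = source.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)
        if i < len(source) and source[i] == "*":
            i += 1  # starred variant
        if i >= len(source) or source[i] != "{":
            pos = i  # e.g. \sectionmark: not a section heading
            continue
        depth = 0
        for j in range(i, len(source)):
            if source[j] == "{":
                depth += 1
            elif source[j] == "}":
                depth -= 1
                if depth == 0:
                    # j is the brace that closes the one opened at i.
                    titles.append(source[i + 1:j])
                    pos = j + 1
                    break
        else:
            break  # unbalanced braces: stop scanning
    return titles


text = r"\section{a \textbf{b \textit{c}}} text \section*{Intro} \label{x}"
print(find_section_titles(text))
```

Here the outer { and } are identified by the depth counter returning to zero, which is the step a standard regex cannot perform in general.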

sebastiaanfranken commented 7 months ago

Then I wonder what I did in my code, since it does do exactly that. At least with my LaTeX files, so N=1 here. Do you have some exotic LaTeX examples maybe?

cvfosammmm / Setzer

Bug: document structure (sidebar) does not show certain items. #389