Inconsistent indentation of Non-Latin `\item` and non-`\item` paragraphs (with linebreaks)

Andrew15-5 commented 1 year ago

original .tex code

\begin{itemize}
  \item один;
  \item два;
  \item три три три три три три три три три три три три три три три три три три три;

    Другой текст.
  \item четыре.
\end{itemize}

yaml settings

defaultIndent: "  "
modifyLineBreaks:
  textWrapOptions:
    blocksBeginWith:
      # other: 'три'
      other: '\w'
    columns: 81
    when: after

actual/given output

With latexindent.pl file.tex:

\begin{itemize}
  \item один;
  \item два;
  \item три три три три три три три три три три три три три три три три три три три;

        Другой текст.
  \item четыре.
\end{itemize}

With latexindent.pl -m file.tex:

\begin{itemize}
  \item один;
  \item два;
  \item три три три три три три три три три три три три три три три три три три
        три;

Другой текст.
  \item четыре.
\end{itemize}

desired or expected output

\begin{itemize}
  \item один;
  \item два;
  \item три три три три три три три три три три три три три три три три три три
        три;

        Другой текст.
  \item четыре.
\end{itemize}

anything else

Here I want to apply linebreaks as well as the correct indentation.

As we discussed earlier (https://github.com/cmhughes/latexindent.pl/issues/428#issuecomment-1458794462, https://github.com/cmhughes/latexindent.pl/issues/428#issuecomment-1458824151), I have to put some kind of regex in modifyLineBreaks:textWrapOptions:blocksBeginWith:other if I want line breaks to be triggered for languages with non-Latin characters.

If I use a (nearly) universal regex such as \w, it works, but it also affects paragraphs that do not start with \item yet sit in the same environment as the \item paragraphs. In the example, the paragraph "Другой текст." gets an indent of 0 instead of 8.

In other words, applying the config file without the -m option gives perfect indentation, but I also need line breaks. If I add the -m option, the line breaks are added correctly but the indentation is ruined.

The only workaround I know of is to change the other option to other: 'три'. But this means I would have to modify the config file for every paragraph. And there is no workaround at all if an \item paragraph and a non-\item paragraph start with the same word (e.g., три).
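To make the trade-off between the two `other` values concrete, here is a minimal Python sketch of the regex behaviour only. It assumes Python's `re` module treats `\w` the same way for these strings as the Perl regex engine used by latexindent.pl, and it does not model how latexindent.pl itself applies `blocksBeginWith`. The helper name `starts_block` is hypothetical.

```python
import re

def starts_block(pattern: str, paragraph: str) -> bool:
    """Hypothetical helper: does the paragraph begin with text matching pattern?"""
    return re.match(pattern, paragraph) is not None

item_text = "три три три"      # text that follows \item in the example
plain_text = "Другой текст."   # the non-\item paragraph

# \w matches any Unicode word character, so both paragraphs qualify
print(starts_block(r"\w", item_text), starts_block(r"\w", plain_text))

# the literal 'три' matches only paragraphs that start with that word
print(starts_block("три", item_text), starts_block("три", plain_text))
```

The sketch only shows why `\w` cannot tell the two kinds of paragraph apart, while a literal word can, at the cost of being tied to one specific paragraph.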

cmhughes commented 1 year ago

At the heart of this is the need to understand what the equivalent of a capital letter (A, B, C,...) is in the language you're working in, and how they map to regular expressions.

So, if you can answer that, hopefully we can progress.

(Reply sent by email to Andrew Voynov's post of Wed, 8 Mar 2023, 14:34.)

Andrew15-5 commented 1 year ago

I don't know what you mean by equivalent, but I can list all the letters of the Russian alphabet:

АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдеёжзийклмнопрстуфхцчшщъыьэюя

The only letters that can be at the beginning of a sentence are:

АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ
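The list of sentence-starting letters above can be written as a compact character class. The following Python sketch assumes the class `[А-ЩЭ-ЯЁ]` (a suggestion, not something taken from the latexindent.pl documentation) and checks that it selects exactly those letters. It relies on А-Щ and Э-Я being contiguous in Unicode, with Ъ, Ы, Ь in between.

```python
import re

sentence_starters = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ"   # list from the comment above
all_capitals = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"

# А-Щ and Э-Я are contiguous in Unicode; Ё sits outside both and is listed separately
starter_class = re.compile("[А-ЩЭ-ЯЁ]")

# keep only the capitals that the class accepts, in alphabet order
matched = "".join(ch for ch in all_capitals if starter_class.fullmatch(ch))
print(matched == sentence_starters)
```

Whether such a class behaves identically inside latexindent.pl has not been verified here; the sketch only confirms the range arithmetic.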

> how they map to regular expressions

I don't know what "map" means in this context. Because a regular expression can contain any (I hope) Unicode character, you can insert the letters directly into the regex string. As I found out, \w also works on the Russian alphabet, so you can use \w instead to represent any letter you like.

If this doesn't answer your comment, then please elaborate (maybe add some examples). Specifically, explain what "equivalent of a capital letter ... in the language" and "map to regular expressions" mean. I know the words, but they don't make sense to me in this context.

cmhughes commented 1 year ago

In my language (English) the letters A, B, C, D, ... are represented as A-Z in regular expressions.

What is the equivalent in the language you're working in?


Andrew15-5 commented 1 year ago

Ahhh! That's exactly what I did! Let me explain.

Andrew15-5 commented 1 year ago

Now I at least know why you are talking about capital letters, because I found something. After failing with \w, I tested other regexes: all Latin and all Cyrillic letters, separately and combined. What is interesting is that the Latin capital L changes the behavior of the formatter completely (see the second example under actual/given output), while the other Latin letters do nothing.

I now understand that there is no reason to include [a-zA-Z] in the regex, because English paragraphs are already formatted correctly; leaving it out also excludes the strange L (and the bug with it). After all that, I made a simple Cyrillic/Russian regex: [а-яёА-ЯЁ]. In many charsets, including Unicode, the Cyrillic letters ё and Ё are encoded separately from the rest of the alphabet, so I had to include them explicitly.
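As a quick check of why ё and Ё have to be listed explicitly, this small Python sketch prints the code points of a few letters and tests them against the class with and without the two extra letters. The variable names are illustrative, and the sketch assumes Python's range semantics match those of the regex engine behind latexindent.pl.

```python
import re

without_yo = re.compile("[а-яА-Я]")
with_yo = re.compile("[а-яёА-ЯЁ]")

# columns: letter, code point, matched without ё/Ё, matched with ё/Ё
for ch in "АЯаяЁё":
    print(ch, f"U+{ord(ch):04X}",
          bool(without_yo.fullmatch(ch)), bool(with_yo.fullmatch(ch)))
```

The ranges а-я and А-Я cover U+0410 to U+044F, while Ё (U+0401) and ё (U+0451) fall just outside on either side.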

P.S. After seeing your last comment, I take back my words "Now I at least know why you are talking about capital letters", because what you meant has nothing to do with "the L bug".

cmhughes commented 1 year ago

Great, so this is resolved?

Andrew15-5 commented 1 year ago

So the working config is:

defaultIndent: "  "
modifyLineBreaks:
  textWrapOptions:
    blocksBeginWith:
      other: '[а-яёА-ЯЁ]'
    columns: 81
    when: after

In the Unicode table, Ё (U+0401) and ё (U+0451) sit just outside the contiguous block that holds the rest of the alphabet, А to я (U+0410 to U+044F).

For languages like Japanese this is much harder, because the characters are scattered across several ranges. If I only wanted to wrap text that starts with Hiragana, the range [ぁ-ゖ] would probably work (together with changing columns to 41).
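A minimal Python sketch of what the Hiragana range does and does not cover. It only illustrates the regex; it has not been tried with latexindent.pl, and the sample strings are made up for the test.

```python
import re

hiragana = re.compile("[ぁ-ゖ]")  # U+3041 to U+3096

# Hiragana, Katakana and Kanji samples respectively
samples = ["ひらがな", "カタカナ", "漢字"]
for text in samples:
    # True only when the first character is in the Hiragana range
    print(text, bool(hiragana.match(text)))
```

Text that starts with Katakana or Kanji is not matched, which is the "scattered ranges" problem: each additional script needs its own range in the character class.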

Andrew15-5 commented 1 year ago

> Great, so this is resolved?

Yes. I hope so... :)
