Closed Andrew15-5 closed 1 year ago
At the heart of this is the need to understand what the equivalent of a capital letter (A, B, C,...) is in the language you're working in, and how they map to regular expressions.
So, if you can answer that, hopefully we can progress.
On Wed, 8 Mar 2023, 14:34 Andrew Voynov, @.***> wrote:
original .tex code
\begin{itemize} \item один; \item два; \item три три три три три три три три три три три три три три три три три три три;
Другой текст.
\item четыре.\end{itemize}
yaml settings
defaultIndent: " "
modifyLineBreaks:
  textWrapOptions:
    blocksBeginWith:
      # other: 'три'
      other: '\w'
    columns: 81
    when: after
actual/given output
With latexindent.pl file.tex:
\begin{itemize} \item один; \item два; \item три три три три три три три три три три три три три три три три три три три;
Другой текст.
\item четыре.\end{itemize}
With latexindent.pl -m file.tex:
\begin{itemize} \item один; \item два; \item три три три три три три три три три три три три три три три три три три три;
Другой текст. \item четыре.\end{itemize}
desired or expected output
\begin{itemize} \item один; \item два; \item три три три три три три три три три три три три три три три три три три три;
Другой текст.
\item четыре.\end{itemize}
anything else
Here I want to apply linebreaks as well as the correct indentation.
As we discussed earlier in #428 (https://github.com/cmhughes/latexindent.pl/issues/428#issuecomment-1458794462 and https://github.com/cmhughes/latexindent.pl/issues/428#issuecomment-1458824151), I have to add some kind of regex to modifyLineBreaks:textWrapOptions:blocksBeginWith:other if I want to trigger line breaks for languages with non-Latin characters.
If I use a (mostly or 100%) universal regex such as \w, then it works, but it also affects paragraphs that do not start with \item but are in the same environment as the \item paragraphs. In the example, the paragraph "Другой текст." gets an indent level of 0 instead of 8.
In other words, applying the config file without the -m option gives perfect indentation, but I also need line breaks. If I add the -m option, it adds the line breaks correctly but ruins the indentation.
The only workaround I know of is to change the other option to other: 'три'. But this means that I would have to modify the config file for every paragraph. And there is no workaround if an \item paragraph and a non-\item paragraph start with the same word (e.g., три).
— Reply to this email directly, view it on GitHub https://github.com/cmhughes/latexindent.pl/issues/431, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAQ7CYHEIFLUXIHTWE2T563W3CKGJANCNFSM6AAAAAAVT3AZG4 . You are receiving this because you are subscribed to this thread.Message ID: @.***>
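The trade-off described in the issue can be illustrated at the regex level. The following is a minimal sketch in Python, not latexindent.pl itself: latexindent.pl uses Perl regular expressions, whose behaviour may differ in detail, and the helper name `starts_with` is hypothetical. It only shows that `\w` matches the first character of both sample paragraphs, while the literal word matches only one of them.

```python
import re

# First words of the two paragraphs from the example in the issue.
item_paragraph = "три три три"
plain_paragraph = "Другой текст."


def starts_with(pattern: str, text: str) -> bool:
    """Return True if `text` begins with a match for `pattern` (hypothetical helper)."""
    return re.match(pattern, text) is not None


# For str patterns in Python 3, \w is Unicode-aware, so it matches Cyrillic letters.
print(starts_with(r"\w", item_paragraph))   # True
print(starts_with(r"\w", plain_paragraph))  # True

# A literal word only matches the paragraph that actually starts with it.
print(starts_with("три", item_paragraph))   # True
print(starts_with("три", plain_paragraph))  # False
```

This mirrors the complaint: a universal pattern cannot tell the two paragraphs apart, and a literal pattern has to be changed for every paragraph.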
I don't know what you mean by "equivalent", but I can list all the letters of the Russian alphabet:
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ
абвгдеёжзийклмнопрстуфхцчшщъыьэюя
The only letters that can be at the beginning of a sentence are:
АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯ
how they map to regular expressions
I don't know what "map" means in this context. Because a regular expression can contain any (I hope) Unicode character, you can insert the letters directly into the regex string. As I found out, \w also works on the Russian alphabet, so you can use \w instead to represent any letter you like.
If this doesn't answer your comment, then please elaborate (maybe add some examples). Specifically, explain what "equivalent of a capital letter ... in the language" and "map to regular expressions" mean. I know the words, but they don't make sense to me in this context.
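As a rough illustration of inserting the letters directly into a regex, here is a small Python sketch (an assumption for demonstration only; latexindent.pl itself is written in Perl). It derives the sentence-start letters listed above by dropping Ъ, Ы and Ь from the upper-case alphabet and builds a character class from them. The names `UPPER`, `SENTENCE_START` and `start_class` are hypothetical.

```python
import re

# Upper-case Russian alphabet, as listed in the comment above.
UPPER = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"

# Letters that can begin a sentence: everything except Ъ, Ы and Ь.
SENTENCE_START = "".join(ch for ch in UPPER if ch not in "ЪЫЬ")

# A character class built from the literal Unicode characters.
start_class = re.compile("[" + SENTENCE_START + "]")

print(SENTENCE_START)
print(bool(start_class.match("Другой текст.")))  # True: starts with Д
print(bool(start_class.match("три три три")))    # False: lower-case start
```

A class like this is stricter than `\w`: it only accepts an upper-case Cyrillic letter at the start of the text.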
In my language (English), the letters A, B, C, D, ... are represented as A-Z in regular expressions.
What is the equivalent in the language you're working in?
Ahhh! That's exactly what I did! Let me explain.
Now I at least know why you are talking about capital letters, because I found something. After failing with \w, I tested other regexes. Mainly, I tried all Latin and all Cyrillic letters, separately and combined. What is interesting to me is that the Latin capital L changes the behavior of the formatter completely (see the 2nd example under "actual/given output"), while all the other Latin letters did nothing.
I now understand that there is no reason to include [a-zA-Z] in the regex, because English paragraphs are already formatted correctly; therefore the strange L is also excluded (and the bug with it). After all that, I made a simple Cyrillic/Russian regex: [а-яёА-ЯЁ]. In many charsets, including Unicode, the Cyrillic letters ё and Ё sit apart from the rest of the alphabet, therefore I had to include them explicitly.
P.S. After seeing your last comment, I take back my words "Now I at least know why you are talking about capital letters", because your intentions are completely different from "the L bug".
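The reason ё and Ё have to be listed explicitly can be checked from the code points. The sketch below uses Python's `re` module purely as an illustration (latexindent.pl uses Perl regexes); the names `narrow`, `full` and `ALPHABET` are made up for this example. It compares the plain ranges with the class from the comment above.

```python
import re

UPPER = "АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"
LOWER = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
ALPHABET = UPPER + LOWER

narrow = re.compile("[а-яА-Я]")    # ranges only
full = re.compile("[а-яёА-ЯЁ]")    # ranges plus ё and Ё

# Range endpoints versus the two outliers.
for ch in "АЯаяЁё":
    print(ch, f"U+{ord(ch):04X}")

# Letters of the alphabet that each class fails to match.
missed_by_narrow = [ch for ch in ALPHABET if not narrow.fullmatch(ch)]
missed_by_full = [ch for ch in ALPHABET if not full.fullmatch(ch)]
print(missed_by_narrow)  # ['Ё', 'ё']
print(missed_by_full)    # []
```

А to я occupy U+0410 through U+044F, while Ё is U+0401 and ё is U+0451, so both fall just outside the ranges.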
Great, so this is resolved?
So the working config is:
defaultIndent: " "
modifyLineBreaks:
  textWrapOptions:
    blocksBeginWith:
      other: '[а-яёА-ЯЁ]'
    columns: 81
    when: after
And here is the Unicode table (to see where ё and Ё are located relative to the rest of the alphabet):
[image: Unicode table]
But for languages like Japanese this is much harder, because the characters are scattered across different ranges. If I only want to apply the formatting to Hiragana, then the range [ぁ-ゖ] would probably work (along with changing columns to 41).
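A quick way to see what the Hiragana range covers is to test it against text in the different Japanese scripts. This is a sketch in Python rather than in latexindent.pl's Perl, and the sample strings and the name `hiragana` are chosen only for illustration.

```python
import re

# The range suggested above, from ぁ to ゖ.
hiragana = re.compile("[ぁ-ゖ]")

# Code points of the range endpoints.
print(f"U+{ord('ぁ'):04X}", f"U+{ord('ゖ'):04X}")  # U+3041 U+3096

# Text starting with Hiragana, Katakana and Kanji respectively.
samples = ["ひらがな", "カタカナ", "漢字"]
results = {text: hiragana.match(text) is not None for text in samples}
for text, matched in results.items():
    print(text, matched)
```

Only the first sample matches, so a block that begins with Katakana or Kanji would need additional ranges in the character class.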
Great, so this is resolved?
Yes. I hope so... :)