admk / sembr

⚡️ A semantic line breaker that truly breaks lines semantically. Powered by Transformers.
MIT License

Disruption of punctuation (`%`) #1

Open pvandyken opened 3 weeks ago

pvandyken commented 3 weeks ago

The following Markdown input:

According to reports, lines become 95% shorter after processing with sembr.

Is reformatted to the following:

According to reports,
lines become 95 %shorter
after processing with sembr.

Notice that the % sign has moved to the wrong location: it is detached from 95 and attached to the following word. I ran a few checks with other punctuation (#, *, $), and those characters are not affected, but I haven't searched exhaustively.

Edit: After reading the README more closely, I realize the model was trained on LaTeX rather than Markdown, so this may be a LaTeX quirk creeping in. If so, I know you've already expressed the desire for Markdown support, so feel free to close.

admk commented 3 weeks ago

Hi! Thanks for opening the issue. The problem you’re seeing is due to the current text handler treating “%” as a LaTeX comment prefix. While I hope to eventually add support for Markdown files, I don’t work with Markdown files a lot and lack the data to train the models.

The tool is already reasonably good at rewrapping plain text, which makes it sufficient for handling other file types occasionally, so this feature is low priority for now.
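
For readers unfamiliar with LaTeX comments, here is a minimal sketch of why the `%` in the example gets separated from `95`. It is an illustration only, not sembr's actual handler: the function name `split_latex_comment` and the regular expression are assumptions made for this example. In LaTeX, an unescaped `%` starts a comment that runs to the end of the line, while `\%` is a literal percent sign.

```python
import re
from typing import Tuple

# Hypothetical illustration; this is NOT sembr's real text handler.
# An unescaped "%" begins a LaTeX comment; "\%" is a literal percent sign.
# The lookbehind is deliberately simple and does not handle "\\%".
COMMENT_START = re.compile(r"(?<!\\)%")


def split_latex_comment(line: str) -> Tuple[str, str]:
    """Split a line into (text, comment) at the first unescaped '%'."""
    match = COMMENT_START.search(line)
    if match is None:
        return line, ""
    return line[:match.start()], line[match.start():]


# Markdown-style prose: the "%" is read as the start of a comment.
print(split_latex_comment("lines become 95% shorter"))

# LaTeX-style prose with an escaped percent sign stays in one piece.
print(split_latex_comment(r"lines become 95\% shorter"))
```

Under this reading, `95` and `% shorter` become separate pieces, which is consistent with the `95 %shorter` output reported above once the pieces are rejoined. Whether writing `\%` in the input avoids the problem in sembr itself is untested here.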