Pustur / whatsapp-chat-parser

A package to parse WhatsApp chats with Node.js or in the browser 💬
https://whatsapp-chat-parser.netlify.app
MIT License

Styling in Whatsapp messages #251

Open 4dwaith opened 1 year ago

4dwaith commented 1 year ago

WhatsApp messages can contain styling within the text, such as bold, underline, and strikethrough. The current output format only allows for plain-text messages.

I'm new to open source but would be happy to help build this feature. However, I can't think of a way to get this working without breaking the API contract.

Pustur commented 1 year ago

EDIT: They added more styles since I wrote this comment

Hi @4dwaith,

Interesting idea. Let's start by looking at which styles WhatsApp supports:

_italic_
*bold*
~strikethrough~
```monospace```

In html that would be:

<i>italic</i>
<b>bold</b>
<s>strikethrough</s>
<pre>monospace</pre>
<!-- or -->
<code>monospace</code>

We would need to either create some new regex patterns to detect the special characters or use a lightweight library to do it for us.

Several tests would be needed to catch the edge cases, for example if you have:

```var my_nice_variable = 'my string';```

It should not format the _nice_ as italic because it's already inside a code block.

Or a URL with underscores may get formatted and stop working.
There are many things that can go wrong.

With this in mind, I honestly think this could overcomplicate things too much for my liking. I'd like to keep this library dependency-free and as simple as possible.


> can't think of a way to get this working without breaking the API contract.

That would not be a problem as long as the feature is implemented behind an optional configuration. Something like this:

whatsapp.parseString(text, { parseRichText: true });
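
A minimal sketch of how such an opt-in flag could leave the default output untouched. Everything here is an assumption made for illustration: the `Message` shape, the `RichTextOptions` type, and the `applyRichText` helper are hypothetical and are not part of the library's actual API.

```typescript
// Assumed shape of a parsed message; verify against the library's own types.
interface Message {
  date: Date;
  author: string | null;
  message: string;
}

// Hypothetical option, named after the suggestion above.
interface RichTextOptions {
  parseRichText?: boolean;
}

// Hypothetical post-processing step. `format` is whatever converts
// WhatsApp markup into the desired output (HTML, Markdown, ...).
function applyRichText(
  messages: Message[],
  format: (text: string) => string,
  options: RichTextOptions = {},
): Message[] {
  // Opt-in only: without the flag the messages are returned untouched,
  // so existing callers see no change.
  if (!options.parseRichText) return messages;
  return messages.map((m) => ({ ...m, message: format(m.message) }));
}
```

Because the flag defaults to off, callers that never pass it keep receiving plain-text messages.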

speshak commented 11 months ago

@Pustur These sequences look a lot like markdown. Maybe you can use an existing markdown formatter library (or perhaps the consuming code should use a markdown rendering library so you don't have to do anything at all.)

Pustur commented 11 months ago

@speshak I'm leaning more towards the second option; this should be done outside the library.

Also, while the format looks like Markdown, it's not exactly a common flavour of it as far as I can tell. In the following example, both the italic and the bold are rendered as italic by default:

Marked Demo

It seems possible to customize how that library works but I'm not currently interested in doing so.

4dwaith commented 10 months ago

@Pustur Apologies, I have no idea why I didn't notice your first response. I should've responded months ago.

Not sure about the regex pattern. As your code-block example shows, whether or not to parse the italics depends on whether we have previously encountered a code marker. That makes the parsing context-dependent, so I don't think plain regex patterns alone will be enough.

That said, I don't think your two examples would have an issue - underscores only indicate italics if there are spaces before the start mark and after the end mark, and no spaces after the start mark and before the end mark. URLs for sure wouldn't follow that rule, though code might.
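
A minimal sketch of that whitespace rule for underscores. It implements only the rule as stated in this comment; the function name `italicize` is made up for illustration, and real WhatsApp behaviour (for example around punctuation) may differ.

```typescript
// Sketch of the whitespace rule only, not a full WhatsApp formatter.
function italicize(text: string): string {
  // Opening "_": at the start of the text or after whitespace,
  //              and immediately followed by a non-space character.
  // Closing "_": immediately preceded by a non-space character,
  //              and followed by whitespace or the end of the text.
  const italic = /(^|\s)_([^\s_](?:[^\n]*?[^\s_])?)_(?=\s|$)/g;
  return text.replace(italic, "$1<i>$2</i>");
}
```

With this rule, `this is _italic_ text` gets an `<i>` tag, while `https://example.com/my_nice_page` and `my_nice_variable` are left alone because their underscores sit between non-space characters.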

I've played around a bit, and the rules actually seem straightforward and intuitive. Here are my conclusions:

  1. Code markers interrupt and unstyle everything else.

    ``` these *are* _just_ ~five~ words ``` becomes these *are* _just_ ~five~ words

    *these ```are just five``` words* becomes these are just five words

  2. Code markers are also the only styles that work across multiple lines.
  3. The other three styles are compatible with each other.

    *these _are ~just~ five_ words* becomes these are ~just~ five words

  4. When two styles conflict, the one that appeared first wins

    *these _are just * five words_ becomes these _are just five words_
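
The rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `formatMessage`, `formatInline`, and `escapeHtml` are hypothetical names, the whitespace boundary rule from earlier in this comment is assumed, and rule 4 is only approximated by letting the leftmost match win.

```typescript
// Triple-backtick code marker, built with repeat() so the literal
// does not appear inside this example.
const FENCE = "`".repeat(3);

// Mapping from WhatsApp marks to HTML tags, as listed earlier in the thread.
const TAGS: Record<string, string> = { "*": "b", "_": "i", "~": "s" };

function escapeHtml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Rules 2 and 3: inline styles stay on one line and may nest.
function formatInline(text: string): string {
  // Opening mark: at the start or after whitespace, followed by a non-space.
  // Closing mark: same character, preceded by a non-space,
  // followed by whitespace or the end of the text.
  const inline = /(^|\s)([*_~])(\S(?:[^\n]*?\S)?)\2(?=\s|$)/g;
  return text.replace(
    inline,
    (_match: string, pre: string, mark: string, body: string) => {
      const tag = TAGS[mark];
      // Recurse so that styles inside the matched span are handled too.
      return `${pre}<${tag}>${formatInline(body)}</${tag}>`;
    },
  );
}

// Rule 1: code markers take precedence and their content stays literal.
function formatMessage(text: string): string {
  const parts = text.split(FENCE);
  let html = "";
  parts.forEach((part, i) => {
    const isCode = i % 2 === 1;
    const isClosed = i < parts.length - 1;
    if (isCode && isClosed) {
      html += `<code>${escapeHtml(part)}</code>`;
    } else if (isCode) {
      // Unmatched opening marker: keep it as literal text.
      html += formatInline(escapeHtml(FENCE + part));
    } else {
      html += formatInline(escapeHtml(part));
    }
  });
  return html;
}
```

For example, `formatMessage("*these _are ~just~ five_ words*")` returns `<b>these <i>are <s>just</s> five</i> words</b>`, and anything between two code markers is wrapped in `<code>` with its marks left as they are.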


Thank you very much for that bit about strikethrough! All this time I thought we would be forced to use CSS attributes and span tags. Can't believe I hadn't heard of that tag, this looks much more doable now.

Pustur commented 10 months ago

The marked library seems relatively easy to extend, and I got bold to work. The new problem is that newlines are not respected by default, since Markdown needs two spaces at the end of a line to insert a <br>.
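
One possible workaround for the newline problem, sketched under assumptions: pre-process the message so every line ends in two spaces before it reaches the Markdown renderer. The helper name `withHardBreaks` is hypothetical, and this naive version also touches lines inside code blocks.

```typescript
// Normalise any trailing spaces or tabs before a newline to exactly two
// spaces, which Markdown treats as a hard line break (<br>).
function withHardBreaks(text: string): string {
  return text.replace(/[ \t]*\r?\n/g, "  \n");
}
```

marked also documents a `breaks` option that renders single newlines as `<br>`, which may be the simpler route; it is untested here against the demo.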

See the Codesandbox demo

Maybe you can make it work properly in the context of WhatsApp messages.