JBGruber / rwhatsapp

An R package for working with WhatsApp data 💬

Oversensitive message regex #16

Closed andreblanke closed 4 years ago

andreblanke commented 4 years ago

I seem to have found a few more situations in which the existing regular expression is overly eager and identifies messages in the wrong places.

Given the following input, rwhatsapp currently identifies the authors "First Last", N/A, and "another usage of a hyphen.\nSentence with a colon".

06.02.20, 09:33 - First Last: Line 1.

Line 2 - usage of a hyphen as en dash to connect sentences,

Line 3 - another usage of a hyphen.

Sentence with a colon: other part of sentence.
06.02.20, 09:41 - First Last: Message 2.

While I don't know too much about the current implementation, I think it might be possible to use positive lookaheads to make the implementation stricter.

The following regex doesn't cover all use cases (mainly the different date formats), but it correctly parses the above message:

(?<datetime>[0-9]{2}\.[0-9]{2}\.[0-9]{2}, [0-9]{2}:[0-9]{2}) - (?:(?<sender>.+):\s+)?(?<text>[\s\S]+?)(?=(?:\n[0-9]{2}\.[0-9]{2}\.[0-9]{2}, [0-9]{2}:[0-9]{2} - )|\Z)

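To try the idea out, here is a minimal sketch in base R that applies a lookahead pattern of this shape to the sample input. It is not rwhatsapp code. The variable names (`chat`, `stamp`, `pattern`, `parsed`) are made up for illustration, the capture groups are left unnamed so they can be picked out by position, and only the `dd.mm.yy, HH:MM` timestamp format is handled.

```r
# Sample chat export from the report, joined into a single string.
chat <- paste(
  c(
    "06.02.20, 09:33 - First Last: Line 1.",
    "",
    "Line 2 - usage of a hyphen as en dash to connect sentences,",
    "",
    "Line 3 - another usage of a hyphen.",
    "",
    "Sentence with a colon: other part of sentence.",
    "06.02.20, 09:41 - First Last: Message 2."
  ),
  collapse = "\n"
)

# Timestamp in the dd.mm.yy, HH:MM format only (assumption: other formats exist).
stamp <- "[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}, [0-9]{2}:[0-9]{2}"

# Groups: (1) timestamp, (2) optional sender, (3) text.
# The text is matched lazily and stops where the lookahead succeeds: right
# before a newline followed by the next timestamp and " - ", or at the end.
pattern <- paste0(
  "(", stamp, ") - (?:(.+):\\s+)?([\\s\\S]+?)",
  "(?=(?:\\n", stamp, " - )|\\Z)"
)

# First split the chat into whole messages, then extract the groups of each.
messages <- regmatches(chat, gregexpr(pattern, chat, perl = TRUE))[[1]]
parts <- regmatches(messages, regexec(pattern, messages, perl = TRUE))

# Element 1 of each entry is the full match, elements 2 to 4 are the groups.
parsed <- data.frame(
  datetime = vapply(parts, `[`, character(1), 2),
  sender   = vapply(parts, `[`, character(1), 3),
  text     = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

print(parsed$sender)
print(nrow(parsed))
```

Because the lookahead only accepts a newline that is followed by a full timestamp and " - ", the hyphens and the colon inside the first message no longer start a new message, so `parsed` should hold two rows, both with the sender "First Last".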
JBGruber commented 4 years ago

Thanks!

This was similar to #14. You identified the problem correctly: the hyphen gets picked up as the end of the timestamp, which messes up the rest of the message. I think I was able to solve that. Thanks for your specific regex suggestion, but it wouldn't really fit with the current setup.

Anyway, I got it to work:

test <- rwhatsapp::rwa_read(x = c(
  "06.02.20, 09:33 - First Last: Line 1.",
  "",
  "Line 2 - usage of a hyphen as en dash to connect sentences,",
  "",
  "Line 3 - another usage of a hyphen.",
  "",
  "Sentence with a colon: other part of sentence.",
  "06.02.20, 09:41 - First Last: Message 2."
))

test
#> # A tibble: 2 x 6
#>   time                author   text                     source  emoji emoji_name
#>   <dttm>              <fct>    <chr>                    <chr>   <lis> <list>    
#> 1 2020-02-06 09:33:12 First L… "Line 1.\nLine 2 - usag… text i… <NUL… <NULL>    
#> 2 2020-02-06 09:41:12 First L… "Message 2."             text i… <NUL… <NULL>
test$text
#> [1] "Line 1.\nLine 2 - usage of a hyphen as en dash to connect sentences,\nLine 3 - another usage of a hyphen.\nSentence with a colon: other part of sentence."
#> [2] "Message 2."

Created on 2020-02-17 by the reprex package (v0.3.0)
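
For readers who want to see the general idea without reading the package source, the following is a hypothetical base R sketch of one way to get the same grouping from a character vector of lines. It is not the code used in rwhatsapp, and the names `chat_lines`, `stamp`, `starts` and `messages` are made up for illustration. A line only counts as the start of a message if it begins with a timestamp followed by " - ", so hyphens and colons later in a line are ignored.

```r
# The same lines as in the reprex above.
chat_lines <- c(
  "06.02.20, 09:33 - First Last: Line 1.",
  "",
  "Line 2 - usage of a hyphen as en dash to connect sentences,",
  "",
  "Line 3 - another usage of a hyphen.",
  "",
  "Sentence with a colon: other part of sentence.",
  "06.02.20, 09:41 - First Last: Message 2."
)

# Anchored at the line start; dd.mm.yy, HH:MM only (assumption).
stamp <- "^[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}, [0-9]{2}:[0-9]{2} - "

# Drop empty lines, then flag the lines that open a new message.
chat_lines <- chat_lines[nzchar(chat_lines)]
starts <- grepl(stamp, chat_lines)

# cumsum() gives every continuation line the index of the message it follows.
messages <- unname(vapply(
  split(chat_lines, cumsum(starts)),
  paste,
  character(1),
  collapse = "\n"
))

print(length(messages))
```

With the sample input this yields two messages, the first spanning four lines joined by "\n".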

JBGruber commented 4 years ago

Since I haven't heard anything, I assume this is solved.