Closed andreblanke closed 4 years ago
Thanks!
This was similar to #14. You identified the problem correctly: the hyphen gets picked up as the end of the timestamp, which messes up the rest of the message. I think I was able to solve that. Thanks for your specific regex suggestion, but it wouldn't really fit with the current setup.
Anyway, I got it to work:
test <- rwhatsapp::rwa_read(x = c(
"06.02.20, 09:33 - First Last: Line 1.",
"",
"Line 2 - usage of a hyphen as en dash to connect sentences,",
"",
"Line 3 - another usage of a hyphen.",
"",
"Sentence with a colon: other part of sentence.",
"06.02.20, 09:41 - First Last: Message 2."
))
test
#> # A tibble: 2 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <lis> <list>
#> 1 2020-02-06 09:33:12 First L… "Line 1.\nLine 2 - usag… text i… <NUL… <NULL>
#> 2 2020-02-06 09:41:12 First L… "Message 2." text i… <NUL… <NULL>
test$text
#> [1] "Line 1.\nLine 2 - usage of a hyphen as en dash to connect sentences,\nLine 3 - another usage of a hyphen.\nSentence with a colon: other part of sentence."
#> [2] "Message 2."
Created on 2020-02-17 by the reprex package (v0.3.0)
Since I haven't heard anything, I assume this is solved.
I seem to have found a few more situations in which the existing regular expression is too eager and identifies the start of a message in the wrong places.
Given the following input, rwhatsapp currently identifies the authors "First Last", N/A, and "another usage of a hyphen.\nSentence with a colon".
While I don't know much about the current implementation, I think it might be possible to use positive lookaheads to make the matching stricter.
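To illustrate the lookahead idea in general terms, here is a minimal sketch in base R. It is not the regex suggested in this issue and not how rwhatsapp is implemented. It assumes only the "dd.mm.yy, HH:MM - Author: text" layout from the example, and the names `boundary`, `messages`, and `authors` are made up for illustration. The point is that a newline only counts as a message boundary when a complete timestamp, separator, and author prefix follows it, so a stray " - " or ": " inside the text is ignored.

```r
# Example chat lines from this thread (empty lines omitted for brevity).
lines <- c(
  "06.02.20, 09:33 - First Last: Line 1.",
  "Line 2 - usage of a hyphen as en dash to connect sentences,",
  "Line 3 - another usage of a hyphen.",
  "Sentence with a colon: other part of sentence.",
  "06.02.20, 09:41 - First Last: Message 2."
)
raw <- paste(lines, collapse = "\n")

# Assumption: only the "dd.mm.yy, HH:MM" date format is handled here.
# The positive lookahead (?= ... ) checks that a full message header
# follows the newline without consuming it, so the header stays in the text.
boundary <- "\\n(?=\\d{2}\\.\\d{2}\\.\\d{2}, \\d{2}:\\d{2} - [^:\\n]+: )"

# Split only at newlines that are followed by a complete header.
messages <- strsplit(raw, boundary, perl = TRUE)[[1]]

# Pull the author out of each message; (?s) lets "." match newlines so the
# whole multi-line message is replaced by the captured author.
authors <- sub(
  "(?s)^\\d{2}\\.\\d{2}\\.\\d{2}, \\d{2}:\\d{2} - ([^:\\n]+): .*$",
  "\\1", messages, perl = TRUE
)

length(messages)
authors
```

With this input the split should produce two messages, both attributed to "First Last". Supporting other date formats would need a more general header pattern, which this sketch does not attempt.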
The following regex doesn't cover all use cases (mainly the different date formats), but it correctly parses the above message: