JBGruber / rwhatsapp

An R package for working with WhatsApp data 💬

Author fails to extract if text contains `:` and two or more linebreaks #15

Closed andreblanke closed 4 years ago

andreblanke commented 4 years ago

The author of a message seems to be incorrectly reported as `NA` if the message text contains both a `:` and two or more line breaks.

The following should be a minimal reproducible example:

example.zip

chat0.txt
08.02.20, 17:35 - First Last: The time is 17:35.
2nd line.
3rd line.
chat1.txt
08.02.20, 17:35 - First Last: The time is 17:35.
2nd line.
test.Rmd
---
output: html_notebook
---

```{r}
library("rwhatsapp")
chat0 <- rwa_read("chat0.txt")
chat0
chat1 <- rwa_read("chat1.txt")
chat1
```

It reports `NA` as author of the message in `chat0.txt` and `First Last` as author of the message in `chat1.txt`.

I don't know if this is related to #14, as I didn't quite understand what that issue is about. Excuse me if it is a duplicate.

JBGruber commented 4 years ago

Wow, thanks for reporting this. I even had problems coming up with a test to reproduce it since this only seems to happen when the first message contains a time plus several lines (so thanks for doing the hard work of narrowing it down to the reprex you posted). It should work now:

rwhatsapp::rwa_read(x = c("08.02.20, 17:35 - First Last: The time is 17:36.",
                          "2nd line.",
                          "3rd line.",
                          "08.02.20, 17:35 - First Last: The time is 17:36.",
                          "2nd line."))
#> # A tibble: 2 x 6
#>   time                author   text                    source   emoji emoji_name
#>   <dttm>              <fct>    <chr>                   <chr>    <lis> <list>    
#> 1 2020-02-08 17:35:26 First L~ "The time is 17:36.\n2~ text in~ <NUL~ <NULL>    
#> 2 2020-02-08 17:35:26 First L~ "The time is 17:36.\n2~ text in~ <NUL~ <NULL>

Created on 2020-02-08 by the reprex package (v0.3.0)

andreblanke commented 4 years ago

Thanks a lot for the quick fix. I thought all the other issues in my data set would also stem from this misbehavior. However, it seems there are more situations in which the existing regex is a bit sensitive, so I'll file a separate issue for those.
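
One way to sidestep this class of problem is to extract the author from the header line alone, before any continuation lines are joined to the message. The following is only a base-R sketch of that idea, not how rwhatsapp implements its parsing. `parse_chat()` is a hypothetical helper, and the header pattern is an assumption that covers only the `dd.mm.yy, hh:mm - ` format used in the example above.

```r
# Hypothetical helper, NOT part of rwhatsapp: illustrates splitting the author
# off the header line before multi-line messages are reassembled.
parse_chat <- function(lines) {
  # Assumed header format: "dd.mm.yy, hh:mm - " (as in the example chat files).
  header <- "^[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}, [0-9]{2}:[0-9]{2} - "
  is_start <- grepl(header, lines)

  # Continuation lines inherit the id of the preceding message start;
  # lines before the first header (id == 0) are dropped.
  id <- cumsum(is_start)
  keep <- id > 0
  lines <- lines[keep]
  id <- id[keep]
  is_start <- is_start[keep]

  # Work on the header line only, so later colons or line breaks in the
  # message text cannot affect the author.
  first <- sub(header, "", lines[is_start])
  pos <- as.integer(regexpr(": ", first, fixed = TRUE))
  author <- ifelse(pos > 0, substr(first, 1, pos - 1), NA_character_)
  first_text <- ifelse(pos > 0, substr(first, pos + 2, nchar(first)), first)

  # Put the stripped first line back and join each message with "\n".
  lines[is_start] <- first_text
  text <- vapply(split(lines, id), paste, character(1), collapse = "\n")

  data.frame(author = author, text = unname(text), stringsAsFactors = FALSE)
}

chat0 <- parse_chat(c("08.02.20, 17:35 - First Last: The time is 17:35.",
                      "2nd line.",
                      "3rd line."))
chat0$author
```

Header lines without a `: ` separator (for example, system notices) get `NA` as author in this sketch, which is a design choice of the example rather than documented package behavior.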