JBGruber / rwhatsapp

An R package for working with WhatsApp data 💬

Multi-line chats (i.e., linebreaks) not properly accounted for #14

Closed. sdrakulich closed this issue 4 years ago

sdrakulich commented 4 years ago

Sorry to bother you again. I tried to figure this out myself so I could submit a pull request, but I'm not that skilled.

Problem: continuation lines of a multi-line message end up as separate rows with no author, so filtering on is.na(author) throws away rows where !is.na(text).

Sometimes this persists for multiple rows, so my initial (crude, I apologize) solution doesn't work. I also tried to make it work for the emojis, and definitely haven't managed that:

fix_newline_messages <- function(parsed_chat){

    for (row in 1:length(parsed_chat$author)) {

        prev <- row-1
        if (is.na(parsed_chat$author[row]) & !is.na(parsed_chat$text[row])) {
            #Fix Text, split newline with ";"
            parsed_chat$text[prev] <- paste0(parsed_chat$text[prev], "; ", parsed_chat$text[row])

            #Fix Author as well if you want...although not preferred for ease of filtering
            #parsed_chat$author[row] <- parsed_chat$author[prev]
        }

        if (is.na(parsed_chat$author[row]) & !is.na(parsed_chat$emoji[row])) {
            #Fix Emoji
            parsed_chat$emoji[prev] <- append(parsed_chat$emoji[prev], parsed_chat$emoji[row])
            #Fix Emoji Name
            parsed_chat$emoji_name[prev] <- append(parsed_chat$emoji_name[prev], parsed_chat$emoji_name[row])

        }
    }
    return(parsed_chat)
}

Basically, the emoji part doesn't work. I also tried str_sub, but that was obviously wrong as well.
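For reference, here is one way the function above could be reworked. This is only a sketch under assumptions: it keeps the same name and the "; " separator, assumes `emoji` and `emoji_name` are list columns (as in the tibble printouts further down), and runs on made-up toy data rather than real rwhatsapp output. It tracks the last row that has an author, so several continuation rows in a row are all merged into the same message, and it uses `[[ ]]` with `c()` so each list cell is extended in place.

```r
# Sketch: merge continuation rows (author is NA) into the last authored row.
fix_newline_messages <- function(parsed_chat) {
  author     <- parsed_chat$author
  text       <- parsed_chat$text
  emoji      <- parsed_chat$emoji
  emoji_name <- parsed_chat$emoji_name

  last <- NA_integer_  # index of the most recent row that has an author
  for (row in seq_along(author)) {
    if (!is.na(author[row])) {
      last <- row
      next
    }
    if (is.na(last)) next  # no earlier message to attach this row to

    if (!is.na(text[row])) {
      text[last] <- paste0(text[last], "; ", text[row])
    }
    # List columns: [[ ]] addresses one cell, c() concatenates its contents.
    if (length(emoji[[row]]) > 0) {
      emoji[[last]]      <- c(emoji[[last]], emoji[[row]])
      emoji_name[[last]] <- c(emoji_name[[last]], emoji_name[[row]])
    }
  }

  parsed_chat$text       <- text
  parsed_chat$emoji      <- emoji
  parsed_chat$emoji_name <- emoji_name
  parsed_chat
}

# Toy data, made up for illustration only.
chat <- data.frame(
  author = c("My Name", NA, NA, "My Name"),
  text   = c("Messagetext", "this is a line", "another line", "next message"),
  stringsAsFactors = FALSE
)
chat$emoji      <- list(NULL, "\U0001F600", NULL, NULL)
chat$emoji_name <- list(NULL, "grinning face", NULL, NULL)

fixed <- fix_newline_messages(chat)
fixed$text[1]
```

The continuation rows are left in place, as in the original attempt; they can be dropped afterwards with `fixed[!is.na(fixed$author), ]`.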

Thanks so much for the help by the way.

Here are some cases to copy-paste; hopefully that saves you 2.2 seconds ;)

chat %>% filter(is.na(author) & !is.na(text))
chat %>% filter(is.na(author) & !emoji == "NULL")

JBGruber commented 4 years ago

Sorry, I don't understand this problem. Could you provide a reproducible example? Did you check issue #15? Is this maybe linked to that?

sdrakulich commented 4 years ago

I have copied the messages over with the actual content replaced by placeholder text. Where a message contained something "special", I believe I reproduced it faithfully.

Example 1:

2017-11-14, 8:43 p.m. - My Name: 4- case study: some text and (text like this) but no linebreak

Example 2:

2017-11-14, 8:43 p.m. - My Name: Messagetext

this is a line
another "line"
I'm in this line. It means I'm in the line.
This line has a slash/and comma, that's it

Here, comma and dash - with a slash/here and (parentheses) again
Comma, and dotdotdot....
COLON: nothing relevant

Nothing but (parentheses)

this line ends in an apostrophe'
2017-11-14, 8:43 p.m. - My Name: next message

Example 3:

2018-03-30, 12:27 p.m. - My Name: Rambling with linebreaks and --> ARROWS like that

More --> Arrows

last line -->..... hopefully not too bad

Example 4: Forced new line message parsing

Anywhere a hyphen is present in the message text, the parser is forced to start a new message, regardless of the surrounding spacing, words, or numbers. This is especially troublesome in URLs, because a single URL can be split into many rows.

I did the others manually, but this one has over a thousand rows, sometimes multiple skips per message. Hopefully this has a fix.

It may be worth noting that chat %>% filter(is.na(author) & !is.na(text)) and chat %>% filter(is.na(author) & (!emoji == "NULL" | !is.na(emoji))) both return the same number of rows.
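One possible reason for the matching counts: `is.na()` returns FALSE for NULL entries of a list column, so the `!is.na(emoji)` branch is TRUE for every such row and the second filter may effectively reduce to `is.na(author)`. A small base-R sketch on made-up data (the `chat` object below is a toy stand-in, not real parser output) that shows this and uses `lengths()` as a more direct test for "has emoji":

```r
# Toy stand-in for a parsed chat; the values are made up for illustration.
chat <- data.frame(
  author = c("My Name", NA, NA, "My Name"),
  text   = c("first line", "second line", "third line", "next message"),
  stringsAsFactors = FALSE
)
chat$emoji <- list(NULL, NULL, "\U0001F600", NULL)

# is.na() is FALSE for NULL list entries, so !is.na(emoji) is TRUE on every
# row here and an `... | !is.na(emoji)` condition cannot exclude anything.
all(!is.na(chat$emoji))

# lengths() counts the elements in each list cell; 0 means "no emoji".
has_emoji <- lengths(chat$emoji) > 0

orphan_text  <- is.na(chat$author) & !is.na(chat$text)
orphan_emoji <- is.na(chat$author) & has_emoji
sum(orphan_text)
sum(orphan_emoji)
```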

Others:

A crazy paste I won't even attempt to convey: R code pasted into chat. Can't expect it to work.

Whoever wrote the message mashed their keyboard to the tune of random symbols. Can't expect it to work.

Solutions: Could the chat be parsed if all hyphens and colons were removed? It butchers the chat a bit, but I'm thinking it might be preferred.

JBGruber commented 4 years ago

Thanks a lot!

I found several small issues while working with these examples:

Example1 <- c("", "2017-11-14, 8:43 p.m. - My Name: 4- case study: some text and (text like this) but no linebreak")

rwhatsapp::rwa_read(Example1)
#> # A tibble: 1 x 6
#>   time                author  text                      source  emoji emoji_name
#>   <dttm>              <fct>   <chr>                     <chr>   <lis> <list>    
#> 1 2017-11-14 20:43:38 My Name 4- case study: some text… text i… <NUL… <NULL>

Turns out that rwhatsapp previously couldn't handle "p.m." but expected "PM". This is now fixed.
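For anyone stuck on a version without this fix, one possible workaround is to rewrite the marker before parsing. A minimal sketch, assuming the timestamp layout shown in the examples in this thread; `normalise_meridiem` is a hypothetical helper, not part of rwhatsapp, and this is not how the package itself was fixed:

```r
# Hypothetical helper: turn "a.m."/"p.m." into "AM"/"PM", but only inside a
# leading timestamp of the form "YYYY-MM-DD, H:MM a.m." so that message
# bodies mentioning "p.m." are left alone.
normalise_meridiem <- function(x) {
  prefix <- "^(\\d{4}-\\d{2}-\\d{2}, \\d{1,2}:\\d{2})"
  x <- sub(paste0(prefix, " a\\.m\\."), "\\1 AM", x, perl = TRUE)
  sub(paste0(prefix, " p\\.m\\."), "\\1 PM", x, perl = TRUE)
}

lines <- c(
  "2017-11-14, 8:43 p.m. - My Name: Messagetext",
  "see you at 5 p.m."
)
normalised <- normalise_meridiem(lines)
normalised
```

The second line is a continuation line without a timestamp, so it passes through unchanged.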

Example2 <- c("2017-11-14, 8:43 p.m. - My Name: Messagetext",
              "",
              "this is a line",
              "another \"line\"",
              "I'm in this line. It means I'm in the line.",
              "This line has a slash/and comma, that's it",
              "",
              "Here, comma and dash - with a slash/here and (parentheses) again",
              "Comma, and dotdotdot....",
              "COLON: nothing relevant",
              "",
              "Nothing but (parentheses)",
              "",
              "this line ends in an apostrophe'",
              "2017-11-14, 8:43 p.m. - My Name: next message",
              rep("2017-11-15, 8:43 a.m. - My Name: Messagetext", 20)) # To guess the correct time format, more than half the messages have to have the same format, so I'm adding a few normal messages

rwhatsapp::rwa_read(Example2)
#> # A tibble: 22 x 6
#>    time                author  text                     source  emoji emoji_name
#>    <dttm>              <fct>   <chr>                    <chr>   <lis> <list>    
#>  1 2017-11-14 20:43:38 My Name "Messagetext\nthis is a… text i… <NUL… <NULL>    
#>  2 2017-11-14 20:43:38 My Name "next message"           text i… <NUL… <NULL>    
#>  3 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  4 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  5 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  6 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  7 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  8 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  9 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#> 10 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#> # … with 12 more rows
rwhatsapp::rwa_read(Example2)$text
#>  [1] "Messagetext\nthis is a line\nanother \"line\"\nI'm in this line. It means I'm in the line.\nThis line has a slash/and comma, that's it\nHere, comma and dash - with a slash/here and (parentheses) again\nComma, and dotdotdot....\nCOLON: nothing relevant\nNothing but (parentheses)\nthis line ends in an apostrophe'"
#>  [2] "next message"                                                                                                                                                                                                                                                                                                            
#>  [3] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [4] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [5] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [6] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [7] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [8] "Messagetext"                                                                                                                                                                                                                                                                                                             
#>  [9] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [10] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [11] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [12] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [13] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [14] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [15] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [16] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [17] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [18] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [19] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [20] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [21] "Messagetext"                                                                                                                                                                                                                                                                                                             
#> [22] "Messagetext"

The main problem here is that with the time format in these messages, the - character has a special meaning (end of timestamp). I worked around this and it should work now.
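The general idea can be sketched without the package: treat a line as the start of a new message only when it begins with a full timestamp, and append everything else to the previous message, so a hyphen inside the body is never mistaken for the end of a timestamp. This is an illustration under assumptions, not rwhatsapp's actual implementation; the pattern `stamp` and the helper `merge_multiline` are made up for this example and only cover the date format used in this thread:

```r
# Assumed pattern for this export format: "YYYY-MM-DD, H:MM a.m./p.m. - "
stamp <- "^\\d{4}-\\d{2}-\\d{2}, \\d{1,2}:\\d{2} [ap]\\.m\\. - "

# Hypothetical helper: glue continuation lines onto the message they belong to.
merge_multiline <- function(lines) {
  lines  <- lines[nzchar(lines)]              # drop empty lines
  starts <- grepl(stamp, lines, perl = TRUE)  # TRUE where a new message begins
  group  <- cumsum(starts)                    # message id for every line
  keep   <- group > 0                         # skip lines before the first timestamp
  merged <- vapply(split(lines[keep], group[keep]),
                   paste, character(1), collapse = "\n")
  unname(merged)
}

raw <- c(
  "2017-11-14, 8:43 p.m. - My Name: Messagetext",
  "",
  "Here, comma and dash - with a slash/here and (parentheses) again",
  "2017-11-14, 8:43 p.m. - My Name: next message"
)
msgs <- merge_multiline(raw)
length(msgs)
```

Because the pattern is anchored at the start of the line, the " - " in the third line does not start a new message.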

Example3 <- c("2018-03-30, 12:27 p.m. - My Name: Rambling with linebreaks and --> ARROWS like that",
              "",
              "More --> Arrows",
              "",
              "last line -->..... hopefully not too bad",
              rep("2017-11-15, 8:43 a.m. - My Name: Messagetext", 20))

rwhatsapp::rwa_read(Example3)
#> # A tibble: 21 x 6
#>    time                author  text                     source  emoji emoji_name
#>    <dttm>              <fct>   <chr>                    <chr>   <lis> <list>    
#>  1 2018-03-30 12:27:38 My Name "Rambling with linebrea… text i… <NUL… <NULL>    
#>  2 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  3 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  4 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  5 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  6 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  7 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  8 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#>  9 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#> 10 2017-11-15 08:43:38 My Name "Messagetext"            text i… <NUL… <NULL>    
#> # … with 11 more rows

chat_raw <- Example3

This works as well now. Not sure what the problem was, to be honest.

JBGruber commented 4 years ago

Since I haven't heard anything, I assume this is solved.