Closed sdrakulich closed 4 years ago
Sorry, I don't understand this problem. Could you provide a reproducible example? Did you check issue #15? Is this maybe linked to that?
I have copied the examples over with the actual content replaced by placeholder text. Anything "special" in the original message text should have been carried over faithfully.
2017-11-14, 8:43 p.m. - My Name: 4- case study: some text and (text like this) but no linebreak
2017-11-14, 8:43 p.m. - My Name: Messagetext
this is a line
another "line"
I'm in this line. It means I'm in the line.
This line has a slash/and comma, that's it
Here, comma and dash - with a slash/here and (parentheses) again
Comma, and dotdotdot....
COLON: nothing relevant
Nothing but (parentheses)
this line ends in an apostrophe'
2017-11-14, 8:43 p.m. - My Name: next message
2018-03-30, 12:27 p.m. - My Name: Rambling with linebreaks and --> ARROWS like that
More --> Arrows
last line -->..... hopefully not too bad
The problem occurs anywhere a hyphen is present in the message text. Spacing, words, and numbers make no difference; the hyphen always forces a new message to start parsing. This is especially troublesome in URLs, because they produce many lines.
I fixed the others manually, but this chat has over a thousand affected rows, sometimes with multiple skips per message. Hopefully this has a fix.
It may be worth noting that
chat %>% filter(is.na(author) & !is.na(text))
and
chat %>% filter(is.na(author) & (!emoji == "NULL" | !is.na(emoji)))
both return the same number of rows.
Two more cases that I won't even attempt to reproduce: R code pasted into the chat, and a message where someone mashed their keyboard into a string of random symbols. I can't expect either of those to work.
Solutions: Could the chat be parsed if all hyphens and colons were removed? It butchers the chat a bit, but I'm thinking it might be preferred.
Thanks a lot!
I found several small issues while working with these examples:
Example1 <- c("", "2017-11-14, 8:43 p.m. - My Name: 4- case study: some text and (text like this) but no linebreak")
rwhatsapp::rwa_read(Example1)
#> # A tibble: 1 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <lis> <list>
#> 1 2017-11-14 20:43:38 My Name 4- case study: some text… text i… <NUL… <NULL>
Turns out that rwhatsapp previously couldn't handle "p.m." but expected "PM". This is now fixed.
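To illustrate the "p.m." versus "PM" mismatch, here is a minimal sketch in base R. It is not the code rwhatsapp uses; normalize_meridiem() is a hypothetical helper, and the timestamp layout is assumed from the examples in this thread.

```r
# Hypothetical helper (not part of rwhatsapp): rewrite "a.m."/"p.m." as
# "AM"/"PM" so that the %p conversion has a chance to recognise them.
normalize_meridiem <- function(x) {
  x <- gsub("a\\.m\\.", "AM", x, ignore.case = TRUE)
  gsub("p\\.m\\.", "PM", x, ignore.case = TRUE)
}

stamp <- normalize_meridiem("2017-11-14, 8:43 p.m.")
stamp
#> [1] "2017-11-14, 8:43 PM"

# Parsing %p depends on the session locale, so this step is an assumption
# that only holds where "AM"/"PM" are the locale's meridiem markers.
parsed <- as.POSIXct(strptime(stamp, "%Y-%m-%d, %I:%M %p", tz = "UTC"))
```

The helper should only be applied to the timestamp part of a line, otherwise an "a.m." inside the message text would be rewritten too.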
Example2 <- c("2017-11-14, 8:43 p.m. - My Name: Messagetext",
"",
"this is a line",
"another \"line\"",
"I'm in this line. It means I'm in the line.",
"This line has a slash/and comma, that's it",
"",
"Here, comma and dash - with a slash/here and (parentheses) again",
"Comma, and dotdotdot....",
"COLON: nothing relevant",
"",
"Nothing but (parentheses)",
"",
"this line ends in an apostrophe'",
"2017-11-14, 8:43 p.m. - My Name: next message",
rep("2017-11-15, 8:43 a.m. - My Name: Messagetext", 20)) # To guess the correct time format, more than half the messages have to have the same format, so I'm adding a few normal messages
rwhatsapp::rwa_read(Example2)
#> # A tibble: 22 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <lis> <list>
#> 1 2017-11-14 20:43:38 My Name "Messagetext\nthis is a… text i… <NUL… <NULL>
#> 2 2017-11-14 20:43:38 My Name "next message" text i… <NUL… <NULL>
#> 3 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 4 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 5 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 6 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 7 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 8 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 9 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 10 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> # … with 12 more rows
rwhatsapp::rwa_read(Example2)$text
#> [1] "Messagetext\nthis is a line\nanother \"line\"\nI'm in this line. It means I'm in the line.\nThis line has a slash/and comma, that's it\nHere, comma and dash - with a slash/here and (parentheses) again\nComma, and dotdotdot....\nCOLON: nothing relevant\nNothing but (parentheses)\nthis line ends in an apostrophe'"
#> [2] "next message"
#> [3] "Messagetext"
#> [4] "Messagetext"
#> [5] "Messagetext"
#> [6] "Messagetext"
#> [7] "Messagetext"
#> [8] "Messagetext"
#> [9] "Messagetext"
#> [10] "Messagetext"
#> [11] "Messagetext"
#> [12] "Messagetext"
#> [13] "Messagetext"
#> [14] "Messagetext"
#> [15] "Messagetext"
#> [16] "Messagetext"
#> [17] "Messagetext"
#> [18] "Messagetext"
#> [19] "Messagetext"
#> [20] "Messagetext"
#> [21] "Messagetext"
#> [22] "Messagetext"
The main problem here is that with the time format in these messages, the - character has a special meaning (end of timestamp). I worked around this and it should work now.
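One way to keep a hyphen in the message text from being mistaken for the end of a timestamp is to treat " - " as a separator only when it directly follows a timestamp at the start of a line. The sketch below shows that idea in base R. It is an assumption about how such a workaround could look, not the actual rwhatsapp fix, and start_pattern is written only for the export format seen in this thread.

```r
# Assumed export format, taken from the examples in this thread:
# "YYYY-MM-DD, H:MM a.m./p.m. - Author: text"
start_pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}, [0-9]{1,2}:[0-9]{2} [ap]\\.m\\. - "

lines <- c(
  "2017-11-14, 8:43 p.m. - My Name: Messagetext",
  "Here, comma and dash - with a slash/here and (parentheses) again",
  "2017-11-14, 8:43 p.m. - My Name: next message"
)

# A line starts a new message only if it begins with a full timestamp;
# a " - " further into the line is ignored.
is_start <- grepl(start_pattern, lines)

# Continuation lines are appended to the message that precedes them.
messages <- unname(vapply(
  split(lines, cumsum(is_start)),
  paste, character(1), collapse = "\n"
))

length(messages)
#> [1] 2
```

Anchoring the pattern with ^ is the design choice that matters here: the second line contains " - " but no leading timestamp, so it stays part of the first message.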
Example3 <- c("2018-03-30, 12:27 p.m. - My Name: Rambling with linebreaks and --> ARROWS like that",
"",
"More --> Arrows",
"",
"last line -->..... hopefully not too bad",
rep("2017-11-15, 8:43 a.m. - My Name: Messagetext", 20))
rwhatsapp::rwa_read(Example3)
#> # A tibble: 21 x 6
#> time author text source emoji emoji_name
#> <dttm> <fct> <chr> <chr> <lis> <list>
#> 1 2018-03-30 12:27:38 My Name "Rambling with linebrea… text i… <NUL… <NULL>
#> 2 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 3 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 4 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 5 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 6 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 7 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 8 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 9 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> 10 2017-11-15 08:43:38 My Name "Messagetext" text i… <NUL… <NULL>
#> # … with 11 more rows
chat_raw <- Example3
This works as well now. Not sure what the problem was, to be honest.
Since I haven't heard anything, I assume this is solved.
Sorry to bother you again. I tried to figure it out myself so I could provide a pull request, but I'm not that skilled.
Problem: is.na(author) filters out rows where !is.na(text)
Sometimes this persists for multiple rows, so my initial (crude, I apologize) solution doesn't work. I also tried to make it work for the emojis, and definitely haven't succeeded.
Basically, the emoji part doesn't work. I tried with str_sub, but that was obviously wrong as well. Thanks so much for the help, by the way.
Here are some cases to copy-paste; hopefully it saves like 2.2 seconds ;)
chat %>% filter(is.na(author) & !is.na(text))
chat %>% filter(is.na(author) & !emoji == "NULL")
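Until the parser handles these lines, the orphaned rows could in principle be glued back onto the message they belong to. The following is a rough sketch in base R on made-up toy data, not a tested fix for a real export. It assumes the first row has an author, and joining the pieces with "\n" is a guess, since the exact text lost at the split is not known.

```r
# Hypothetical toy data mimicking the symptom: pieces of a message ended up
# in their own rows with a missing author.
chat <- data.frame(
  author = c("My Name", NA, NA, "My Name"),
  text   = c("Here, comma and dash", "with a slash/here",
             "Comma, and dotdotdot....", "next message"),
  stringsAsFactors = FALSE
)

# Every row with a known author starts a new message; orphaned rows inherit
# the id of the closest preceding message.
msg_id <- cumsum(!is.na(chat$author))

# Collapse each group of rows into a single message.
repaired <- data.frame(
  author = chat$author[!is.na(chat$author)],
  text   = as.vector(tapply(chat$text, msg_id, paste, collapse = "\n")),
  stringsAsFactors = FALSE
)

nrow(repaired)
#> [1] 2
```

The emoji and emoji_name list columns would need the same grouping treatment, which this sketch leaves out.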