ewenme / geniusr

work with data & lyrics from Genius
https://ewenme.github.io/geniusr/

Not all Lyrics parsed when separated #3

Closed muschellij2 closed 5 years ago

muschellij2 commented 6 years ago

In scrape_lyrics_url (https://github.com/ewenme/geniusr/blob/master/R/lyrics.R#L90), and likewise in scrape_lyrics_id (https://github.com/ewenme/geniusr/blob/master/R/lyrics.R#L31), there is a line that keeps only the first element: lyrics <- lyrics[1].

In the case below, this means most of the lines are not grabbed, because the lyrics on the page are split into several sections and only the first section is kept.
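As a rough illustration of the effect, here is a minimal sketch using made-up data rather than the actual page. The `sections` vector is a hypothetical stand-in for what rvest::html_text() returns when a page has several matching nodes, and base R's strsplit() is used in place of stringr so the snippet has no dependencies.

```r
# Toy stand-in for the character vector returned by rvest::html_text()
# when a page has several ".lyrics p" nodes (values are made up).
sections <- c("line 1\nline 2", "line 3\n\nline 4", "line 5")

# What `lyrics <- lyrics[1]` effectively keeps: only the first section.
first_only <- unlist(strsplit(sections[1], "\n", fixed = TRUE))

# Splitting every section instead keeps all of the lines.
all_lines <- unlist(strsplit(sections, "\n", fixed = TRUE))
all_lines <- all_lines[all_lines != ""]  # drop empty lines

length(first_only)
length(all_lines)
```

With a single section the two approaches give the same result, which may be why the problem only shows up on pages like the one below.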

Output from geniusr

library(geniusr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(xml2)
library(rvest)
url = "https://genius.com/Game-of-thrones-the-red-woman-script-annotated"
res = scrape_lyrics_url(url)
dim(res)
#> [1] 52  4

We see the output has only 52 lines, but there are many more on the website.

Output after taking out the lyrics[1] line

Below, we copy the same code, but comment out the lyrics <- lyrics[1] line.

song_lyrics_url = url
session <- suppressWarnings(rvest::html(song_lyrics_url))
song <- rvest::html_nodes(session, ".header_with_cover_art-primary_info-title") %>% 
  rvest::html_text()
artist <- rvest::html_nodes(session, ".header_with_cover_art-primary_info-primary_artist") %>% 
  rvest::html_text()
lyrics <- rvest::html_nodes(session, ".lyrics p")
xml2::xml_find_all(lyrics, ".//br") %>% xml2::xml_add_sibling("p", 
                                                              "\n")
xml2::xml_find_all(lyrics, ".//br") %>% xml2::xml_remove() %>% 
  head()
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
#> 
#> [[5]]
#> NULL
#> 
#> [[6]]
#> NULL
lyrics <- rvest::html_text(lyrics)
length(lyrics)
#> [1] 11

We see here that lyrics has 11 elements, and the last one is still relevant:

cat(tail(lyrics, n = 1))
#> Several Night’s Watch brothers allied to ALLISER THORNE are pointing crossbows at the room that DAVOS and the other Night’s Watch brothers allied with JON SNOW are holed up. ALLISER THORNE approaches the door to the room with several men and knocks. The Night’s Watch brothers inside draw their sword and GHOST growls. DAVOS stands up and walks to the door. ALLISER THORNE knocks again.
#> 
#> 
#> ALLISER THORNE: Ser Davos, we have no cause to fight. We are both anointed knights.
#> 
#> 
#> DAVOS: Hear that, lads? Nothing to fear.
#> 
#> 
#> ALLISER THORNE: I will grant amnesty to all brothers who thrown down their arms before nightfall. And you, Ser Davos, I will allow you to travel south, a free man with a fresh horse.
#> 
#> 
#> DAVOS: And some mutton. I’d like some mutton.
#> 
#> 
#> ALLISER THORNE: What?
#> 
#> 
#> DAVOS: I’m not much of a hunter. I’ll need some food if I’m gonna make it south without starving.
#> 
#> 
#> ALLISER THORNE: We’ll give you food. You can bring the Red Woman with you if you like. Or you can leave her here with us, whichever you choose. But surrender by nightfall or this ends with blood.
#> 
#> 
#> DAVOS: Thank you, Ser Alliser. We’ll discuss amongst ourselves and come back to you with an answer.
#> 
#> ALLISER THORNE and his men leave.
#> 
#> 
#> DAVOS: Boys, I’ve been running from men like that ll my life. In my learned opinion, we open that door --
#> 
#> 
#> NIGHT’S WATCHMAN #1: And they’ll slaughter us all.
#> 
#> 
#> NIGHT’S WATCHMAN #2: They want to come in, they’re gonna come in.
#> 
#> 
#> DAVOS: Aye, but we don’t need to make it easy for them.
#> 
#> 
#> NIGHT’S WATCHMAN #2: Edd is our only chance.
#> 
#> 
#> NIGHT’S WATCHMAN #1: It’s a sad fucking statement if Dolorous Edd is our only chance.
#> 
#> 
#> DAVOS: There’s always the Red Woman.
#> 
#> 
#> NIGHT’S WATCHMAN #1: What’s one redhead gonna do against 40 armed men?
#> 
#> 
#> DAVOS: You haven’t seen her do what I’ve seen her do.
#> 
#> CUT TO: CASTLE BLACK - MELISANDRE’S CHAMBER
#> 
#> MELISANDRE is sitting at the edge of her bed by a fire. She looks across the room at a small mirror standing on a table, then walks over to it. MELISANDRE gazes into the mirror, then disrobes. She removes her collar. The gemstone in its center glimmers. She places the collar on the table beside the mirror. The mirror’s reflection reveals that her appearance has changed to that of an en extremely elderly woman. MELISANDRE continues to stare at herself in the mirror, then walks back over to the bed and gets under the covers.

Thus, if we comment out that line, we still get a tibble, and it contains the full data:

# lyrics <- lyrics[1] # removed this line
lyrics <- unlist(stringr::str_split(lyrics, pattern = "\n"))
lyrics <- lyrics[lyrics != ""]
lyrics <- lyrics[!stringr::str_detect(lyrics, pattern = "\\[|\\]")]
lyrics <- tibble::tibble(line = lyrics)
lyrics$song_lyrics_url <- song_lyrics_url
lyrics$song_name <- song
lyrics$artist_name <- artist
wanted_result = tibble::as_tibble(lyrics)
dim(res)
#> [1] 52  4
dim(wanted_result)
#> [1] 345   4

I don't know the rationale for that line, but it seems there are cases where it works and others where it does not, namely when the lyrics are separated into several sections. I'd send a PR with this, but I'm unsure of the effects and wanted to get some feedback first.
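For discussion, one possible shape for the change is sketched here. The helper name `clean_lyric_sections` is hypothetical and not part of geniusr. It applies the same cleaning steps as the snippet above to every section rather than only the first, using base R equivalents of the stringr calls.

```r
# Hypothetical helper, not part of geniusr: applies the cleaning steps
# to every section of the lyrics rather than only the first one.
clean_lyric_sections <- function(sections) {
  # split each section into individual lines
  lines <- unlist(strsplit(sections, "\n", fixed = TRUE))
  # drop empty lines
  lines <- lines[lines != ""]
  # drop lines containing square brackets, e.g. "[Chorus]"
  lines[!grepl("\\[|\\]", lines)]
}

# Made-up input with two sections
example_lines <- clean_lyric_sections(
  c("[Verse 1]\nfirst line\n\nsecond line", "[Chorus]\nthird line")
)
example_lines
```

Whether dropping every line that contains a bracket is still the right behaviour for script-style pages is a separate question; the sketch only keeps what the existing code does.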

@avalcarcel9

ewenme commented 5 years ago

Thanks for checking, this was an oversight on my part. I have removed this line from scrape_lyrics_url and scrape_lyrics_id to ensure lyrics aren't cut off in these cases. Sorry for taking so long...