PolMine / GermaParl

GermaParl R Data Package

Unrecognized Sorbian Speaker #14

Open PolMine opened 4 years ago

PolMine commented 4 years ago

There is an unrecognized speech given in Sorbian, see this snippet:

library(polmineR)

corpus("GERMAPARL") %>%
  subset(date == "2004-06-17") %>%
  subset(speaker == "Maria Michalk") %>%
  read()

ChristophLeonhardt commented 4 years ago

I am not sure the speech really goes unrecognized; I would say it is recognized. Maria Michalk gives two speeches here: the first in German, the second (with interruptions and questions in between) partly in Sorbian.

library(polmineR)

speeches <- corpus("GERMAPARL") %>%
  subset(date == "2004-06-17") %>%
  subset(speaker == "Maria Michalk") %>%
  as.speeches(s_attribute_name = "speaker")

ablaette commented 4 years ago

I fully agree: there are two distinct speeches. However, if you look at the second one (in Sorbian), something is wrong with the HTML output. This is a polmineR issue rather than a GermaParl issue.

library(polmineR)

speeches <- corpus("GERMAPARL") %>%
  subset(date == "2004-06-17") %>%
  subset(speaker == "Maria Michalk") %>%
  as.speeches(s_attribute_name = "speaker")

html(speeches[[1]])
html(speeches[[2]])

ChristophLeonhardt commented 4 years ago

Yes, I also noticed these odd tags when calling read(speeches[[1]]).