IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

requery_hier not returning the same number of rows anymore in 2.2.0 #247

Open FredrikKarlssonSpeech opened 3 years ago

FredrikKarlssonSpeech commented 3 years ago

I just noticed an issue with the 2.2.0 update and requery_hier.

This is what used to happen:

> library(dplyr)
> library(emuR) #Loads 2.1.1
> library(tidyr)
> 
> 
> load_emuDB(file.path("..","Data","GU_emuDB")) -> gu
> query(gu,"[CV = C|V ^ Task=pa|ta|ka]") -> patakaC_V
> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
> dim(patakaC_V)
[1] 34293    16
> dim(patakaCVSylls)
[1] 34293    16

Now, if I update to the 2.2.0 version of the package, I do not get the expected behavior:

> remove.packages("emuR")
Removing package from ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
(as ‘lib’ is unspecified)
> install.packages("emuR")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/emuR_2.2.0.tgz'
Content type 'application/x-gzip' length 2824883 bytes (2.7 MB)
==================================================
downloaded 2.7 MB

The downloaded binary packages are in
    /var/folders/vc/lhvg_40x50l3nb3rndb4kwbm0000gp/T//RtmpLqlYhm/downloaded_packages

> library(emuR)

Attaching package: ‘emuR’

The following object is masked from ‘package:base’:

    norm

> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34293)!
> dim(patakaCVSylls) # Just to check...
[1] 17181    16

raphywink commented 3 years ago

Ok this would not be good... I spent about 100 hours rewriting almost the entire query engine to actually fix issues like this. Could you maybe send me a reprex with the output you'd expect from the query? And maybe also confirm that the old result was correct?

raphywink commented 3 years ago

I fixed something in the requery which accidentally got rid of duplicate segments in certain queries. Could you maybe check if the current dev version (2.2.0.9000) fixes the issue? If so then I'll try to release a new version of emuR asap...

FredrikKarlssonSpeech commented 3 years ago

Sorry, I did not see your previous message but I have installed a

> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34296)!
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> #requery_hier(gu,patakaSylls,"Task") -> patakaSyllTasks #Not needed?
> requery_hier(gu,patakaC_V,"Task") -> patakaCVSyllTasks
Warning message:
In requery_hier(gu, patakaC_V, "Task") :
  Found missing items in resulting segment list! Replaced missing rows with NA values.
> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34296)!
> nrow(patakaCVSylls)
[1] 17181
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> nrow(patakaSylls)
[1] 17181
> requery_hier(gu,patakaC_V,"Task") -> patakaCVSyllTasks
Warning message:
In requery_hier(gu, patakaC_V, "Task") :
  Found missing items in resulting segment list! Replaced missing rows with NA values.
> nrow(patakaCVSyllTasks)
[1] 34296

So it seems that the issue remains. I would expect the introduced NAs, as the linking is off in 4 instances, but it seems that what requery_hier returns is a list of unique segments.

patakaC_V simply contains Cs and Vs, which in pairs belong to a syllable. So patakaCVSylls should predominantly contain two identical rows for each syllable (except in the cases where the C/V pairing does not hold).

So, this code:

> nrow(patakaCVSylls)
[1] 17181
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> nrow(patakaSylls)
[1] 17181

should not return the same result. The first nrow() call should return 34296, one row per input segment.
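
To make the expectation concrete, here is a minimal toy sketch in base R. It does not use emuR at all: the data frames `segments` and `syllables` and their columns are made up for illustration, and the `match()` lookup only stands in for what I assume a row-preserving requery with collapse=FALSE should do. It shows one output row per input segment, an NA row for an unlinked segment, and fewer rows only after duplicates are removed explicitly.

```r
# Hypothetical stand-in for a C/V segment list: five segments, the last one
# with no link to any syllable (all names and values are invented).
segments <- data.frame(
  label   = c("p", "a", "t", "a", "k"),
  syll_id = c(1, 1, 2, 2, NA)
)

# Hypothetical parent level: two syllables.
syllables <- data.frame(
  syll_id    = c(1, 2),
  syll_label = c("pa", "ta")
)

# Row-preserving lookup: one result row per input segment.
# Segments sharing a syllable yield identical rows; the unlinked
# segment yields a row of NA values.
per_segment <- syllables[match(segments$syll_id, syllables$syll_id), ]
nrow(per_segment)                     # 5, same as nrow(segments)
sum(is.na(per_segment$syll_label))    # 1

# Only an explicit de-duplication step should reduce the row count.
collapsed <- unique(per_segment)
nrow(collapsed)                       # 3: "pa", "ta" and the NA row
```

In terms of this sketch, the pre-2.2.0 behaviour corresponds to `per_segment`, while the current result looks like `collapsed`.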