requery_hier not returning the same number of rows anymore in 2.2.0 #247

Open FredrikKarlssonSpeech opened 3 years ago

FredrikKarlssonSpeech commented 3 years ago

I just noticed an issue with the 2.2.0 update and requery_hier.

This is what used to happen:

> library(dplyr)
> library(emuR) #Loads 2.1.1
> library(tidyr)
> load_emuDB(file.path("..","Data","GU_emuDB")) -> gu
> query(gu,"[CV = C|V ^ Task=pa|ta|ka]") -> patakaC_V
> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
> dim(patakaC_V)
[1] 34293    16
> dim(patakaCVSylls)
[1] 34293    16

Now, if I update to the 2.2.0 version of the package, I do not get the expected behavior:

> remove.packages("emuR")
Removing package from ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
(as ‘lib’ is unspecified)
> install.packages("emuR")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/emuR_2.2.0.tgz'
Content type 'application/x-gzip' length 2824883 bytes (2.7 MB)
downloaded 2.7 MB

The downloaded binary packages are in

> library(emuR)

Attaching package: ‘emuR’

The following object is masked from ‘package:base’:


> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34293)!
> dim(patakaCVSylls) # Just to check...
[1] 17181    16

raphywink commented 3 years ago

Ok this would not be good... I spent about 100 hours rewriting almost the entire query engine to actually fix issues like this. Could you maybe send me a reprex with the output you'd expect from the query? And maybe also confirm that the old result was correct?

raphywink commented 3 years ago

I fixed something in the requery that was accidentally getting rid of duplicate segments in certain queries. Could you maybe check if the current dev version fixes the issue? If so, I'll try to release a new version of emuR asap...

FredrikKarlssonSpeech commented 3 years ago

Sorry, I did not see your previous message but I have installed a

> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34296)!
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> #requery_hier(gu,patakaSylls,"Task") -> patakaSyllTasks #Not needed?
> requery_hier(gu,patakaC_V,"Task") -> patakaCVSyllTasks
Warning message:
In requery_hier(gu, patakaC_V, "Task") :
  Found missing items in resulting segment list! Replaced missing rows with NA values.
> requery_hier(gu,patakaC_V,level="Syllable",collapse=FALSE) -> patakaCVSylls
Warning message:
In requery_hier(gu, patakaC_V, level = "Syllable", collapse = FALSE) :
  Length of requery segment list (17181) differs from input list (34296)!
> nrow(patakaCVSylls)
[1] 17181
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> nrow(patakaSylls)
[1] 17181
> requery_hier(gu,patakaC_V,"Task") -> patakaCVSyllTasks
Warning message:
In requery_hier(gu, patakaC_V, "Task") :
  Found missing items in resulting segment list! Replaced missing rows with NA values.
> nrow(patakaCVSyllTasks)
[1] 34296

So it seems that the issue remains. I would expect the introduced NAs, as the linking is off in 4 instances, but it seems that what requery_hier returns is a list of unique segments.

patakaC_V simply contains Cs and Vs, which belong in pairs to a syllable. So patakaCVSylls should predominantly contain two identical rows for each syllable (except in the cases where the pairing does not hold).

So, this code:

> nrow(patakaCVSylls)
[1] 17181
> patakaCVSylls %>%
+     distinct() -> patakaSylls
> nrow(patakaSylls)
[1] 17181

should not actually return the same result. The first nrow should be 34296, and only the second (after distinct()) should be 17181.
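
A minimal sketch of the row counts I would expect, using toy data only (not the GU_emuDB database). The names `cv`, `sylls` and `syll_id` are made up for illustration, and base R `match()` merely stands in for the hierarchical lookup; it is not how emuR implements requery_hier:

```r
# Toy stand-in data; all labels and IDs are invented for illustration.
# Each C and V segment points to its parent syllable via syll_id.
cv <- data.frame(
  labels  = c("C", "V", "C", "V", "C", "V"),
  syll_id = c(1, 1, 2, 2, 3, 3)
)
sylls <- data.frame(
  syll_id = c(1, 2, 3),
  labels  = c("pa", "ta", "ka")
)

# Expected behaviour with collapse = FALSE: one result row per input row,
# so the C and the V of a syllable both return that same syllable.
expected <- sylls[match(cv$syll_id, sylls$syll_id), ]

nrow(expected)          # equals nrow(cv)
nrow(unique(expected))  # equals the number of distinct syllables
```

Here `unique()` plays the role of `distinct()` in the transcript above: the row count should only drop to the number of syllables after deduplication, not before.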