IPS-LMU / emuR

The main R package for the EMU Speech Database Management System (EMU-SDMS)
http://ips-lmu.github.io/EMU.html

Can not extract data for a row of the segment list #194

Closed FredrikKarlssonSpeech closed 3 years ago

FredrikKarlssonSpeech commented 6 years ago

Hi,

Using the database I uploaded before (sans wave files), I am struggling to get track data.

> load_emuDB("DDKDB_emuDB/") -> ddk
INFO: Checking if cache needs update for 1 sessionsand  333 bundles ...
INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
INFO: Nothing to update!
> oneSegment <- query(ddk, "[CV =~.* ^ Num(Syllable,CV) ==1]")
Warning message:
package ‘bindrcpp’ was built under R version 3.4.4 
> get_trackdata(ddk,oneSegment,"pitch",resultType="tibble") -> oneSegmentPitch

  INFO: parsing 1839 pitch files
  |=======================                                                                                                                           |  15%
Error in get_trackdata(ddk, oneSegment, "pitch", resultType = "tibble") : 
  Can not extract data for the 284th row of the segment list:  116922.114583333 149919.96875 PDNC:149_706666666667_fb7b678745f4e47b02417fa446a51b47 bdc723cc-db7f-4956-8801-e87091c6ec6e PDNC 149_706666666667_fb7b678745f4e47b02417fa446a51b47 138 138 CV 86 86 SEGMENT 5612262 7196158 48000 start and/or end times out of bounds
In addition: Warning message:
In get_trackdata(ddk, oneSegment, "pitch", resultType = "tibble") :
  The emusegs/emuRsegs object passed in refers to bundles with in-homogeneous sampling rates in their audio files! Here is a list of all refered to bundles incl. their sampling rate: 
    session                                              name                                             annotates sample_rate
1      PDNC            10_08_c9065d43d3212757971834bf4d3d067c            10_08_c9065d43d3212757971834bf4d3d067c.wav       48000
2      PDNC          100_288_89b28128373b4f2c1dbd7154c23b7105          100_288_89b28128373b4f2c1dbd7154c23b7105.wav       48000
3      PDNC 101_066666666667_f91e96438ac50475f72157eec428ce67 101_066666666667_f91e96438ac50475f72157eec428ce67.wav       48000
4      PDNC           101_28_8c130e18a811f94ff48710a8981a9677           101_28_8c130e18a811f94ff48710a8981a9677.wav       48000
5      PDNC 101_436598639456_06f4dcc7b980fb58577597ccbee5ce2e 101_436598639456_06f4dcc7b980fb58577597ccbee5ce2e.wav       44100
6      PDNC 101_450666666667_9d7ef21efb3b109c2dc64 [... truncated]

The transcriptions were created in Praat and have not been manipulated outside of the Praat and emuR programs.

raphywink commented 6 years ago

have you looked at the 284th element of your segmentlist in the EMU-webApp?

FredrikKarlssonSpeech commented 6 years ago

Yes. The query is wrong, of course, and I would not have wanted pitch extracted from that segment if I were doing this for research purposes.

But the point, I think, is that get_trackdata() could not have known that the query was incorrectly specified, and it should have returned the pitch data it was able to extract, or an NA where there was no data. That the data is missing from the database is perhaps not an "Error" as such.

I had a brief look at the code, but I think I need a better understanding of the indexing going on here to find the error. https://github.com/IPS-LMU/emuR/blob/e76a33ce322e3daf04280d191c92000438907575/R/emuR-get_trackdata.R#L386-L392

Regardless, I think a get_trackdata() call based on a valid database and a segment list that was returned by Emu should perhaps not return an error?

raphywink commented 6 years ago

get_trackdata() doesn't care about the query. It is only interested in the start/end times as well as the bundle/session/utts columns. If you have segments that don't contain any samples:

[screenshot, 2018-05-08 11:19:26: segment containing no samples]

or not enough:

[screenshot, 2018-05-08 11:28:16: segment containing too few samples]

get_trackdata() will throw an error!

FredrikKarlssonSpeech commented 6 years ago

Is it not more reasonable for errors to relate strictly to errors in the code, not to oddities in a database? That is what constructs like NA, NaN, Inf and potentially also warning() are for.

That is, I think, a key idea of data manipulation code. For instance, you can do silly things like this in R without the whole code segment you are executing stopping with an error message that is impossible to understand.

1/ 0:10
 [1]       Inf 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000 0.1111111 0.1000000

I get a result that clearly does not hide the fact that an odd situation occurred. For instance, if I compute the mean of that vector, I will surely not get a result that hides the Inf.
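As a small sketch of that propagation in base R (the vector `y` is made up for illustration):

```r
# Division by zero yields Inf instead of stopping with an error.
x <- 1 / 0:10

# The Inf propagates into the summary rather than being hidden.
mean_x <- mean(x)

# NA behaves the same way unless it is removed explicitly.
y <- c(1, NA, 3)
mean_y <- mean(y)
mean_y_clean <- mean(y, na.rm = TRUE)

print(c(mean_x, mean_y, mean_y_clean))
```

The odd values stay visible until the caller makes an explicit decision such as `na.rm = TRUE`.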

And you can do the same even in more complex situations, both in base R and with the more modern approaches:

df <- data.frame(groups=factor(sample(LETTERS[1:4],10,TRUE),levels=c(LETTERS[1:4],"G")),values=rnorm(10))
with(df, tapply(values, list(groups), FUN=mean))
         A          B          C          D          G 
-0.2843233  0.5377792 -2.0218621 -0.1524356         NA 

library(dplyr)

df %>% group_by(groups) %>% summarise_all(mean)
# A tibble: 4 x 2
  groups values
  <fct>   <dbl>
1 A      -0.284
2 B       0.538
3 C      -2.02 
4 D      -0.152

So maybe NA should be returned when a track value could not be obtained?

raphywink commented 6 years ago

Users will not address these problems in their databases if they are not forced to, and in my opinion it is critically important to get users to fix these issues and generate "clean" data sets, i.e. emuDBs (which within the EMU-SDMS is usually a single serve() call with the problematic segments and one or two clicks for a few segments).

MJochim commented 6 years ago

@raphywink I am not sure that your examples would indeed constitute an invalid database. I mean it can be perfectly reasonable for some segments to be really short.

With wrassp's default sampling rate for track data being 200 Hz, a segment just short of 5 ms might fall in between two track data samples. One example that comes to mind is VOT, where you may well have a bunch of segments where most are around 30 ms, but some are around 0 ms. I think I even had this exact problem once, and I had to filter out the very short segments in order to be able to get track data. I haven't thought this through completely but it might be useful to get NA values instead of an error.
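The arithmetic can be sketched in a few lines of base R. This assumes track frames sit on a regular 5 ms grid starting at 0 ms, which is a simplification and may not match how wrassp actually places its frames; `frames_in_segment` is a hypothetical helper, not an emuR or wrassp function.

```r
# Hypothetical helper: times (in ms) of the track frames that fall inside
# [start_ms, end_ms], assuming frames at 0, 5, 10, ... ms (200 Hz).
frames_in_segment <- function(start_ms, end_ms, frame_rate_hz = 200) {
  step_ms <- 1000 / frame_rate_hz
  first <- ceiling(start_ms / step_ms)  # index of first frame at or after start
  last <- floor(end_ms / step_ms)       # index of last frame at or before end
  if (last < first) return(numeric(0))  # segment lies between two frames
  seq(first, last) * step_ms
}

short_vot <- frames_in_segment(101, 104)  # 3 ms segment: no frame inside
long_vot <- frames_in_segment(101, 131)   # 30 ms segment: frames at 105 ... 130 ms

print(length(short_vot))
print(long_vot)
```

Under that assumption, the 3 ms segment yields no frames at all, while the 30 ms segment yields six.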

raphywink commented 6 years ago

get_trackdata() is used to say "for these segments, give me the trackdata", and for certain segments it simply can't (for the reasons given above). The question is: is that an error, or something that should just quietly work (e.g. NA insertion)? If it is an error, the user is forced to understand why something just went wrong and then manually filter out the "bad" segments.

MJochim commented 6 years ago

Well put. And I think the arguments for both options are valid.

On the one hand, users should definitely be required to understand their data. If they do, it won't be much trouble to, as you put it, filter out the bad segments. But it will still be work to do.

And on the other hand, I can be well aware of why those NA values arise and still want an unfiltered result set. As you know, it is often very useful in R not to change the length of a vector. Even just counting the NA values might be something I want to do.

A compromise might be to return the unfiltered result set (with NA inserted where appropriate) and issue a warning – although I know users usually ignore warnings. Off the top of my head, I can't think of a specific function, but I am pretty sure I have seen warnings like "Warning: NA inserted for missing values" in some R packages.
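A minimal sketch of that compromise in base R, on made-up toy data. `mean_track_or_na` is a hypothetical helper and not part of emuR; it only illustrates the "NA plus one warning" pattern, not how get_trackdata() works internally.

```r
# Toy track: one value every 5 ms (values are invented for illustration).
track <- data.frame(time_ms = seq(0, 100, by = 5))
track$f0 <- 100 + track$time_ms / 10

# Toy segment list (times in ms); the second segment contains no frame.
segs <- data.frame(start = c(10, 41, 60), end = c(30, 44, 80))

# Hypothetical helper: mean track value per segment, NA where no frame
# falls inside the segment, plus a single summary warning.
mean_track_or_na <- function(segs, track) {
  res <- vapply(seq_len(nrow(segs)), function(i) {
    inside <- track$time_ms >= segs$start[i] & track$time_ms <= segs$end[i]
    if (any(inside)) mean(track$f0[inside]) else NA_real_
  }, numeric(1))
  n_missing <- sum(is.na(res))
  if (n_missing > 0) {
    warning(n_missing, " segment(s) contained no track samples; NA inserted",
            call. = FALSE)
  }
  res
}

means <- mean_track_or_na(segs, track)  # emits one warning
print(means)
print(which(is.na(means)))              # rows of the segment list to inspect
```

The result keeps one entry per row of the segment list, so counting or locating the NA values afterwards is straightforward.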

FredrikKarlssonSpeech commented 6 years ago

An alternative could also be to return a result set with NAs in it together with a segment list containing the segments that should really be checked, as a list or a result tuple. As in

list(trackdata=<the trackdata with NAs>,erroneous_segments=<segment list>)

This makes constructing analysis scripts messier, but the user cannot simply ignore the problem the way a warning() can be ignored.

raphywink commented 6 years ago

As get_trackdata() has the resultType="XYZ" parameter, I don't think changing the result type depending on whether there were problems extracting the data is a good idea. I'd say either error or NA insertion + warning like @MJochim said.

FredrikKarlssonSpeech commented 6 years ago

I have no problem with a warning() + NA solution. The point, however, was exactly the change in type: that is something the user cannot ignore. I actually agree with you that warnings are usually ignored.

Anyway, the error message should say something about what to do about the problem. Perhaps instructions on how to get the problematic segments into a segment list and then how to serve() it from the database?
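A rough sketch of what such instructions could amount to, using a toy data frame in place of a real segment list. The column names `labels`, `start`, `end`, `session` and `bundle`, the bundle names, and the 5 ms threshold are assumptions for illustration; a real segment list would come from query().

```r
# Toy stand-in for a segment list (start/end in ms); all values are invented.
segs <- data.frame(
  labels = c("pa", "ta", "ka"),
  start = c(100, 250, 400),
  end = c(180, 252, 470),
  session = "PDNC",
  bundle = c("b1", "b1", "b2"),
  stringsAsFactors = FALSE
)

# Assumed threshold: one frame period at 200 Hz. Adjust to the track's
# actual frame rate.
min_dur_ms <- 5
too_short <- (segs$end - segs$start) < min_dur_ms

problem_segs <- segs[too_short, ]  # candidates for manual inspection
usable_segs <- segs[!too_short, ]  # rows to pass on to get_trackdata()

print(problem_segs)

# With a real emuDB handle and a real segment list, the problematic rows
# could then be opened in the EMU-webApp, for example:
#   serve(ddk, seglist = problem_segs)
```

This only catches segments that are too short; segments whose times lie outside the track file, as in the error above, would need a separate check.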

raphywink commented 3 years ago

closing due to inactivity. Please reopen if still an issue