Closed FredrikKarlssonSpeech closed 3 years ago
have you looked at the 284th element of your segmentlist in the EMU-webApp?
Yes, and the query is wrong of course and I would not have wanted pitch from that segment to be extracted if I was doing this for research purposes.
But the point is, I think, that get_trackdata() could not have known that the query was incorrectly specified, and I think it should have returned the pitch data it was able to extract, or an NA if there was no data. That the data is missing in the database is maybe not an "Error" as such.
I had a brief look at the code but I think I need a better understanding of the indexing going on here to find the error. https://github.com/IPS-LMU/emuR/blob/e76a33ce322e3daf04280d191c92000438907575/R/emuR-get_trackdata.R#L386-L392
Regardless I think a get_trackdata() call based on a valid database and a segment list that was returned by Emu should maybe not return an error?
get_trackdata() doesn't care about the query. It is only interested in the start/end times as well as the bundle/session/utts columns. If you have segments that don't contain any samples, or not enough, get_trackdata() will throw an error!
Is it not more reasonable that errors strictly relate to an error in the code, not oddities in a database? That is what constructs like NA, NaN, Inf and potentially also warning() are for.
That is I think a key idea of data manipulation code. For instance, you can do silly things like this in R and not have the whole code segment that you are executing stop with an error message that is impossible to understand.
1/ 0:10
[1] Inf 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000 0.1111111 0.1000000
I get a result that clearly does not hide the fact that an odd situation occurred. For instance, if I compute the mean of that vector, I will surely not get a result that hides the Inf.
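As a quick check of that claim, here is a minimal sketch using only base R. The variable name `x` is just for illustration.

```r
# The division by zero yields Inf rather than an error
x <- 1 / 0:10

# The Inf propagates into the summary instead of being silently dropped
mean_all <- mean(x)

# The user has to opt in explicitly to ignore the odd value
mean_finite <- mean(x[is.finite(x)])
```

The odd value stays visible until the user decides what to do with it, which is the behaviour being argued for here.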
And you can do the same even in more complex situations, both in base R and the more modern approaches
df <- data.frame(groups=factor(sample(LETTERS[1:4],10,TRUE),levels=c(LETTERS[1:4],"G")),values=rnorm(10))
with(df, tapply(values, list(groups), FUN=mean))
A B C D G
-0.2843233 0.5377792 -2.0218621 -0.1524356 NA
library(dplyr)
df %>% group_by(groups) %>% summarise_all(mean)
# A tibble: 4 x 2
groups values
<fct> <dbl>
1 A -0.284
2 B 0.538
3 C -2.02
4 D -0.152
So maybe NA should be returned when a track value could not be obtained?
Users will not address these problems in their databases if they are not forced to, and in my opinion it is critically important to get users to fix these issues and generate "clean" data sets, i.e. emuDBs (which within the EMU-SDMS is usually a single serve with the problematic segments and one or two clicks for a few segments).
@raphywink I am not sure that your examples would indeed constitute an invalid database. I mean it can be perfectly reasonable for some segments to be really short.
With wrassp's default sampling rate for track data being 200 Hz, a segment just short of 5 ms might fall in between two track data samples. One example that comes to mind is VOT, where you may well have a bunch of segments where most are around 30 ms, but some are around 0 ms. I think I even had this exact problem once, and I had to filter out the very short segments in order to be able to get track data. I haven't thought this through completely but it might be useful to get NA values instead of an error.
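To make the arithmetic concrete, here is a rough sketch in plain R (not emuR or wrassp code). It assumes the track samples lie on a regular 5 ms grid starting at 0 ms; the real sample positions may be offset. The helper `count_samples()` is hypothetical.

```r
# Assumed: 200 Hz track data, i.e. one sample every 5 ms, on a grid starting at 0
sample_rate_hz <- 200
sample_times_ms <- seq(0, 100, by = 1000 / sample_rate_hz)

# Hypothetical helper: how many track samples fall inside a segment?
count_samples <- function(start_ms, end_ms, times = sample_times_ms) {
  sum(times >= start_ms & times <= end_ms)
}

n_long  <- count_samples(10, 40)    # a 30 ms segment
n_short <- count_samples(11, 14.5)  # a 3.5 ms segment between the samples at 10 and 15 ms
```

Under these assumptions the 30 ms segment covers several samples, while the 3.5 ms segment covers none, even though both are perfectly valid annotations.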
get_trackdata() is used for "for these segments give me the trackdata" and get_trackdata() can't because for a certain segment it can't (reasons given above). The question is: Is that an error or something that should just quietly work (e.g. NA insertion)? If it is an error, the user is forced to understand why something just went wrong and then manually filter out the "bad" segments.
Well put. And I think the arguments for both options are valid.
On the one hand, users should definitely be required to understand their data. If they do, it won't be much trouble to, as you put it, filter out the bad segments. But it will still be work to do.
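For reference, the manual filtering step could look roughly like the sketch below. It uses a made-up segment list as a plain data frame. The column names `start` and `end` (in ms) and the 5 ms threshold (one sample period at 200 Hz) are assumptions for illustration, not something get_trackdata() prescribes.

```r
# Made-up segment list; `start` and `end` are assumed to be in milliseconds
segs <- data.frame(
  labels = c("p", "t", "k", "t"),
  start  = c(100, 250, 400.0, 610),
  end    = c(130, 252, 400.5, 645),
  stringsAsFactors = FALSE
)

# Assumed threshold: one sample period at 200 Hz
min_dur_ms <- 1000 / 200

# Split into segments long enough to contain a sample and those to inspect
dur_ms  <- segs$end - segs$start
kept    <- segs[dur_ms >= min_dur_ms, ]
dropped <- segs[dur_ms < min_dur_ms, ]
```

Keeping `dropped` around rather than discarding it gives the user a list of segments to inspect in the database.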
And on the other hand, I can be well aware of why those NA values arise and still want an unfiltered result set. As you know, it is often very useful in R not to change the length of a vector. Even just counting the NA values might be something I want to do.
A compromise might be to return the unfiltered result set (with NA inserted where appropriate) and issue a warning – although I know users usually ignore warnings. Off the top of my head, I can't think of a specific function, but I am pretty sure I have seen warnings like "Warning: NA inserted for missing values" in some R packages.
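A minimal sketch of what this compromise could look like, written as a standalone function rather than as a change to emuR. The function `segment_means()` and its arguments are hypothetical. It returns one value per segment, inserts NA where a segment contains no samples, and issues a single warning.

```r
# Hypothetical helper, not part of emuR: mean track value per segment,
# with NA (plus one warning) for segments that contain no samples
segment_means <- function(starts_ms, ends_ms, times_ms, values) {
  out <- vapply(seq_along(starts_ms), function(i) {
    inside <- times_ms >= starts_ms[i] & times_ms <= ends_ms[i]
    if (any(inside)) mean(values[inside]) else NA_real_
  }, numeric(1))
  if (anyNA(out)) {
    warning(sum(is.na(out)), " segment(s) contained no track samples; NA inserted")
  }
  out
}

times_ms <- seq(0, 50, by = 5)        # assumed 5 ms sample grid
values   <- seq(100, 200, by = 10)    # made-up track values
res <- segment_means(c(10, 31), c(20, 34), times_ms, values)
```

The result keeps the same length as the input segment list, so the NA values can be counted or filtered afterwards with `is.na(res)`.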
An alternative could also be to return a result set with NAs in it and a segment list containing the segments that should really be checked - but as a list or a result tuple. As in
list(trackdata=<the trackdata with NAs>,erroneous_segments=<segment list>)
This makes construction of analysis scripts more messy, but the user cannot just ignore the warnings().
As get_trackdata() has the resultType="XYZ" parameter, I don't think changing the result type depending on whether there were problems extracting the data is a good idea. I'd say either error or NA insertion + warning like @MJochim said.
I have no problems with a warning() + NA solution. The point was however exactly the change in type of course. That is something the user cannot ignore. I actually agree with you that warnings() are usually ignored.
Anyway, the error message should say something about what to do about the problem. Perhaps instructions on how to get the problematic segments into a segment list and then how to serve() it from the database?
closing due to inactivity. Please reopen if still an issue
Hi,
Using the database I uploaded before (sans wave files) I struggle to get track data.
The transcriptions were created in Praat and have not been manipulated outside of the Praat and emuR programs.