ctmm-initiative / ctmmweb

Web app for analyzing animal tracking data, built upon ctmm R package
http://biology.umd.edu/movement.html
GNU General Public License v3.0

warning and problems with some dataset #78

Closed xhdong-umd closed 5 years ago

xhdong-umd commented 5 years ago

I'm running automated tests with all ctmm internal datasets.

@chfleming, with the sampled wolf data

xhdong-umd commented 5 years ago

Another minor point: do you think it would be a good idea to note in the internal datasets' descriptions whether they are anonymized or calibrated?

chfleming commented 5 years ago

Running the wolf data through outlie(), I am not getting any warnings.

chfleming commented 5 years ago

I ran automated fits on all 8 maned wolves (with level=1), ran summary on the model fit list, and ran summary on each individual model fit in the list. I didn't get any errors or warnings.

Wait, I forgot verbose=TRUE... still no errors or warnings.

xhdong-umd commented 5 years ago

Sorry, I will get some reproducible code tomorrow. It could be that I was testing with a 100-point sample, and that the outliers were calculated with error on.

xhdong-umd commented 5 years ago

Here is the code to reproduce the warnings. It was probably caused by the sampling, which makes the fitted model less normal.

library(ctmm)
library(ctmmweb)
data(wolf)
data_sample <- pick(wolf, 100)
model_try_res <- par_try_models(data_sample)
model_list <- unlist(model_try_res, recursive = FALSE)
summary(model_list[[5]])
Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
hrange <- akde(data_sample[["Loba"]], CTMM = model_list[["Loba.OUf isotropic"]])
summary(hrange)  # very large value of CI high
plot(hrange)  # thus the outer contour out of range
Warning messages:
1: In CI.UD(x, max(level.UD), max(level), P = TRUE) :
  Outer contour extends beyond raster.
2: In CI.UD(x[[i]], l, level, P = TRUE) :
  Outer contour extends beyond raster.
NoonanM commented 5 years ago

It appears to be related to the OUΩ anisotropic model.

chfleming commented 5 years ago

I ran the code until I got a sample that reproduced the warning with summary(). It was an OUΩ model, and I have fixed the warning in the master branch.

However, when I ran ctmm.select() from the command line on the same dataset, I didn't find this OUΩ model to be selected by AIC. Is your script missing the level=1 option?

xhdong-umd commented 5 years ago

We have a model summary table in the app which runs summary on every model, so summary will be called on a model even if it's not optimal. That being said, I didn't use level=1 on the model selection page.

Should I use that option?

chfleming commented 5 years ago

Ah... I am missing some edges in the stepwise regression of ctmm.select given the new OUf and OUΩ models. This is causing some bad OUf/OUΩ models to be selected over OUF, when you should ultimately step down to OU or even IID. Give me a few hours to fix this.

chfleming commented 5 years ago

Ok, I've fixed the other bug and am running check now... will push to GitHub momentarily. This is a very bad bug, so I am also going to try to push to CRAN ASAP.

chfleming commented 5 years ago

level=1 is safest but shouldn't be absolutely necessary unless something is very wrong with the shape of the likelihood function.

chfleming commented 5 years ago

Second fix is on GitHub and pushed to CRAN.

Previously, I did not conceive of cases where OUf/OUΩ would be selected over OUF, yet OU would be selected over both. This is because the ordering of the timescales is OU < OUF < OUf < OUΩ.

xhdong-umd commented 5 years ago

I tested with the updated ctmm. There is still the warning for the home range plot, but that should be expected because the contour is just too big to fit. Should we suppress that warning?

xhdong-umd commented 5 years ago

I tested all ctmm internal datasets with ctmm 0.5.4 in the web app. There are no warnings or problems on any page, though I do see that speed estimation on the sampled wolf data still takes a long time. With other datasets it takes about 300 s, but with the sampled wolf data it has been running for 30 min and has not finished yet.

Is this normal, or something we should consider improving?

I can generate some reproducible code for this if needed.

chfleming commented 5 years ago

One of the wolf datasets is very long, and you are then coarsening it down (which increases uncertainty in the trajectory). I can see how that would be a problematic calculation.

xhdong-umd commented 5 years ago

Is it possible to automatically give a reasonable estimate of how long the calculation will take, based on the dataset? It doesn't have to be an actual time (which is impossible to predict, since it depends on the user's computer), just rough "short, medium, long" estimates.

chfleming commented 5 years ago

In principle, I could take the metric behind the progress bar and pass that to an environment variable. Could you do something with that?

Also, with speeds(), are you parallelizing over individuals or within the speeds() function? I think parallelizing within the speeds() function should be faster, because it is embarrassingly parallel and the individuals may differ considerably in cost, which is definitely the case with the wolves.
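The load-balancing point can be sketched with some illustrative numbers (the costs below are made up for illustration, not measured from the wolf data):

```r
# Hypothetical per-individual costs, in arbitrary work units.
costs <- c(A = 2, B = 2, C = 14)
n_cores <- 3

# Parallelizing over individuals: each core takes one individual,
# so wall time is dominated by the most expensive one.
wall_over <- max(costs)              # 14

# Parallelizing within the function: the work is embarrassingly
# parallel, so it can be split evenly across the cores.
wall_within <- sum(costs) / n_cores  # 6

c(over_individuals = wall_over, within_function = wall_within)
```

With even per-individual costs the two schemes tie; the more uneven the individuals, the more the within-function split wins.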

xhdong-umd commented 5 years ago

I investigated the progress bar approach before. It's easy to show a progress bar in the console, but I'm not sure whether it can update the bar in the app; I'll need to look into it.

For speeds() it's parallelized over individuals now. I'll try parallelizing within speeds() and compare the results.

xhdong-umd commented 5 years ago

Should I use speed() or speeds()? I think the page is for average speed, so I was using speed().

chfleming commented 5 years ago

speed()

chfleming commented 5 years ago

I created a bad bug in summary() when fixing the min & max warnings. Now the tau CIs all run from 0 to Inf... and I just pushed to CRAN because the other bug was so bad.

xhdong-umd commented 5 years ago

Uh-oh, that happens often... Last time I introduced a Movebank bug with changes to the data import code that were meant to make it safer.

For speed, parallelizing inside should be better. One advantage is that speed is not available for some models, so assigning cores to those would be a waste.

On the sampled buffalo data, speed took 3 s instead of 9 s. For the sampled wolf data it's still slow, and the progress bar stayed at 0% after quite some time.

To make the progress value available to the app, another approach is to have your code take a progress function as a parameter. Given a console progress function, it shows progress in the console; given a web app progress function, it shows a progress bar in the web app. This way you don't have to expose internal progress values through global variables.
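A minimal sketch of that callback pattern in base R (slow_task and console_progress are hypothetical names, not ctmm or ctmmweb API):

```r
# The long-running function accepts a `progress` callback, so the caller
# decides how progress is displayed (console, Shiny, or not at all).
slow_task <- function(n, progress = NULL) {
  total <- 0
  for (i in seq_len(n)) {
    total <- total + i                       # stand-in for the real work
    if (!is.null(progress)) progress(i / n)  # report fraction complete
  }
  total
}

# Console flavor: print a percentage in place.
console_progress <- function(frac) cat(sprintf("\r%3.0f%% ", 100 * frac))
# A Shiny flavor could instead wrap something like shiny::setProgress(frac).

slow_task(4, console_progress)
```

The key point is that slow_task never knows which UI it is reporting to, so no internal state has to leak into global variables.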

There is probably no time to change the speed part before the course; I'll work on it and put it in the development version.

xhdong-umd commented 5 years ago

I'm testing all internal datasets with the newest version of ctmm. @chfleming With the sampled coatis, some models get a really big speed value; is that normal?

(screenshot: 2019-02-09 12:58:36 pm)

chfleming commented 5 years ago

That's totally normal for a continuous-velocity model that is not the selected model. As the data become increasingly coarse, DOF[speed] approaches zero, and the speed estimate blows up. CIs look appropriately wide. There is almost no information in the data regarding speed. The selected model (OU) actually has infinite speed. You could replace those NA values with Inf (0,Inf) if you want.
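Substituting the (0, Inf) convention for the undefined entries could look like this (a sketch; the column names are illustrative, not the app's actual summary table):

```r
# Speed summaries for two models; the OU model has no finite mean speed,
# so its ML and high entries come back as NA.
speed_tab <- data.frame(
  model = c("OUF", "OU"),
  low   = c(0.12, 0),
  ML    = c(0.45, NA),
  high  = c(2.10, NA)
)

# Replace the NA entries with Inf, giving the OU model the (0, Inf) interval.
speed_tab$ML[is.na(speed_tab$ML)]     <- Inf
speed_tab$high[is.na(speed_tab$high)] <- Inf
```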

xhdong-umd commented 5 years ago

Another result, with gazelle, that is maybe also normal; just to confirm:

A big value for the τ period in the last one:

(screenshot: 2019-02-09 1:22:55 pm)

Big home ranges:

(screenshot: 2019-02-09 1:27:31 pm)

chfleming commented 5 years ago

That's normal if it isn't the selected model and the CIs are appropriately wide. That feature turns off as the period limits to Inf: an infinite oscillation period means that it doesn't oscillate.

xhdong-umd commented 5 years ago

I'm not sure why the unit for τ is microseconds for the turtle data.

(screenshot: 2019-02-09 1:46:58 pm)

xhdong-umd commented 5 years ago

The actual values are like this:

                           low       ML      high
area (hectares)       1.135267 1.395365  1.681888
τ[position] (minutes) 0.000000 2.641263  5.904736
τ[velocity] (seconds) 0.000000 0.000000 36.829864

The ML value of τ[velocity] is very small (maybe it should be 0 but was stored as a very small value), so the function chose the smallest unit possible, which in turn makes the high value of 36 s a very big number in microseconds.

I'm not sure about the details of how this happened; I need to look at the code.

xhdong-umd commented 5 years ago

The unit-picking function was looking at the median value of a vector, then choosing the best unit for the whole vector. The median value is almost 0 in this case (there are multiple models), so the smallest unit was chosen.

What should we do in this case?

xhdong-umd commented 5 years ago

In picking units we take the smallest unit when the value is very small. I'm thinking maybe we should set a threshold: if even the smallest unit would leave a very small value, we may as well just use the SI unit and let the value be almost 0.

chfleming commented 5 years ago

What's the ratio between the ML value and high CI?

xhdong-umd commented 5 years ago

It seems that the ML value is just 0.

                           low       ML      high
area (hectares)       1.135267 1.395365  1.681888
τ[position] (minutes) 0.000000 2.641263  5.904736
τ[velocity] (seconds) 0.000000 0.000000 36.829864

chfleming commented 5 years ago

In summary.ctmm, if the ML value is zero then I switch to the high CI, like this:

# use the nonzero ML values when there are any; otherwise fall back to the high CIs
NONZERO <- (ML > .Machine$double.eps)
if(any(NONZERO)) { TEST <- ML[NONZERO] }
else { TEST <- high }
TEST <- stats::median(TEST)

and then base the units on TEST
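Wrapped as a function and applied to the turtle τ[velocity] values above (unit_basis is just a hypothetical name for the snippet, not ctmm's actual internal function):

```r
# Median of the nonzero ML values; fall back to the high CIs when all
# ML values are numerically zero.
unit_basis <- function(ML, high) {
  NONZERO <- (ML > .Machine$double.eps)
  if (any(NONZERO)) TEST <- ML[NONZERO] else TEST <- high
  stats::median(TEST)
}

unit_basis(ML = 0, high = 36.829864)               # falls back to 36.83 -> seconds
unit_basis(ML = c(2.64, 0), high = c(5.9, 36.83))  # uses the nonzero ML: 2.64
```

So a zero ML no longer drags the unit choice down to microseconds; the high CI carries the scale instead.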

chfleming commented 5 years ago

Actually, disregard the ML/high ratio question. That would have had bad results and isn't what I do.

xhdong-umd commented 5 years ago

Do you mean just exclude ML and high value in unit calculation?

chfleming commented 5 years ago

No, sorry, I edited my code. I mean what I have posted now.

xhdong-umd commented 5 years ago

OK, I'll try this option. Because I have a unified function to process all columns that need to be formatted with units, it's not easy to make this change in the current structure. The change probably won't make it into this release.

The app and package were updated to 0.2.5, and the hosted app was updated with the latest ctmm too.

xhdong-umd commented 5 years ago

In my code I need to format all the columns in a table, and all models need to be formatted with the same unit, so I have a function that checks the whole column and then determines the unit. The ML/low/high values are different rows of the same column at this stage (later they are reshaped into wider columns, but rows are easier to process), so it's a little difficult to separate the ML/high values here.

It can be done; it will just need quite a bit of extra structure.

I'm wondering if we can just exclude all zeros when checking the unit? A zero value can be expressed in any unit, and zeros really bring no information to determining the unit; instead they skew the median (making the function think the values are small and need the smallest unit).

I think we can simply exclude them and then take the median. Nothing on the ctmm side needs to change; I only need to add a check when picking the unit, and it will work on all columns. If some non-CI column has a similar pattern of many zeros and then a big number, which would cause the same problem in the old code and in the ML/high method, this approach fixes that case too.

xhdong-umd commented 5 years ago

Also, I think we can raise the bar a little for "nonzero". .Machine$double.eps is 2.220446e-16 on my machine, but I think any value smaller than about 1e-9 can already be treated as zero when considering units. There is not much difference between 3e-9 microseconds and 3e-12 seconds, and the latter is easier to compare and reason about with SI units.
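The two ideas together, as a sketch (pick_unit_value and the 1e-9 cutoff are assumptions for illustration, not the actual ctmmweb code):

```r
# Median of the values that are meaningfully nonzero; if everything is
# below the threshold, return 0 so the caller can fall back to SI units.
pick_unit_value <- function(x, eps = 1e-9) {
  nonzero <- x[x > eps]
  if (length(nonzero) == 0) return(0)
  stats::median(nonzero)
}

# low/ML/high of tau[velocity] in seconds, stacked as rows of one column:
pick_unit_value(c(0, 0, 36.829864))  # 36.83 -> seconds, not microseconds
pick_unit_value(c(0, 3e-12, 0))      # 0 -> treat as zero, use the SI unit
```

Because the check runs on whole columns, it applies uniformly to CI and non-CI columns without the app having to tell them apart.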

xhdong-umd commented 5 years ago

This is much easier to implement. I implemented it and it looks good.

One minor question: is hm² a commonly used unit, well known to regular users? I myself had to search to learn what it is, and even after that it's hard for me to get a feel for how large 1 hm² is. Though maybe it's common for animal-tracking people?

chfleming commented 5 years ago

Hectares are reasonably common, though people might not recognize that a hectare is a square hectometer.