ctmm-initiative / ctmmweb

Web app for analyzing animal tracking data, built upon ctmm R package
http://biology.umd.edu/movement.html
GNU General Public License v3.0

Model selection, home range, occurrence #27

Closed xhdong-umd closed 7 years ago

xhdong-umd commented 7 years ago

@chfleming @jmcalabrese I wrote a script to do model selection and home range for a list of individuals, using parallel processing when possible (a sketch of the approach is shown after the questions below). This is the result of the script.

Questions:

  1. Is there any problem with the script doing model selection and home range?
  2. Sometimes only one model is returned from ctmm.select even with verbose = TRUE; that's normal, right?
  3. Should we pass all the individuals as a list to akde so it calculates them on the same grid? This means we cannot parallelize this step. The time cost for this step is about 34 s for 6 individuals.
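
A minimal sketch of the kind of script described above (illustrative only; tele_list, fit_one and the model choice are placeholders), assuming a list of telemetry objects from as.telemetry():

library(ctmm)
library(parallel)

# fit and rank movement models for one individual
fit_one <- function(animal) {
  guess <- ctmm.guess(animal, interactive = FALSE)   # automated initial parameter guess
  ctmm.select(animal, CTMM = guess, verbose = TRUE)  # keep all attempted models
}

# fork-based parallelism (Linux/Mac); Windows needs a socket cluster instead
model_list <- mclapply(tele_list, fit_one, mc.cores = detectCores())

# AKDE home range from each individual's best (first) model
hrange_list <- mapply(function(animal, models) akde(animal, CTMM = models[[1]]),
                      tele_list, model_list, SIMPLIFY = FALSE)
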
xhdong-umd commented 7 years ago

In a Windows VM, it took 68 s to select a model for 1 individual, 153 s for 6 individuals, and 59 s to run akde. The parallel speedup is 2.66x, not too bad.

xhdong-umd commented 7 years ago

For the model summaries, I wanted to convert the text format to a table so we can list multiple individuals together and compare them easily. It will definitely save some space in the app. So instead of

$Cilla
$Cilla$`OUF anisotropic`
$Cilla$`OUF anisotropic`$DOF
    mean     area 
13.16465 22.59234 

$Cilla$`OUF anisotropic`$CI
                                low         ML      high
area (square kilometers) 280.308976 444.213143 645.24293
tau position (days)        3.498487   5.998142  10.28379
tau velocity (minutes)    43.708867  46.725486  49.95030
speed (kilometers/day)    15.441216  15.705364  15.96945

$Gabs
$Gabs$`OUF anisotropic`
$Gabs$`OUF anisotropic`$DOF
    mean     area 
3.845523 6.615316 

$Gabs$`OUF anisotropic`$CI
                                low        ML      high
area (square kilometers) 164.695983 423.07464 801.21216
tau position (days)        3.423846  14.54973  61.82950
tau velocity (minutes)    21.474275  23.76911  26.30918
speed (kilometers/day)    13.348792  13.74202  14.13509

We can have something like this

screen shot 2017-07-28 at 2 46 42 pm

However, after I looked at the summary.ctmm code, I found this is not easy.

  1. Some information may depend on parameters, e.g. whether errors are used, so the output format is not always the same.
  2. The unit information is in the row names. Because different individuals may have different values and use different units, we need to extract the unit information and put it in the values instead of the row names. Converting values back to SI units from the unit names is kind of messy, while getting the values from the ctmm object directly requires more understanding of the ctmm structure.
  3. It's also possible that summary.ctmm will change at some point in the future and my conversion code will break.

For now I think I'll just use the summary output directly and finish the main feature first. In the long term, do you think it would be beneficial to generate a summary table for a ctmm object (maybe via another function; summary.ctmm doesn't need to change since that is the common usage)?
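
A rough sketch of the conversion being discussed (hypothetical code, assuming a single fitted model fit): the unit lives in the row names of summary()$CI, e.g. "area (square kilometers)", so it has to be parsed out.

ci <- summary(fit)$CI
tbl <- data.frame(
  estimate = sub(" \\(.*\\)$", "", rownames(ci)),         # "area", "tau position", ...
  unit     = sub("^.*\\((.*)\\)$", "\\1", rownames(ci)),  # "square kilometers", ...
  ci, row.names = NULL                                    # low / ML / high columns
)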

chfleming commented 7 years ago

If you set units=FALSE, then you should always get SI units from summary.ctmm. Tell me if any other functions need this argument as well.

There is a level option for ctmm.select that determines how aggressively alternative models will be considered. If other models are very unlikely and level<1, then ctmm.select may only return one model with verbose=TRUE.

Getting all of the AKDEs on the same grid is for being able to (more quickly) calculate overlaps afterwards. I could potentially code that to parallelize, but with large datasets you can run out of RAM when calculating/manipulating large rasters, and I haven't looked into figuring out how to prevent that.
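
A minimal sketch of that usage (assuming ctmm's list interface to akde() and the overlap() function; tele_list and model_list are placeholders): passing the whole telemetry list with matching models puts all UDs on one grid.

best_models <- lapply(model_list, `[[`, 1)   # best model per individual
uds <- akde(tele_list, CTMM = best_models)   # all AKDEs on a common grid
overlap(uds)                                 # pairwise overlap estimates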

xhdong-umd commented 7 years ago

I didn't notice the units parameter! I think there are two approaches to get a table summary:

If you feel the table summary would be somewhat useful to ctmm (with a list of models it's also easier to get a summary table), I can add the code to summary.ctmm. Otherwise I can just work on the summary output and only use this in the app.

I knew about the level option in ctmm.select; I was just using the default options in my code. Do you think we need to use a value other than the default level?

It seems we'd better keep all AKDEs on the same grid. The time cost of this step is not too big, so I think it's OK.

chfleming commented 7 years ago

The only thing that should change about the output of summary.ctmm in the near term is the possible addition of further rows for additional parameters.

The default level argument should be fine as long as ctmm.select and ctmm.fit are working properly... and if they aren't, then that's something I try to fix with high priority.

xhdong-umd commented 7 years ago

@chfleming @jmcalabrese I've just finished model selection and home range. The app is updated here. The summaries of models and home range are still the direct print of summary; I'll work on table-format output in the next step.

I reorganized the app structure:

In the next step I'll work on

xhdong-umd commented 7 years ago

@chfleming I found that summary.ctmm with units = FALSE returns speed in m/day, which is not an SI unit. My conversion function was using ctmm:::unit with the speed dimension and expected m/s.

> summary(model_select_res[[1]][[2]], units = FALSE)
$DOF
    mean     area 
16.89156 30.89841 

$CI
                                low           ML         high
area (square meters)   2.716054e+08 400020869.08 552902892.77
tau position (seconds) 1.737586e+05    389310.18    872258.50
tau velocity (seconds) 0.000000e+00      6889.99     50144.14
speed (meters/day)     1.352319e-03     11499.30     35429.87

I can add an exception to my function, but maybe summary.ctmm should return m/s with units = FALSE?

xhdong-umd commented 7 years ago

I finished the model/home range summary table and updated the app. The table also has the same color theme as the data summary table on the visualization page.

It's tricky because some columns have a lot of digits; near-zero values may force the whole column to be formatted with many digits, so I had to round those columns off specifically. I'm using digits = 2 and rounding to 3 digits for some columns. If you feel more digits are needed, please just tell me.

The speed column I got was in m/day, so I'm manually converting it to m/s for now. If summary.ctmm is updated to return m/s with units = FALSE, I will update the code accordingly.
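
For reference, the manual conversion is just dividing by the number of seconds in a day (a trivial sketch, assuming a numeric speed vector):

speed_m_per_s <- speed_m_per_day / 86400  # 86400 seconds per day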

Do you think it's meaningful to have an option to hide the low and high values and only show the ML values? That would reduce the number of rows needed to present multiple models for multiple individuals.

xhdong-umd commented 7 years ago

@chfleming @jmcalabrese While working on occurrence, I found I hadn't added the option to use different models for the home range (probably spent too much time on the table summary format). I plan to use this structure for both home range and occurrence:

This should be very flexible.

For occurrence, I think it's useful to have a checkbox to add/hide a location overlay, like Figure S4.1 in appendix 4 of the 2016 MEE ctmm paper. However, I didn't find the proper code to do this. I tried the lines below, but they didn't work as I wanted. Is it possible to adjust the alpha of the overlay layer?

plot(ud)                                      # occurrence UD alone
plot(tele_list[[i]], UD = ud, col = "white")  # locations overlaid in white on the UD
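
One possible approach (my assumption, not code from the paper): pass a semi-transparent colour for the overlaid points via grDevices::adjustcolor, so the UD stays visible underneath.

plot(tele_list[[i]], UD = ud, col = adjustcolor("white", alpha.f = 0.3))  # 30% opaque points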

Finally, do you think it's important to add an export feature? There isn't enough time for a fancy report, but at least we could export the data, models and plots in a .Rdata file, and then the user can do more with them. If this is too rudimentary to be of much use, we can also hold off on the export feature and just implement the proper report later.

In the last meeting we mentioned that we need to start wrapping up the app sometime before the AniMove course. If we allow 1 week to wrap up, we still have time until Aug. 18 to work on features, right? I think we previously said to stop working on features one week after the meeting, but finishing occurrence properly will probably need more time.

xhdong-umd commented 7 years ago

I have made the app work with any model selection.

screen shot 2017-08-09 at 4 43 56 pm screen shot 2017-08-09 at 4 44 28 pm
Error in `row.names<-.data.frame`(`*tmp*`, value = c("min", "max")): invalid 'row.names' length

@chfleming I can share a script to reproduce the error if you want to have a look.

       user  system elapsed 
     56.903   4.659  61.555 
       user  system elapsed 
     29.489   2.667  32.159 
       user  system elapsed 
     41.873   3.874  45.743 
       user  system elapsed 
    156.817  82.872 239.649 
       user  system elapsed 
    437.447 193.244 630.608 
       user  system elapsed 
     28.931   2.928  31.854 
xhdong-umd commented 7 years ago

Sometimes there are 3 types of errors in the occurrence calculation or plot. They could be random: the same selection will work one time and fail the next.

running parallel with mclapply in cluster of 6 
Assertion failure at kmp_runtime.cpp(6480): __kmp_thread_pool == __null.
OMP: Error #13: Assertion failure at kmp_runtime.cpp(6480).
OMP: Hint: Please submit a bug report with this message, compile and run commands used, and machine configuration info including native compiler and operating system versions. Faster response will be obtained by including all program sources. For information on submitting this issue, please see http://www.intel.com/software/products/support/.

Warning in min(x) : no non-missing arguments to min; returning Inf
Warning in max(x) : no non-missing arguments to max; returning -Inf
Warning in min(x) : no non-missing arguments to min; returning Inf
Warning in max(x) : no non-missing arguments to max; returning -Inf
Warning in value[[3L]](cond) :
  Mvubu - OU anisotropic: Error in plot.window(...): need finite 'xlim' values

Warning in value[[3L]](cond) :
  Mvubu - OU anisotropic: Error in `row.names<-.data.frame`(`*tmp*`, value = c("min", "max")): invalid 'row.names' length

It seems the first error was caused by the combination of R 3.4, data.table, and mclapply.
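
A possible mitigation (an assumption on my part, not a confirmed fix): data.table's OpenMP threads are known to clash with fork-based parallelism, so restricting data.table to a single thread before forking may avoid the assertion failure (fit_one as in the earlier sketch).

data.table::setDTthreads(1)                                   # disable data.table's own threading
res <- parallel::mclapply(tele_list, fit_one, mc.cores = 6)   # then fork as before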

xhdong-umd commented 7 years ago

Since the user can select models for home range and occurrence, I went one step further and mapped the variograms to the selected models too. Any combination is possible, and the user can compare variograms of different models for the same individual.

I also pre-select the first model after model fitting finishes, so it's more obvious which models are used for the variograms and other pages.

vestlink commented 7 years ago

Would you also be able to get "area" when calculating occurrence for each animal?

@xhdong-umd mclapply: I don't think it will work on a Windows-based PC. I've had trouble using it in other calculations I've done. I know that Microsoft will release a Windows 10 version that will be capable of using 4 cores, but I don't know if it is out yet. I think Unix-based OSes are currently the only ones capable of multicore processing (please correct me if I am wrong :) )

xhdong-umd commented 7 years ago

@vestlink The app chooses different parallel methods based on the platform: mclapply for Linux/Mac, and socket-based parLapplyLB for Windows. You can try the app in local mode; the console messages will show some information about the parallel method and the time used in the process.
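
A simplified sketch of that dispatch (not the app's exact code):

run_parallel <- function(x, f, cores = parallel::detectCores()) {
  if (.Platform$OS.type == "windows") {
    cl <- parallel::makeCluster(cores)           # socket cluster
    on.exit(parallel::stopCluster(cl))
    parallel::clusterEvalQ(cl, library(ctmm))    # workers need ctmm loaded
    parallel::parLapplyLB(cl, x, f)              # load-balanced apply
  } else {
    parallel::mclapply(x, f, mc.cores = cores)   # fork-based on Linux/Mac
  }
}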

xhdong-umd commented 7 years ago

@chfleming It seems the code I tried for overlaying location data actually works as intended.

The plot in the paper

screen shot 2017-08-17 at 11 54 30 am

The plot of UD

screen shot 2017-08-17 at 12 00 32 pm

The plot of location data (green) with UD

screen shot 2017-08-17 at 12 00 41 pm

If I set level.UD = 1, that seems to remove the contours and gives an effect similar to the plot in the paper:

plot(selected_occurrences[[1]], level.UD = 1)
screen shot 2017-08-17 at 12 03 46 pm

Interestingly I cannot use

plot(tele_list[[1]], UD = selected_occurrences[[1]], level.UD = 1)
# Error in if (any(I)) { : missing value where TRUE/FALSE needed

Do you think these options are useful for occurrence plots:

Should we offer the plot without contours as an option? If so, how can I remove the contours in the last script?

xhdong-umd commented 7 years ago

Maybe we can just add one slider for level.UD. The contour disappears when level.UD becomes 1. Right now I don't see an obvious difference with or without the location overlay.

This is the UD plot with a level of 0.95

screen shot 2017-08-17 at 4 09 56 pm

with level of 1

screen shot 2017-08-17 at 4 10 03 pm

I don't see much advantage of the version with the location overlay compared to the two above. Actually, your paper says the location overlay interferes with the occurrence plot, so the one without locations should be preferred, which is the default behavior of plot(UD). I do see that different level values can be useful.

screen shot 2017-08-17 at 4 12 10 pm
xhdong-umd commented 7 years ago

I implemented the level slider.
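
A minimal self-contained sketch of what is meant by the level slider (widget names are made up; ud is assumed to be one occurrence UD object already in the session):

library(shiny)
library(ctmm)

ui <- fluidPage(
  sliderInput("level_ud", "Contour level", min = 0.5, max = 1, value = 0.95, step = 0.01),
  plotOutput("occurrence_plot", height = 600)
)
server <- function(input, output) {
  output$occurrence_plot <- renderPlot(plot(ud, level.UD = input$level_ud))
}
shinyApp(ui, server)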

The data in the plots below is sampled data, so they are not as clear as the full-data plots.

I also haven't hit the previous occurrence-plotting errors so far.

screen shot 2017-08-17 at 4 30 29 pm

with level = 1

screen shot 2017-08-17 at 4 30 36 pm
chfleming commented 7 years ago

@xhdong-umd units=FALSE on speed and diffusion parameters should be in SI units now.

vestlink commented 7 years ago

One nice feature to think about is the possibility of exporting the 95% (or other %) UDs as shapefiles for import into e.g. QGIS for further analyses...

Sometimes you would also be interested in multiple CLs such as 20, 50 and 85%. Maybe one could think of a way to pass something like c(0.20, 0.50, 0.95) when making the occurrence plots.

xhdong-umd commented 7 years ago

The shapefile export feature will be part of our general export/save feature, which should include everything generated in the app. That's an important but complex feature; I will start working on it later.

If everybody thinks this shapefile export is good to have now, I can also add it now.

The easiest way to input multiple CLs seems to be a text box accepting something like 20,50,95, though it will be a little more cumbersome than a slider. @chfleming what do you think about this?
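
A sketch of parsing such a text field (the input id is hypothetical), turning "20,50,95" into fractions for level.UD:

levels <- as.numeric(strsplit(input$contour_levels, ",")[[1]]) / 100
levels <- sort(levels[!is.na(levels) & levels > 0 & levels <= 1])   # drop invalid entries
plot(selected_occurrences[[1]], level.UD = levels)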

xhdong-umd commented 7 years ago

I have updated the app to use the latest ctmm with the units=FALSE change. Note that the app would report much bigger speed values with the previous ctmm version.

I also added a text field for CL:

screen shot 2017-08-18 at 10 11 10 am
vestlink commented 7 years ago

Looks good


xhdong-umd commented 7 years ago

The main hosted app is also updated since the features are mostly stable now.

chfleming commented 7 years ago

@xhdong-umd

@vestlink

xhdong-umd commented 7 years ago

I found contours can be turned off with level.UD = 1, i.e. entering 100 in the text field, so that should serve the purpose. Adding another checkbox would conflict with the level input box.

I'm not seeing any errors with occurrence now with the updated ctmm.

For exporting data, what kind of options should be provided?

Shiny only supports downloading one file per button click. I think it may be easier to have an export page with all the options and multiple outputs, and then compress all the files into a zip.
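
A rough sketch of the zip approach (object and file names are illustrative): write each piece into a temporary directory and serve a single archive with downloadHandler.

output$export_zip <- downloadHandler(
  filename = "ctmm_export.zip",
  content = function(file) {
    tmp <- file.path(tempdir(), "export")
    dir.create(tmp, showWarnings = FALSE)
    save(data_list, model_list, hrange_list, file = file.path(tmp, "session.RData"))
    png(file.path(tmp, "home_range.png")); plot(hrange_list[[1]]); dev.off()
    zip(file, files = list.files(tmp, full.names = TRUE), flags = "-j")  # flatten paths
  }
)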

I can also add plot output if that would be useful, though I'm not sure there is time to implement all of this. I think you mentioned before that some plots should be saved as PDF instead of PNG?

chfleming commented 7 years ago

I think shapefiles for the home-range contours will be the most important export support (aside from PNG images), because that's a format for people using GIS software, whereas sp and raster are for working within R.

I'm not sure if there are unintended consequences with setting level.UD=1, as that would be drawing contours around the rectangle of the raster.

xhdong-umd commented 7 years ago

If it's not conceptually right to remove the contour by setting level.UD=1, we can use this input convention:

This should be intuitive and easy to switch between different options.

xhdong-umd commented 7 years ago

For the export feature, maybe we should just do the single most important thing: exporting home-range shapefiles. The other options will need a complicated design which would probably have to be redone later.
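
A minimal sketch of that (assuming ctmm's writeShapefile export helper; selected_hrange_list is the app's list of AKDE UD objects):

dir.create("export", showWarnings = FALSE)
for (id in names(selected_hrange_list)) {
  # one shapefile per individual, 95% home-range contours by default
  writeShapefile(selected_hrange_list[[id]], folder = "export", file = id)
}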

chfleming commented 7 years ago

That sounds good for the exporting.

For the parameter estimate CIs, can you have the colors step up: light (low), medium (ML), dark (high) for each individual, if they are going to be on their own rows?

jmcalabrese commented 7 years ago

@vestlink: I agree with @chfleming about not giving an area estimate for occurrence dists. We don't want to propagate that practice, given the way that folks in the literature misinterpret Brownian bridges as home range estimates.

@xhdong-umd: I think a quick-and-dirty shapefile export for home ranges is fine for now, and then a more comprehensive and carefully designed suite of export options can come sometime after animove.

jmcalabrese commented 7 years ago

Just took a look at the updated webapp, and I think the workflow is very streamlined overall. Nice! I like the reorganization of the former variogram and model selection pages into a single page and think that works very well.

A couple of observations:

1) The default canvas height (400) for figures on the visualization page renders really small on my system. I had to set it to 1000 or so to get it to look decent.

2) I'm not comfortable with having the CIs on all parameter estimates hidden by default. I see the problem with the display getting too clunky if you show them by default, but I don't want to subtly encourage users to ignore the CIs by hiding them. I don't have a good idea about how to show them without cluttering things up too much, but will think about it.

3) The "Confidence levels of contour" box title on the occurrence page is misleading. I would say "Contour level(s) of the occurrence distribution" instead. That controls the contour on the distribution estimate, not the confidence intervals on a particular contour estimate, and I don't want users to mistakenly think they are adjusting the level of a confidence interval.

4) The option to change the displayed contour (as above in #3) would also be useful for home ranges. The same suggestion about terminology would apply.

5) In the model parameter estimates table, speed displays as "NA km/day" for models like OU where speed is undefined, which seems a bit clunky. Could you just simplify that to "NA"?

xhdong-umd commented 7 years ago

@chfleming I adjusted the level input of the occurrence plot as above.

I applied different luminance (lightness) values to the different estimate CIs, like this. I'm not sure this will work well, since the combination of lightness and color may look different depending on the actual color, and some rows may become difficult to read.

screen shot 2017-08-18 at 4 31 33 pm
xhdong-umd commented 7 years ago

@jmcalabrese

  1. I found 400 is not too small if the plot is a landscape picture, because the limiting factor is the page width, and increasing the figure height will not increase the figure size at all. With a portrait picture, 400 is often too small; originally I had the canvas fixed at 730, but that left a lot of empty space for the buffalo data.

    I can set the default to a bigger value, but 1000 could be too much, especially since we have a lot of boxes on the same page. Maybe 600 as a start? I can also set the step to 200, which should make adjusting easier. The user can always input a number directly in the box.

    @jmcalabrese are you using a high-resolution monitor? On a high-resolution display, 400 pixels could be really small. Unfortunately we can only control absolute pixels rather than something that accounts for DPI.

  2. I can change the default to always show CIs. How about we just don't color the low and high rows? That way the table should be obvious and still easy to read. I'll make a screenshot later.

  3. I'll modify the box title.

  4. I'll also look at the input for the home range plot.

  5. I tried to just return NA in the beginning, but it was quite tricky. The function I used came from the scales package, which has no special treatment for NA. I modified the function a little to recognize NA, but then the function was no longer vectorized. I felt too much effort was needed just to deal with this single point, so I moved on to the main features.

    I'll look at this again and see if it can be improved.

jmcalabrese commented 7 years ago

@xhdong-umd:

My display resolution (1920x1080) is not super high by current standards, but 400 still renders pretty small. I would say bump the default up to 600 and let's see what other users think.

For displaying the CIs, could we just do MLE (Lower CI, Upper CI) in a single column for each parameter? Then there's still some visual distinction between the MLE and the CI, and it should take up less horizontal space than having separate Lower CI and Upper CI columns for each parameter. Furthermore, each individual would then only have one row, which should be cleaner.

If fixing the NA speed thing is too much work, then don't worry about it. It's not a big deal.

xhdong-umd commented 7 years ago

@jmcalabrese @chfleming according to ?plot.telemetry

level.UD: Confidence level of Gaussian ctmm model or UD estimate contours to be displayed. I.e., level.UD=0.50 can yield the 50% core home range within the rendered contours.

(I picked some words from this to make the previous title.)

level : Confidence levels placed on the contour estimates themselves. I.e., the above 50% core home-range area can be estimated with 95% confidence via level=0.95.

I'm adjusting level.UD in plot(selected_occurrences[[1]], level.UD = c(0.5, 0.95, 0.8)) for the occurrence plot.

For the home range plot, every usage below gave me errors. Is there something wrong with the usage? I'm not sure which one I should use for the home range plot.

plot(tele_list[[2]], UD = selected_hrange_list[[2]], level.UD = c(50, 95))  # error
plot(tele_list[[2]], UD = selected_hrange_list[[2]], level = c(50, 95))     # error
plot(tele_list[[2]], UD = selected_hrange_list[[2]], level = 90)            # error
# this one works, but with warning: Outer contour extends beyond raster.
plot(tele_list[[2]], UD = selected_hrange_list[[2]], level.UD = 90)
xhdong-umd commented 7 years ago

@jmcalabrese I'm not sure I understand the MLE (Lower CI, Upper CI) in a single column for each parameter. Do you mean putting 3 values in one cell? Can you show some sample rows?

For the NA values, I realized I can actually add one extra step after the unit formatting. It should be easy to do it that way.
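
One way to do that post-processing step (a sketch, not the app's code; formatted stands for the character cells after units are attached): blank out any cell that still reads as NA.

formatted[grepl("^NA\\b", formatted)] <- ""   # "NA km/day" -> empty cell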

vestlink commented 7 years ago

@jmcalabrese and @chfleming The 95% Brownian bridge not being a valid home-range estimator is indeed news to me. I may have fallen into the common pitfall of misinterpretation, but I have been guided by very recent literature that mentions e.g. Brownian bridge methods as valid estimators. See e.g. "Is there a single best estimator? Selection of home range estimators using area-under-the-curve" by W. David Walter, Dave P. Onorato and Justin W. Fischer.

Would their inclusion of BBMM and dBBMM as home-range estimators be wrong in light of the above?

Should this be discussed other places than here?

I can see, though, why you would want to get information on the home range as well as the occurrence distribution and combine them when analysing e.g. selection, where the home range would be the available resource and the occurrence distribution the use.

chfleming commented 7 years ago

@vestlink Yes, the lumping of the BB with home-range estimators by Walter et al. (and pretty much everyone else in the literature) is incorrect. KDE/AKDE and BB/CRAWL estimate completely different distributions. We pointed this out as an ancillary fact in the first appendix of the AKDE paper and in the intro of the Kriging paper.

Justin is leading a paper specifically focused on this distinction that will use real data examples to show how divergent these two classes of estimators can be, and why they shouldn't be interchanged.

chfleming commented 7 years ago

@xhdong-umd

xhdong-umd commented 7 years ago
vestlink commented 7 years ago

@chfleming Thank you for your clarification. The paper will indeed be appreciated. But I think this is to be expected in a fast-paced and relatively new field like animal movement ecology.

xhdong-umd commented 7 years ago

@chfleming Having low to high span 3 columns would need 12 columns for area, tau position, tau velocity and speed, though this is something on which we have to make a compromise.

The previous multi-row layout is not ideal because it's kind of difficult to separate the rows. Putting them into columns makes each model one row, which serves the purpose of the table much better: comparing models, selecting models, etc.

I thought about this format, but I'm not sure it's optimal: put the 3 values in one cell, separated by some separator. The units are not added yet.

screen shot 2017-08-21 at 8 49 21 am

However, the columns cannot be sorted this way.

Maybe the best option is still to put all values in columns, even if that means the user often needs to maximize the app and use the horizontal scroll bar. I think the app needs maximum screen real estate for most of its tasks anyway.

UPDATE: I just realized @jmcalabrese 's suggestion is like this:

screen shot 2017-08-21 at 4 15 07 pm

It's conceptually clearer than the above, and should take less space than putting all 3 values in separate columns. However, I found the main determining factor for column width is the header, so tau position (CI low, CI high) is going to take a lot of space (even more than the 3-values-in-one-cell version above).

xhdong-umd commented 7 years ago

This is a table without alternating row background (striping), with the low and high rows in gray. I'm not really satisfied with this format since it's still too overwhelming. There is some duplicated information in the low/high rows. I also have to use a color_target row which is redundant in information but needed for coloring.

screen shot 2017-08-21 at 9 24 28 am
xhdong-umd commented 7 years ago

@chfleming I tried this but there are still some problems.


# plotting locations with the UD: multiple level.UD values give errors
plot(tele_list[[2]], UD = selected_hrange_list[[2]], level.UD = c(0.50, 0.95))
# Errors/warnings:
#   the condition has length > 1 and only the first element will be used
#   longer object length is not a multiple of shorter object length
#   number of items to replace is not a multiple of replacement length
#   the condition has length > 1 and only the first element will be used

# plotting the UD itself with multiple level.UD values gives warnings, but the plot is generated
plot(selected_hrange_list[[2]], level.UD = c(0.50, 0.95))
# Warnings:
#   longer object length is not a multiple of shorter object length
#   number of items to replace is not a multiple of replacement length
xhdong-umd commented 7 years ago

I changed the default canvas height to 600 and updated the title of the occurrence plot level input.

I also replaced the NA values in the model summary table with empty cells. I think it should be less cluttered compared to an explicit NA value.

screen shot 2017-08-21 at 12 24 06 pm

I'm still not satisfied with the model summary table. I'll look at the "3 values in one cell" idea.

vestlink commented 7 years ago

A digression: have you noticed that when you try to sort selected outliers, sorting is not consistent?

distance_center does not sort correctly as far as I can see

distance to center
vestlink commented 7 years ago

It looks like sorting is independent of animal ID for all but the distance_center

xhdong-umd commented 7 years ago

@vestlink I'm not sure what you mean. The user can sort the table by clicking the arrow in each column. There are reasonable arguments for several sorting choices, so I'm expecting users to sort for themselves. And the screenshot shown is in the speed tab, which picks points by speed.

The sorting is still not perfect since the values are text with units instead of plain numbers, so 9.9 will sort as larger than 10. The only way I can think of to fix this is to have the values and units in separate columns.

xhdong-umd commented 7 years ago

I added a column of SI values so it can be used for proper sorting.

screen shot 2017-08-21 at 2 04 59 pm
xhdong-umd commented 7 years ago

@chfleming Here is reproducible code with data for the warnings from the home range plot with CI levels.