high level design of model page, Variograms

xhdong-umd commented 7 years ago

We're still finishing some small details on speed definition of outlier detection, but the app itself is working well now. We can update the speed definition anytime without much change on app.

Now I think it's time to start discuss on the model page. There could be at least some new points need to be considered on top of previous design:

what kind of workflow do we want? This is a top question because the dashboard structure is mainly linear. If we want to go back and forth in some steps then we need design very carefully.
the data may have one or more individuals. How do we deal with this in modeling, reporting etc?
do we need make plots in ggplot too? If the effort needed to make all kinds plot in ggplot is manageable, then it should worth the time. The plots in app will be more consistent, and we should have more control in all aspects of the plot, and can get higher quality plot outputs.

There are lots of detailed issues recorded in the project page, though the above 3 questions are more important.

vestlink commented 7 years ago

My suggestion is to keep plots to a minimum and only if they help guiding the user. I would stick to one plotting solutiong that best suits the app and user and not give the user too many options. It will just be confusing.

vestlink commented 7 years ago

I assume you have seen this https://github.com/daattali/shinyjs

xhdong-umd commented 7 years ago

@vestlink Yes I used shinyjs in my first RStudio addin. I actually try to avoid using it if not must, because I found sometimes it caused some problem.

I'm not sure about its usage in this topic, do you mean disable some elements with shinyjs when they are not needed?

As for app usage, I expect the final app to have some tooltips or hints, and user should go through some brief tutorials. There are some assumptions need to be conveyed to users. Once user knew about them, they should found them reasonable and easy to remember.

We are still focusing on basic features and design so the help part is not in priority now.

xhdong-umd commented 7 years ago

@jmcalabrese I saw you are not watching the project so maybe you didn't get notification unless was mentioned. What's your plan about the models page?

chfleming commented 7 years ago

The basic workflow of the package for model fitting is ctmm.guess() to generate a GUESS object (which has the same format as a ctmm fit object, but is used as the initial guess for numerical optimization), then ctmm.select() and then to look back at the results with a plot of the variogram and selected model.

With multiple animals, I have had many users do GUESS <- ctmm.guess(...,interactive=FALSE) to avoid hands on interaction with the (variogram.fit() called by ctmm.guess()) sliders.

As a blanket work flow, I would suggest

calculating the variogram()s (and at a later stage of development, expose the dt option to users)
calculate the automated guesses from ctmm.guess(...,interactive=FALSE). The only user choice here ATM is whether or not to turn on errors with error=TRUE. This is necessary with duplicate/near-duplicate times, but excessive for coarse data.
Display the plots of the empirical variograms and initial guesses. Allow users to either move on to the model selection step or individually tweak the initial guesses as per variogram.fit().
select the models and plot variograms of the data (empirical variogram) and selected model.
give the users the option to refit the data (using the new best fit instead of GUESS) to check convergence from optim.

xhdong-umd commented 7 years ago

I tried to write some functions for ggplot version plots.

For location plot of telemetry object, we can have a function like gplot(obj, subset = "all"), which will draw a plot similar to the first overview plot in web app. The subset parameter can be a vector of identity index so you can choose to draw animal 3,4 with all data as background like gplot(obj, c(3,4)), and the color will be consistent with web app.

This is doable but I'm not sure if it's useful. If you want more control of plot like point size, you will need more parameters which could be cumbersome. One advantage over basic plot is that you can launch a shiny app with your data for a zoomable plot like this:

# draw plot directly
gplot(buffalo, c(2,3))
# save ggplot object then launch a shiny app to zoom on it
g <- gplot(buffalo, c(2,3))
gg_zoom(g)
# another way of zoom using pipe
gplot(buffalo, c(2,3)) %>% gg_zoom

For plots used in models, like variograms, I'm inclined to just use current plots. To implement them in ggplot seemed to need a lot of efforts, and there is not obvious advantages.

xhdong-umd commented 7 years ago

For variograms we may want to plot several variograms with different fraction values together, like a facet plot. It seemed that base plots need par to arrange plots in same page.

I used gridExtra for similar task on ggplot2 previously, and there is a gridBase package that place base plots in grid viewports, which is more flexible and powerful in arrangement. However there are still some limitations and mixing two plots system can have more difficulties, for example

it may break once the window resized
the output may not be saved

So it seemed that there are two approaches:

use base plots. This doesn't need any change on existing plots in modeling, though I'm not familiar with the base graphics system.
implement plots in modeling with ggplot2. It should be more powerful and flexible, but the efforts needed could be quite substantial, as I saw there are a lot of plots for different cases. And ggplot2 definitely take some time to learn.

@chfleming , @jmcalabrese Do you have any plan about implementing plots in modeling with ggplot2? I think for current stage maybe we just use existing plots and make the app complete first, but in the future should we consider about this?

chfleming commented 7 years ago

I think base plot is perfectly fine for now.

If a zoom slider is going to be used across multiple individuals, I think instead of adjusting the fraction of the plot (which varies in scale by individual), it should adjust the range of the x-axis (lags), so to be universal across individuals. If you need code to do that on a logarithmic scale, just let me know. It would basically be like the current code, but using the smallest and largest (most extreme) non-zero lags of the population. With the x-axis fixed, using the same scale of y-axis across a population of the same species would then be possible/acceptable.

jmcalabrese commented 7 years ago

I agree with Chris that we should stick with base plots for now, and then once the workflow has been tested and refined, possibly add in ggplot2 graphics where appropriate.

I also agree that a facet-like plot where you can simultaneously zoom in/out on all the variograms that are displayed would be useful.

I can see advantages and disadvantages for both zooming by fraction, and for zooming by lag. E.g., sometimes I just want to quickly zoom into the short-lag portion of all variograms, and zooming by fraction is convenient for that. Other times, I want to look at all variograms simultaneously up to a given lag, and zooming by lag is great for that. If it's not too complicated to implement, perhaps a toggle switch that would let the user decide whether to zoom by fraction or zoom by lag would be the way to go?

jmcalabrese commented 7 years ago

After viewing the automated guesses against the empirical variograms, it would be cool if the user could click on a particular variogram where they were interested in fine-tuning the guess, and then that particular variogram would open in a new window or box with sliders like in variogram.fit(). Not sure if that is possible.

It might also be useful to have a tick box for each panel to turn on error in the GUESS object.

xhdong-umd commented 7 years ago

Previously we had an idea of generate multiple "snapshots" of variograms at different fraction values for same individual (kind of facet plot), because you can check all values with a zoom slider, but multiple snapshots at typical values can provide an overview in a glance. So the plan was to have multiple static variograms at different values plus one with zoom slider. At that time we haven't consider the possibility of processing multiple individuals at the same time.

Do we still need that "facet plots" of variogram for same individual?

jmcalabrese commented 7 years ago

I was talking about the case where you are simultaneously processing multiple individuals. If you are only working with one individual, then there would only be one panel in the plot, but you would still have the ability to zoom in/out with a slider.

I think having an overall zoom slider (whether by fraction or by lag) would scale much better to multiple individuals than having multiple static snapshots for each individual.

chfleming commented 7 years ago

I agree, especially considering that knowing the proper scale to zoom in on requires guessing at the model parameters... which is the purpose of checking this.

xhdong-umd commented 7 years ago

Yes, I just want to make sure that we don't want static snapshots anymore. So we will deal with multiple individuals at the same time by default.

xhdong-umd commented 7 years ago

@chfleming you mentioned about code that adjust the range of the x-axis (lags) in log scale. Can you put them here? We can use a range slider for this

vestlink commented 7 years ago

i see the outlier issue is closed. however, I have a question. Suppose you want to use the outlier and want to "pull" the outlier in. Maybe you would think that it would be ok to place the point between the point before and after. Would that make sense?

xhdong-umd commented 7 years ago

@vestlink You can still comment in closed issue, or even reopen it if needed. For me "close an issue" just means it's no longer the active task. The outlier issue can be quite complicated, and I'm sure we could revisit it later when we met more data and more user cases.

I copied your question to that issue and replied there. I also asked @chfleming 's opinion on this.

xhdong-umd commented 7 years ago

@chfleming What's the desired range of fraction if using a log slider? i.e. the min, max, default value of fraction.

The plot above is just showing the idea. There are several issues need to be addressed:

ideally the figure size need to adjust based on individual count. What kind of aspect ratio do you want for individual variogram plot?
the y axis is not fixed, Gabs is having different unit
with slider label digits set as 2, the left end 0.001 value is not shown completely. If I use 3 digits, the slider is like this. Is this OK?

chfleming commented 7 years ago

@xhdong-umd Instead of having 1 (relative) fraction for the entire group, I would target one (absolute) zoomed in scale for the entire group. Taking the code behind variogram.fit, with a list of variograms SVFS there would be some overarching code

# maximum lag over all variograms
max.lag <- sapply(SVFS, function(v){ last(v$lag) } )
max.lag <- max(max.lag)

# minimum lag>0 over all variograms
min.lag <- sapply(SVFS, function(v){ v$lag[2] } )
min.lag <- min(min.lag)

b <- 4 # arbitrary constant
min.step <- 10*min.lag/max.lag

and then for each individual in the list, the manipulate plot (I know this isn't Shiny code) would would look like

manipulate::manipulate( { plot.variogram(SVF[[i]][SVF[[i]]$lag<=b^(z-1)*max.lag,], fraction=1, ...) }, z=manipulate::slider(1+log(min.step,b),1,initial=1+log(1/2,b),label="zoom") )

subsetting each variogram by the same (exp/log) scale b^(z-1)*max.lag and plotting the whole (fraction=1) subset.

xhdong-umd commented 7 years ago

@chfleming Because @jmcalabrese mentioned that we could have both fraction and lag for zoom, I plan to have two tabs for them (they will be using different value and range, so reusing same slider with a switch will not work).

chfleming commented 7 years ago

@xhdong-umd The poorly named min.step is the minimum fraction, 1 is the maximum fraction and 1/2 is the default for the fraction, and then I just transformed that with the transformation between the slider variable and the fraction fraction=b^(z-1).

xhdong-umd commented 7 years ago

I just updated the repo. The variograms with zoom for fraction is like this:

You can adjust the figure height and columns in panel. I will implement the zoom for lag in next step.

By the way, I think we need to fix the y axis unit for all plots in same page, right?

chfleming commented 7 years ago

I definitely don't think its necessary to fix the y-axis necessary on this tab with the fraction zoom, because the x-axis is is not fixed here. On the other tab with the lag zoom, that might be nice. I will make sure the xlim & ylim arguments behave correctly in plot.variogram.

vestlink commented 7 years ago

@xhdong-umd look at this solution (if you're not already familiar with it). https://www.exploratory.io/ i came across it a few days ago. They have done a lot of things right. I know that it is not a shiny app, but the work flow is nice.

xhdong-umd commented 7 years ago

@chfleming I was wondering if we need to fix y-axis unit. This plot have 2 different units hm2 and km2.

@vestlink I didn't play with the exploratory software because it need at least a free trial to download. From the demo video it seemed to be html5/javascript on top of R. With html5/javascript as front end you can do anything you want in UI, so it's definitely flexible and powerful. Unfortunately that almost means implement Shiny (at least the UI part, I'm not sure whether they used Shiny or implemented R backend by themselves too), which need substantial resources in front end development and backend development.

I appreciate their UI design, but please allow me to cherrypicking some points. The demo video looks really nice, but the reality is probably not that smooth.

It give you an appearance of you can scrap any website by just pasting an URL, which is just illusion -- web scraping must be heavily customized for specific website, and keep updating to match any breaking change in that website, how can that be automatic and get clean result?
The data mangling pipleline looks like you can just shuffle the steps and apply to new data. This actually require the user to know R, dplyr, tidy data, how to convert data between wide and long etc. You cannot just arrange the dplyr pipeline arbitrarily and expect it to work. Then if you are fluent with this, using menu to pick commands and form to fill out parameters are actually cumbersome.

Basically I think it wrapped some simple commands in a nice UI, but that didn't lower the barrier of analysis that much.

Another powerful UI I liked is H2O flow, which has a flexible interface for H2O API. The UI is also built upon Javascript so they can do anything possible.

I think the things we did in this web app actually doesn't fit the exploratory workflow. Too many things are invented by ourselves for our own needs, which is quite different from the simple pipeline. I do often felt limited by Shiny, and really want something more flexible than the linear menu items, but we don't have resource to create html5/Javascript UI, at least not now.

NoonanM commented 7 years ago

@xhdong-umd I think it would be best to have consistent units across all plots. For some datasets, I've found ctmm returns different units between individuals, depending on inter-individual differences in ranging behaviour. That can make visual diagnostics challenging. Maybe the best option is to convert all plots to the most common unit across all individuals in the dataset?

xhdong-umd commented 7 years ago

The unit picking depend on the range of data. So there are two approaches:

use the most common unit, though the plot above actually have 3 in km2 and 3 in hm2.
pick unit from the biggest range - or smallest range, depend on if you want 5, 10, 1000 or 0.2, 0.01, 1.

NoonanM commented 7 years ago

Another approach might be to include two options: i) pretty units (the default output, which has the potential for differences between plots), and ii) SI units (m^2) for inter-plot comparisons.

xhdong-umd commented 7 years ago

I think the units need to be consistent. We can have a radio button to switch between SI units and a pretty units either in biggest range or smallest range, as SI units view can be useful sometimes.

xhdong-umd commented 7 years ago

I just updated the repo with the variograms zoomed by absolute lag.

I think it's safe to just use 0 as min.lag, right?

In the other hand, we could use a range slider that can define both left and right ends of a range of lag, but that seemed to unnecessary because I assume the part start from 0 is always wanted.

I noticed that the plot by lag is significantly slower than plot by fraction. It seemed that subsetting data before plot is much faster than plot then set xlim. We can actually implement zoom by lag with subsetting data too.

xhdong-umd commented 7 years ago

I looked the source code of plot.variogram about the units of y axis. With ggplot2 you can change the plot unit with a scale function, but with base plot we may have to change plot.variogram and add a parameter to provide scale, like this

plot.variogram <- function(x, CTMM=NULL, level=0.95, fraction=0.5, col="black", col.CTMM="red", xlim=NULL, ylim=NULL, 
SVF.scale=NULL, lag.scale=NULL,...)

if (is.null(SVF.scale)) {
  SVF.scale <- unit(max.SVF,"area")
}

if (is.null(lag.scale)) {
  lag.scale <- unit(max.lag,"time",2)
}

@chfleming If you feel we do need consistent scale of y axis, should I create a pull request on this change, or you want to do it by yourself?

xhdong-umd commented 7 years ago

I found my edit above on plot.variogram is not ideal. To calculate the unit outside the plot function, I still need to use the code similar to the part inside the plot function. It'll be better if we abstract the scale calculation into a function, then we can use that function to calculate the unit parameter.

I'm not sure if we should make so many changes just for adjusting units though. That being said, if we want to have an option to use SI units, we will also need some change to the plot function.

chfleming commented 7 years ago

On the master branch I've edited plot.variogram to

respect xlim & ylim arguments
subset the data according to fraction || xlim before plotting for speedup

It is correct that for variograms, xlim and ylim should both start with 0 for normal use.

I will now make an extent function for a list of variograms, and using that you can fix xlim & ylim across individuals easily.

xhdong-umd commented 7 years ago

Sorry I didn't think through about the subset for xlim. If some individual didn't have data in all the time range (for example only have data in 3 months while the xlim range is 4 months), this may result its plot having a shorter xlim, right?

chfleming commented 7 years ago

I have pushed a variogram extent method that is documented. You can now subset what you want to plot, feed those variograms (in a list) to extent, and then it will give you back xlim and ylim values just like with the telemetry and UD extent methods, so that you can make tables of plots with the same scale that is chosen to be appropriate.

Also notice the threshold argument that limits the upper CI from blowing up the plot. My choice of default at 2 times the maximum semi-variance is totally arbitrary.

jmcalabrese commented 7 years ago

I agree with @NoonanM that consistent units should be maintained across different panels, and that there should be a choice between pretty or SI units. When the choice is pretty units, I think whatever pretty units the unit function returns for the majority of panels should be imposed on the other panels.

xhdong-umd commented 7 years ago

@chfleming Thanks for the update, the idea of using extent is great, with simplified and consistent usage. I think the slowness I talked must came from the mechanism of base plot updating xlim after the plot has been generated, so it's still the same speed even if you subset the data already. The fraction plot just draw each figure with subset data, which is quicker because there is no axis redraw from setting xlim.

I updated the repo, which used the new extent function so you will need the latest ctmm github version to run it.

Right now I just make all y axis to the max range in all individuals. This make the y axis comparable and with same unit.

Compare to the screenshot in previous comments which have the y axis in different scale (even if the unit is same, the scale of y can be different). Do you think this approach is desirable? Should we use same y axis for the zoom by fraction plot too?

So if we decided to use same y axis across individuals, the unit will be consistent. However to switch it will need more change to the plot function, add units as parameters.

NoonanM commented 7 years ago

Just as a side thought, for first time users, or those unfamiliar with variogram analysis, it might be worth re-naming the tab 'Zoom by time-lag".

chfleming commented 7 years ago

I like how this looks for zoom by lag. I'm not a fan of zooming by fraction on multiple individuals, so I don't have a preference on what that should look like--@jmcalabrese

xhdong-umd commented 7 years ago

@NoonanM I think user may need a hint or help page to understand the meaning of x,y axes. But for the zooming itself, it's based on a percentage of time lag, so I'm hesitating to add too many information in the title.

Now I looked at the plots again, the two modes are actually both based on percentage of x axis. The original idea was using some absolute lag value, but in implementation the absolute value will change by data so I used a percentage slider.

Now the zoom by lag is the percentage of max x axis range applied to all individuals, zoom by fraction is the percentage of x axis for each individual itself. So they actually mean zoom in same time lag range and zoom in relative percentage of time lag range.

How about naming the tab as

absolute zoom
relative zoom

and put label for slider as

fraction of max overall time lag
fraction of time lag on each figure

I wanted to use percentage but don't want to add % to slider labels, which make it difficult to interpret for the smaller 0.001 value in second plot.

xhdong-umd commented 7 years ago

Since the two plots are very similar, and the slider value is both a percentage/fraction, we can actually combine them into one page, use a radio button to switch between absolute zoom and relative zoom.

I updated the repo, it's now like this:

I'm not sure if the slider range is optimal though. Do you want a regular instead of log scale slider for the absolute lag plot?

xhdong-umd commented 7 years ago

Summary of questions:

any comment on naming of the plot, slider labels etc
should we use same y axis range for relative fraction plot?
is the log scale slider for absolute lag range optimal? It's easier to use same slider for both plots. It's a little bit more complex but doable to change the slider dynamically based on the radio button, if you feel the need to use different slider range and scale.
to add a switch for SI units, we may need to add unit parameter to plot.variogram function.

xhdong-umd commented 7 years ago

I updated the repo, now with a checkbox to overlay the fit from ctmm.guess on plot. Should we use a bigger default figure height value?

xhdong-umd commented 7 years ago

To switch on error, do we have to fit model once without error, then enable error and fit 2nd time?

# default model guess
GUESS <- ctmm.guess(DATA,interactive=FALSE)
# first fit without telemetry error
FITS <- list()
FITS$NOERR <- ctmm.fit(DATA,GUESS)
# second fit based on first with telemetry error
GUESS <- FITS$NOERR
GUESS$error <- TRUE
FITS$ERROR <- ctmm.fit(DATA,GUESS)
# model improvement
summary(FITS)

chfleming commented 7 years ago

The logarithmic scale is important to be able to work down to the short lags without excessive effort.
On the "Absolute range", it looks like the vertical extent isn't being recalculated when you zoom in.
If the data don't have tiny diff(t) then fitting first without error and second with error helps with speed and convergence of fitting. It is not necessary if the optimizer is working correctly. I would not worry too much about it, because the new optimizer that I have written is working very well and will not require this.

xhdong-umd commented 7 years ago

The y axis is updating with zoom now. I have to subset the data by xlim, then use extent to determine proper xlim ylim.

I have updated the repo. Note the "adjust guess parameters" has not been implemented yet.

xhdong-umd commented 7 years ago

I'm working on implementing the manual fit interface in Shiny. There are lots of coupled relations on units, model parameters so I want to be very careful.

@chfleming The variables sigma tau1 etc inside the manipulate::manipulate() call refer to the slider value, even there are variables of same name outside the call (because R CHECK CRAN insisted on this?), right?

This line seems changed tau1 slider value and tau1 variable by CTMM$range. Does that mean the tau1 slider can be removed if CTMM$range is FALSE? Similar thing happened on line 1020 too.

If I don't have the requirement to satisfy R CHECK CRAN, can I remove the declaration of these variables safely?

z <- NULL
tau1 <- 1
tau2 <- 0

chfleming commented 7 years ago

@xhdong-umd Yes, all of those initial (useless?) declarations were to satisfy CRAN.

For your considerations, you can assume range=TRUE everywhere. What range=FALSE does is actually fix the first tau to Inf. But you can't do home-range estimation with those models and so I don't think they are worth including in the app.

Also, you can consider CPF=FALSE. I need to remove all of that code.

To summarize the code:

variogram.guess() does some simple calculations with the variograms and outputs initial parameter estimates.
The sliders then allow users to adjust these estimates (and activate error) in case varigoram.guess() did a bad job. Unit conversions are in place to make the slider values meaningful.
Manipulate doesn't allow me to change the slider limits on the fly, so if you run out of space, you can save the output and re-run with that output as the starting value. I don't know what options are available for you.

For future models, there is also some code in ctmm.guess() that estimates some other parameters from the data. Don't worry about that for now.

xhdong-umd commented 7 years ago

I have copied your code and modified it to Shiny version, now the slider is initialized correctly, I just need to made correct conversion for the slider value and feed them to the CTMM object.

By "change the slider limits on the fly", you mean after the initialization, you may still want to change them based on some user input (one slider limit depend on other input)? With Shiny you can change the value, min max etc when needed. I'm not sure when and how to change it though.

chfleming commented 7 years ago

Sometimes variogram.guess is so far off that the default slider max is insufficient. This is usually because of substantial telemetry error, which complicates the initial shape of the variogram.

It would be nice to either:

Be able to run a parameter to the end of a slider and have the slider max double automatically (perhaps on release of the mouse click).
Be able to click a button and have all slider maxima reset to twice the current parameter values.

whichever is easier to implement.

ctmm-initiative / ctmmweb

high level design of model page, Variograms #17