Outlier detection - Githubissues

chfleming commented 7 years ago

Between the data importing step and the time subsetting step there eventually needs to be an optional outlier detection step, where users can remove gross outliers and re-project the data if necessary.

There are two obvious ways to facilitate outlier detection. One is to sort the data by distance from median and allow the user to flag the most extreme locations---perhaps by coloring them in red on a grey scatter plot.

Another method is to sort the data by speed estimates. You have the mid-point speed estimates: sqrt(diff(x)^2+diff(y)^2)/diff(t), which you can sort and allow users to flag extreme movements that leave the surrounding data and come back. In this case you would want to color the bulk of the data in grey, a subset of the data containing the fast time in one color (so that the user can see continuity in the data if it exists), and the 1-2 times adjacent to the fast mid-point in another color.

With both methods, you would want to allow the user to iterate from the most extreme times, flagging individual times that look bad. Then at the end there could be an optional button to re-project the data, which can be necessary if the outliers are on the wrong side of the globe.

A good set of data to practice this on is Paul Cross' older buffalo dataset on Movebank, because it has both kinds of outliers.

xhdong-umd commented 7 years ago

Is the dataset 1 below the one you mentioned?

dataset 1: Study - Kruger Buffalo, VHF Herd Tracking, South Africa Movebank ID 1760349 2000 - 2006

dataset 2: Study - Kruger African Buffalo, GPS tracking, South Africa Movebank ID 1764627 2005 - 2006

For the arrangement of UI, I'm thinking to add two tabs in data page, animal location plots, between 3 and 4, named as "outliers in location", "outliers in speed".

For selecting the outliers, will zoom + brush (use mouse to drag a rectangle) in plot be enough? With every selection, click a button to add them to list, and the selected points will have a different color.

For speed outliers, we can use color gradient groups for different speed ranges, like the time subsetting page. We can also map the speed to some characteristics of points, like size of a circle. Though that may not suit for dense points patterns. Color gradient groups should be enough.

We may also need a log scale for speed ranges since the outliers could be much bigger than the normal speeds.

These speed are between the points. So when we coloring the points in plot, for any point i, we actually have 2 speed of i-1 to i and i to i+1. Maybe we should just assign the speed i to i+1 to point i? The ending point will have no speed value.

If the points are not too dense, an interesting visualization will be drawing arrows from points to represent the move direction and speed. I'm not sure if it's easy to do though.

chfleming commented 7 years ago

You want the GPS data, not the VHF data. It will look like the clean buffalo data, but with many duplicate entries and half a dozen outliers.

I would assign the average speed to the in-between time location and then the lower and upper speeds to the first and last times.

xhdong-umd commented 7 years ago

If using average speed, we actually smoothed the speeds with a sliding window. Will this make the outlier less obvious?

chfleming commented 7 years ago

A sliding window average will make the outlier less obvious.

xhdong-umd commented 7 years ago

This task could need a lot of space so it will need a separate page. I'm thinking add a sub menu under Import Data, and add a button in Import Data page. So user can skip this step if it is not needed, and he can also take this extra step, then the Visualization page will take the filtered dataset.

xhdong-umd commented 7 years ago

It seemed that when you added sub menu in Shiny dashboard, clicking the parent menu will only fold/unfold the submenus instead of opening some page. So I have to make menu like this:

Data
- Import Data
- Filter Outliers

And the menu is not expanded by default. Currently Shiny Dashboard doesn't support expanding menu programmatically unless I use some javascript.

xhdong-umd commented 7 years ago

I think the outlier should be defined for each animal separately, so the detection procedure should start on individual animal. We better have an overview mode to let user know which animal have outliers to pick in a glance. There are actually quite some similarities with current plots, like the histogram facet in time could be adapted to distance histogram facet, and the time subsetting page format could be used on selecting outliers.

I would like to incorporate this into current UI in a clean and logical way, also reuse current code if possible. This will need more considerations and experimentations.

chfleming commented 7 years ago

Yeah, outlier detection should definitely be per animal.
Later on when we have maps overlays implemented, that will help for animals near water/land boundaries.
Also, some way to visualize the HDOP/HERE column (if imported) could be helpful as high HDOP locations should be allowed to be a little more erroneous.

Skip these last two points for now though.

xhdong-umd commented 7 years ago

OK. I added the last two points to a note in project plots so we can revisit them later.

xhdong-umd commented 7 years ago

I'm wondering if we need a cleaning process for obvious mistakes in data.

In calculation of speed, the old buffalo data have 3 rows of diff_t as 0, thus get speed as Inf.

We can

remove the rows in all analysis as a data cleaning step. There are also some common errors can be cleaned. Though sometimes the rows may only have mistake in one column, the other columns are still valid. This approach could be too aggressive.
just ignore the rows in speed outlier detection. However this actually excluded them from speed outlier detection. It's possible that they are actually speed outliers but we cannot know the speed unless we interpolate the time according to their previous and next rows. I think interpolate actually should work in many cases, but it also made some assumptions that difficult to verify.

In any case, the problematic rows should be few and removing them should not be a big problem. I just want to discuss the cleaning issue in general. Do we have a general approach for these cleaning problems?

xhdong-umd commented 7 years ago

Another thing I want to make sure about it:

The data returned by as.telemetry should have all records sorted by timestamp for each animal, right? I'm assuming they are always sorted, and I don't want to sort them in app. That could create some differences between my data frame version of data and the ctmm object version of data.

chfleming commented 7 years ago

Right now I remove duplicate rows of the data.frame, but not duplicate times. Duplicate times are actually okay with an error model (but not okay without an error model), as they can give you more information about the error model parameters and the location at around that time. But removing duplicate (or nearly duplicate) times can also be a useful option for people that have otherwise regular & spread out times and don't have any other incentive to use an error model except to handle the duplicate times.

So I would include duplicate times in the regular cleaning process, but perhaps with an optional button/checkbox to discard all duplicate times. We can insert some interactive documentation regarding the interplay of this choice and error modeling at a later stage.

The timestamps should come out ordered. There is also a lot of code (now and future) that has to interpret different named/formatted columns. Error, in particular, can take 4 different formats (2 now supported) and have a dozen different names.

I don't follow your comment on interpolation.

xhdong-umd commented 7 years ago

By interpolation I mean the rows with 0 diff_x is having same timestamp with next row, but we can take the middle point of time of previous and next row as an estimate for this row. If the error row is few in a lot of normal rows, this should not introduce much error in speed estimate. I was thinking the rare cases of the duplicate time rows happen to be a speed outlier, and removing them will exclude them from detection. From this aspect interpolate can be useful.

For the speed outlier detection, if we don't do interpolation, the row with duplicate times cannot have a speed so they have to be excluded.

So the user option of removing some rows should be put in the modeling stage depend on the models selected, right? I'll add the feature in model project.

chfleming commented 7 years ago

Users should have the option to remove duplicated times, both individually and as a group. Whether this happens in the same stage as the speed filter is another question. The speed filter will prioritize high speed times, and duplicate times would result in an infinite distance/time speed estimate, so it seems like it would work to me.

xhdong-umd commented 7 years ago

ggplot2 will remove the infinite speed values in speed histogram by default, otherwise the axis will be too stretched. So those points cannot be selected by the histogram mouse selection operation. I think we can just mark them as outliers automatically, but notify users these cases, maybe include them in the excluded points list by default.

xhdong-umd commented 7 years ago

Here are some plots I made, and I'd like to hear some comments before I start to implement them in app @chfleming @jmcalabrese

distance to median center histogram

This facet histogram will be used to determine which animal need to have outlier filtered out. Note I limited the y axis otherwise the few outlier can be too small to be visible compared to the majority group. Basically all count above 20 are not shown in the histogram.

loc_his_bins

distance scatter plot

This is just to show the idea. With the histogram facet, we actually don't need scatter plot in facet, which took too much space. I plan to only draw scatter plot for the individual animal after user chose which animal to analysis. I'll also add zoom feature, with point size, alpha value adjustable in app. It should also use the color groups similar to the time subsetting page.

The blue point is the median center. loc_scatter_facet

speed histogram

speed_his_facet

speed scatter plot

Note some outliers are not obvious with overlap. In the app we could highlight only the high speed group and make them more obvious.

speed_scatter_facet

speed arrows

I explored if arrows for speed could be helpful. This may be helpful if the plot is zoomed in, with arrow size adjustable. We can make it an option.

Right now the arrow head size is fixed. I'm trying to make it match the speed value.

speed_arrows_zoom

chfleming commented 7 years ago

My experience is that for the speed plots it can be useful to make a plot per outlier that is zoomed in and focused on a short segment of times centered on the outlier. The buffalo are in a regime where they can potentially sprint across their home range on the scale of the sampling interval, but for other species, speed outliers can be less obvious.

Also, I think it would be nice in both cases to be able to work on a sorted array of times from most extreme to least extreme, with the current time of interest highlighted/featured in the plot.

The speed plots will definitely need units.

xhdong-umd commented 7 years ago

I'll add units to all plots.

For zoom in and highlight on time range, it was the original plan to select speed range in speed histogram, then highlight the selected points and their time neighborhoods. Current plots are static so I didn't implement that yet.

I think if the scatter plot support zoom with mouse, then we don't need to zoom in the plot for the time range automatically, right? It will also give user an overview about where are these points located. User can zoom in with mouse later.

By "a sorted array of times from most extreme to least extreme", what's the exact definition of "array of time"? Right now the histogram is based on distance/speed count, do you want a plot that have time as x axis, distance/speed as y axis? Or you mean for each single point of outlier, define a time range around it, then put into a sorted table?

We can have a sorted table for distance/speed values, then selection on each point can highlight itself and its neighbors in time in scatter plot. Is this what you want?

chfleming commented 7 years ago

Definitely for the speed filter and probably also for the location filter, I think there should be a numeric input widget where INTPUT runs from 1 to the length(DATA$t) with the up/down arrows incrementing by 1, where an input of 1 corresponds to the most extreme time and an input of length(DATA$t) corresponds to the least extreme time. Specifically, if you take the array of distance from median or instantaneous speeds, DIST, then the input would be fed into SORT<-sort(DIST,method="quick",decreasing=TRUE,index.return=TRUE)$ix so that SORT[1] is the index of the most extreme time DATA$t[SORT[1]]. This location would then be highlighted/featured on the plot for the user to assess.

Specifically for the speed filter, I think the plot also needs to consider a few times prior to and after DATA$t[SORT[INPUT]], color this subset of times, highlight the INPUT time, and grey out all other times not in the subset. Zooming, automated zooming, and arrows can also be considered. The issue here with the speed filter, is that with the other data around it can be very hard to see speed outliers that remain within the bulk of the data.

chfleming commented 7 years ago

Also, I think the outlier filtering should be one animal at a time. @jmcalabrese might have an opinion on this too.

xhdong-umd commented 7 years ago

The histogram facet is to give user an overview and hint which animal may need outlier filtering. Then user choose one animal to start the real filtering process, which will be quite similar to the time subsetting page, with a histogram linked with a scatter plot.

Shiny doesn't have a input control with up/down arrows. We can have a sortable table like this, clicking one row will highlight the point.

timestamp	distance
2005/01/01	1000
2005/02/03	2000

Though I'm not sure if that is efficient when there are thousands of points. User can select a range in the histogram and highlight the point. That range is not by single point which can be pro or con based on your user case.

For the speed outlier, my plan was to highlight selected time in a color, neighborhood points in another color, and all other points gray. If adding arrows, I'll also only add arrows to these selected subsets.

I probably didn't explain my plan clear enough in previous comments. Now with the desired operations nailed down I will start working on the app. Once we have the app it will be much easier to discuss the behaviors.

chfleming commented 7 years ago

"Numeric Input" has up/down arrows (though they are ridiculously tiny on my webbrowser). Your idea of a sorted table is probably better though. It wouldn't hurt to have all three columns: timestamp, distance, speed, as outliers of distance are also likely to be outliers of speed. Maybe clicking on the distance/speed would sort the table by that quantity, and switch between distance/speed filtering approaches. Maybe that's too complicated, though.

xhdong-umd commented 7 years ago

You are right, I didn't even find the arrows. They are so tiny and only appears when I clicked it.

I'm not sure if we should mix the distance and speed outlier detection in same page. There are a lot of similarities, but keep them separated seemed to be more straightforward.

xhdong-umd commented 7 years ago

Are these speed values normal? I saw some 400 km/day values seemed to be valid data.

These speed are measured per hour, so the km/day value is assuming same speed for 24 hours. I'm not sure if this is the optimal unit we should use. Besides, I knew the normal speed range of human walking, running, cars, planes in km/hour, but I don't know them in km/day, so I cannot estimate if the speed value seemed to be valid based on those numbers directly.

speed_his_unit

chfleming commented 7 years ago

I don't know that the units need to be converted from m/s here, but just that the units needed to be given on the axes. For judging high speeds, I would think m/s or km/hr would be easiest for people.

400 km/day is not entirely unrealistic, depending on the species. Coarser data actually makes the distance/time estimate too small (when above the scale of error).

xhdong-umd commented 7 years ago

Oh I just used the ctmm::unit function to pick the best unit. So for speed we can always use m/s or km/hr, right? Which one should we use?

chfleming commented 7 years ago

ctmm:::unit was designed for reporting realistic speed estimates from the models. I will update ctmm:::unit to handle unrealistic scales of speed and you can keep your code as is.

xhdong-umd commented 7 years ago

OK, thanks! This is not in a hurry so please take your time. I don't want to interrupt your normal work flow.

jmcalabrese commented 7 years ago

Ok, getting up to speed with this discussion...

I agree that per individual outlier filtering is necessary
I like both the distance and speed facet histograms for identifying outliers
For infinite speeds (0 dt), perhaps adding a notification label on the panel for that individual would be an easy way to incorporate that info in the speed facet histogram without blowing up the x axis?
I think units of km/day are probably most intuitive, at least for larger animals. For small things, m/s might be better

jmcalabrese commented 7 years ago

Looking at these facet histograms got me wondering how well all of these multi-individual plots we're using are going to scale as the number of inds gets large. Some studies are now collecting data on hundreds of individuals. What would happen to these plots in such cases?

jmcalabrese commented 7 years ago

Regarding the speed arrows, I think I would need to see some examples of their use when the plot is zoomed in on the outlier and some surrounding times. I didn't find zoomed out view shown above useful as it is way too busy.

xhdong-umd commented 7 years ago

The studies with lots of individuals do have some challenges.

For the outlier detection

If the decision can be made based on histogram alone, then histogram don't have to be per individual. We could just draw one histogram and let user exclude points.

Then we can have a scatter plot highlighting the selected points in one color, all the points from same individual in one color, and the median center in one color.

For the general visualization

There are some facet plots become unusable with lots of individuals too.

we can add a control to select a subset of individuals in the data summary table. A range slider is easy to use but don't have fine control of specific individuals. Selecting rows in data summary table can have fine control but may be too cumbersome.

With lots of individuals, will user still analysis specific single individual step by step, or will he/she use batch operation?

xhdong-umd commented 7 years ago

After zoomed in and only draw related points, I found drawing paths are better than arrows:

path show the movement order, clearly show how the outlier was related to its time neighbor
the path length itself represented the speed, assuming most sampling time are same.

The overview plot with path drawn for outliers and their neighbors speed_outlier_path

Manual zoomed in on left top corner

speed_outlier_path_zoom1

On right bottom corner

speed_outlier_path_zoom2

This also show a limitation came from how the speed is defined now: for point 1, 2, i, ...n, I assign speed for i as distance(i,i+1)/dt , and the point n have no speed value.

From the plots we can see there really only have two outlier points, but the point right before the outlier get a abnormal speed value because the distance to outlier is high. @chfleming mentioned to use an average speed of i-1 -> i, i -> i+1. But I think with the outlier value like this, even the average still will be much higher than normal range, so those two point will still be marked as outliers.

I plan to have a table showing the points selected by the user selection on histogram, then user can select table rows and highlight the points. This way user can remove the 2 true outlier points manually.

I think it won't have a significant impact even user just remove them all. Besides, if user removed the location outlier first and reprojected the data, the speed outliers are gone already in this data.

xhdong-umd commented 7 years ago

I start to believe we only need one type of outlier detection. The speed method considered both distance to previous point and time passed. I would say any distance outlier will be a speed outlier, and speed outlier may not be distance outlier.

The only exception is the last point in data, which don't have a speed definition currently but could be a outlier in location. Then the point before it will have the high speed value, I can always assign the speed value of n-1 to n so the last point will be included in the process.

I'll build the app first then we can discuss base on the app.

chfleming commented 7 years ago

Thinking more about this, I think for the purposes of identify outliers the assignment of speed to sampled times should take the minimum adjacent value, like the following:

SPEED <- sqrt(diff(x)^2+diff(y)^2)/diff(t2)
n <- length(SPEED)
SPEED <- pmin( c(SPEED[1],SPEED) , c(SPEED,SPEED[n]) )

Using pmin like this will more cleanly separate the outliers from the regular data.

The only false positives then would be times 1,n, if times 2,n-1 were outliers. But that could be fixed a bit by extending the comparison

SPEED <- sqrt(diff(x)^2+diff(y)^2)/diff(t)
n <- length(SPEED)
SPEED <- pmin( c(sqrt((x[3]-x[1])^2+(y[3]-y[1])^2)/(t[3]-t[1]),SPEED) , c(SPEED,sqrt((x[n]-x[n-2])^2+(y[n]-y[n-2])^2)/(t[n]-t[n-2])) )

which would make it very unlikely for false positives, as they would require two outliers to occur.

dracodoc commented 7 years ago

If there are point i, i+1 both at far end but close to each other, the speed of i-(i+1) is small, then i, i+1 will take smaller value and looks like normal. So the big speed of (i-1)-i was not taken by any point and got lost.

This case also kind of defeat my plan of use speed only without distance. Though after i was removed then i+1 will be recognized after the reprojection, so it can work but that will be inefficient.

We should still keep distance and use it first, which is faster to select compare to speed for many cases.

chfleming commented 7 years ago

I think the likelihood of two outliers being close together should be extremely low.

vestlink commented 7 years ago

Would it be an idea to have a possibility for the user to be able to input the maximum speed the species in question is able to do?

xhdong-umd commented 7 years ago

@vestlink , the histogram will show all speed values grouped. User can select any speed value range in the histogram with mouse, then selected points will be highlighted for further inspection.

@chfleming , we need to explain how we define the speed to users. The single side speed or average speed are straightforward, but I'm not sure if user can understand why we choose pmin immediately.

Now I found the average of two sides have its merit: the real outlier will have high speed value, and the point before/after it will have a half speed value, which probably still well above the normal range. So the outliers and the points before/outliers will become different groups in histogram, make it easier to filter/select.

jmcalabrese commented 7 years ago

@xhdong-umd , correct me if I'm wrong, but it doesn't seem like it would be hard to just implement these these different definitions of speed and see which works best in practice. If we go with Chris' definition, then yes, it would have to be explained as it is not super intuitive.

xhdong-umd commented 7 years ago

Yes the speed definition is easy to change so we can try various approaches and compare them. I'll implement all of them and let's test with different datasets.

chfleming commented 7 years ago

If there are multiple outliers, with a range of deviations, then the average adjacent speed is going to mix up the ordering of outliers and their adjacents. This doesn't happen in the above example, where their deviations are comparable, but it is very common and I have seen multiple times. Minimum adjacent speed is less intuitive maybe, but it only gives false positives with back-to-back correlated outlier deviations (or back-to-back outliers on the ends), which is going to be exceptionally rare.

So in the comparison, include an example, real or simulated, with a range of outlier deviations.

xhdong-umd commented 7 years ago

@chfleming , in the extended definition of pmin speed, there will be N-1 values from diff on N points, so n = N-1. The second part c(SPEED,sqrt((x[n]-x[n-2])^2+(y[n]-y[n-2])^2)/(t[n]-t[n-2])) is actually using the speed from N-3 to N-1. Since the first part used 1 to 3, I assume you wanted to use N-2 to N, right?

chfleming commented 7 years ago

Your assumption is correct.

xhdong-umd commented 7 years ago

New updates to app on outlier detection:

The scatter plot will draw all points in color with small point size by default, any mouse selection in histogram will highlight points in range with much bigger point size. This make it easier to locate the points without need to drag the slider every time.
It seemed that if we remove a page from the sidebar it will not be visible no matter what. Now I just put both of outlier and time subsetting page to top level menu. You can switch to them by clicking on the side bar menu, or click the button in data summary of visualization page.

I have not push the new code to github repo yet. I'll wait until more features are added.

chfleming commented 7 years ago

Regarding duplicate times, if they are the result of truncation error, then it would be approximately ok to use instead of diff(t) directly,

dt <- diff(t)
if(any(dt==0)) { dt[dt==0] <- min(dt[dt>0])/2 }

and this might also work ok in general. Its certainly better than nothing.

xhdong-umd commented 7 years ago

So we assign half of minimal sampling time as time difference. If there are 3 points with same time, 2 of them will get half sampling time, then if we add dt to get the timestamp, the 3rd one will have same time with 4th point.

t1 = 1
t2 = 1
t3 = 1
t4 = 2

then

t1 = 1
t2 = t1 + dt = 1.5
t3 = t2 + dt = 2
t4 = 2

Of course I'm just inventing case here, this may be very rare.

dracodoc commented 7 years ago

I'm not sure about how duplicated times were generated.

Depends on the nature of the error, one approach is to interpolate timestamps. If 'i', 'i+1' have same time, we can interpolate 2 timestamps between 'i-1' and 'i+2' and assign to them. This can be a part of data cleaning if reasonable.

Intuitively a speed outlier should be a location outlier to its time neighborhoods, assuming even sampling time. So if the majority of sampling time is evenly spreaded, every point should not be too far from the points before and after it. This looks like become location outliers, but our location outlier is based on center of all points, not neighborhoods points.

Does it worth to calculate the distance to center of neighborhoods for all points? Or calculate this on the subset of high leaving speed? This can be seen as an extension of pmin definition.

chfleming commented 7 years ago

It is impossible for a single GPS device to take multiple fixes at exactly the same time, so, at minimum, there has to be some numerical truncation error with duplicate times, if those fixes are valid. I have seen cases where it is clear that the times were rounded off to precisely 1 minute or 1 second. I think bumping up dt to the max roundoff error would probably work reasonably well. This would underestimate the speed in the absence of telemetry error, but in this regime telemetry error is substantial and neglecting it as we are doing overestimates the speed, so some bias in the other direction from this operation doesn't worry me offhand.

If you calculated the distance from a neighborhood's center, what time would you divide by? The total period of the neighborhood? That seems like it would work, except that large gaps would mask any speed outlier in their neighborhood, rather than only being able to mask adjacent speed outliers.

Similar to your idea, (inside a for loop) incrementing backwards & forwards from the current time until you have non-zero time differences, and then taking the minimum speed from those two pairs could also work. (And if you hit an end point, then increment further in the other direction.) This is not vectorizable though, so its slightly slower.

I'm leaning towards bumping up dt to max truncation error.

xhdong-umd commented 7 years ago

I just did a little search on this issue.

This book (chapter Data Quality, about page 120 ) said records with duplicate time are definitely error, the problem is just to find out which one need to be removed.
trip package forward the time by 1 second recursively
movebank have some filters for user uploading data. It will mark the records with duplicate time as algorithm marked outlier starting from the second.

For speed outlier, it have several interesting algorithms. The valid anchor algorithm is interesting and simple, but need first point to be valid and cannot be vectorized.

ctmm-initiative / ctmmweb

Outlier detection #5

distance to median center histogram

distance scatter plot

speed histogram

speed scatter plot

speed arrows

For the outlier detection

For the general visualization