Yes I agree, it would be easier in R, as it does not like nested hierarchies of data.
BTW: this representation is as complex as the current one, as it bears the same data complexity. However, it is a flatter design.
I may use different names, but I agree that would be a good representation.
PS - it would still be good to have the option of whether or not to output the second object (as is the case now), to save time/computation power for users who are not interested in the detail
The option already exists: reportTrajectories.
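(A minimal usage sketch - only the reportTrajectories name is confirmed in this thread; the other argument names are assumptions and may differ from the actual signature:)

d <- fmQueryComputeAntInteractions(experiment,
                                   start = startTime,
                                   end = endTime,
                                   reportTrajectories = FALSE) # skip the trajectory detail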
Yes I know it already exists... I was afraid that my initial description might suggest that I always wanted the trajectory detail to be output, so I thought it would be safer to add that I think it's a very useful option and that we should keep it in :-)
Don't worry. It's there because there is a real cost: it almost doubles the query time, not to compute it, but just to transfer it to R.
Since you will, I bet, never compare interaction trajectories between two interactions (they are not even time-related!), I would suggest the following structure:
res <- list("summary"        = data.frame("ant1", "ant2", "start", "end"),
            "types"          = list(matrix()),
            "ant1Trajectory" = list(data.frame("X", "Y", "Angle", "Zone")),
            "ant2Trajectory" = list(data.frame("X", "Y", "Angle", "Zone")))
With such a result, you would use which to access all interactions for ant 42:
interactionsRowIndexes <- which(res$summary$ant1 == 42 | res$summary$ant2 == 42)
interactionSummary <- res$summary[interactionsRowIndexes, ]
interactionsTypes  <- res$types[interactionsRowIndexes]
ant1Trajectories   <- res$ant1Trajectory[interactionsRowIndexes]
ant2Trajectories   <- res$ant2Trajectory[interactionsRowIndexes]
Now, if you want to prune down to only those interactions where a 2-2 interaction exists:
containsInteractionTypes <- function(types, type1, type2) {
  # use the vectorized & and |, not && and ||, to compare whole columns
  any((types[,1] == type1 & types[,2] == type2) | (types[,1] == type2 & types[,2] == type1))
}
idxs <- which(sapply(interactionsTypes, containsInteractionTypes, wanted1, wanted2))
finalSummary <- interactionSummary[idxs, ] # only the wanted ant and the wanted interaction type
finalTypes   <- interactionsTypes[idxs]
...
Means can easily be computed in R and added to the summary:
summaryWithMeans <- cbind(summary,
                          do.call(rbind, lapply(ant1Trajectory, function(t) colMeans(t[, c("X", "Y", "Angle")]))),
                          do.call(rbind, lapply(ant2Trajectory, function(t) colMeans(t[, c("X", "Y", "Angle")]))))
names(summaryWithMeans) <- c("ant1", "ant2", "start", "end",
                             "mean_x_1", "mean_y_1", "mean_angle_1",
                             "mean_x_2", "mean_y_2", "mean_angle_2")
So TL;DR: instead of repeating the row information as a column (the information is already there: it's the row index of the data frame!), which can be a lot of data for a large dataset, we just make sure the lists have indices matching the rows of the summary data.frame.
Also, I think we should separate everything into three lists: one list of interaction types, always returned, and two lists of data frames, one for each ant's trajectory.
I still believe that reporting the interaction type as a two-column matrix is the better option. We can write utility functions like the example above to efficiently test for an interaction type. For example, to test for interactions involving type 42 in either of the ants:
interactsWith <- function(types, type) {
  any(types[,1] == type | types[,2] == type)
}
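For instance, reusing the variables from the earlier snippet to keep only interactions where either ant is of type 42:

idxs42 <- which(sapply(interactionsTypes, interactsWith, 42))
summaryFor42 <- interactionSummary[idxs42, ]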
Such functions are easy to write, are a good design, and will be more efficient than a string search.
I see why you are suggesting this, but I'm not sure it is as flexible for our usage as my suggestion.
Let's say I want to select all interactions that lasted longer than 10 minutes - with my version, I would write:
long_interaction <- interactions_summary[which(interactions_summary$end - interactions_summary$start >= min_time), ]
And that would be it - I could then create my network in the next line. With your version there would still be a need to iterate over the list...right?
I might want to subset my interaction list on many criteria - for example, if I wanted to keep only interactions that remained within a small area, I would write:
to_exclude <- unique(interaction_detail[which(interaction_detail$X < Xmin | interaction_detail$X > Xmax | interaction_detail$Y < Ymin | interaction_detail$Y > Ymax), "interaction_index"])
interactions_subset <- interactions_summary[which(!interactions_summary$interaction_index %in% to_exclude), ]
Basically, with R, to make use of the very efficient subsetting functions, it is crucial for the data to be in data.frame form - and a concatenation of all interactions in a single data.frame is much easier to handle than a list, even if it is bigger.
Actually my first example is not relevant since you are already suggesting to return interaction_summary as a data.frame
But another example of why I would prefer a data frame with all trajectories:
If I want to extract the mean X and mean Y of all interactions, it is a one-liner with a data.frame:
coordinates <- aggregate(cbind(X, Y) ~ interaction_index + antID, FUN = mean, data = interaction_details)
I do not know how to do it so easily and so quickly over a list of trajectories.
And the reason why I would like the interaction_index in a specific column within interactions_summary is that many R functions end up reshuffling the rows, and then the row numbers would no longer match the order in the trajectories... so I see it as a safety net for cases where the summary table rows get reshuffled.
Adding a row number to the summary if you need it is a one-liner in R:
summary$id <- seq_len(nrow(summary)) # nrow(), not length(): length() on a data.frame counts columns
With how data frames are constructed in S (R is based on the S language), adding this later has no impact on performance, apart from the memory use, which could already be critical in some applications. So I prefer the end user to add this one-liner if they need it... and to keep the possibility of not adding it if they don't.
And I still believe, from my experience with other tools where this issue could happen, that you are doing something wrong when you lose that information; functions that reorder your dataset most often give you the reordering vector so you don't lose the correspondence with your original vector.
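For example (a sketch using the res structure proposed above; order() is base R's way of handing you that reordering vector):

perm <- order(res$summary$start)          # the reordering vector
sortedSummary <- res$summary[perm, ]      # summary sorted by start time
sortedTraj1 <- res$ant1Trajectory[perm]   # trajectory lists follow the same permutation
sortedTraj2 <- res$ant2Trajectory[perm]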
Applying a function to every element of a list is very easy using lapply, which can also take an anonymous function. Getting the mean X/Y for an ant is as simple as:
lapply(ant1Trajectory, colMeans)
The advantage is that if you want to find the trajectory for a particular interaction, my solution is O(1) (that means instantaneous), and yours is O(n). That means instantaneous when there are 10 interactions, but the cost grows linearly with the number of interactions and could go up to several seconds for fetching a single interaction... So I do not see the point of aggregating all the unrelated trajectories in one big data frame: you are just less efficient, and there are efficient R methods that let you manipulate data in lists, as long as your lists contain objects of the same kind...
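To make the access-cost difference concrete, a sketch reusing the names from both proposals (i being some interaction index):

# list layout: a direct element lookup, O(1)
traj1 <- res$ant1Trajectory[[i]]
# single-data.frame layout: every row must be scanned to find interaction i, O(n)
traj1 <- interaction_detail[interaction_detail$interaction_index == i, ]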
I will test the two methods with mock data, but my bet is that since lapply knows exactly the output size it needs, it should be faster. The aggregate method cannot know in advance how many interactionId / antId values exist in your dataset, and therefore needs a first pass to determine this information... Again, small optimizations, but added up on a large dataset they mean the difference between minutes and days of computation.
In my experience, lapply is much, much slower than aggregate. The reason being (I think) that aggregate is based on an underlying C++ for loop whereas lapply is based on an underlying R loop. In my past code, anything that used lapply very quickly became unmanageable from a processing-time point of view - hence why we prefer aggregate and data.frames. There are also many other optimised functions available for data.frames that simply don't have equivalents for lists.
I strongly disagree with you:
library(rbenchmark)
generateMockData <- function(nbInteractions = 10) {
  print("building ant1")
  summary = data.frame("ant1" = floor(runif(nbInteractions, 1, 999)))
  print("building ant2")
  summary$ant2 = floor(runif(nbInteractions, summary$ant1, 1000))
  print("determining lengths")
  lengths = floor(runif(nbInteractions, 100, 2500))
  endIndexes = cumsum(lengths)
  startIndexes = endIndexes - lengths + 1 # each trajectory starts right after the previous one ends
  largeDFSize = 2 * sum(lengths)
  res = list("summary" = summary)
  print("building interaction IDs")
  res$asLargeDF = data.frame("ID" = c(rep(1:nbInteractions, lengths),
                                      rep(1:nbInteractions, lengths)))
  print("building ant rows")
  res$asLargeDF$ant = c(rep(summary$ant1, lengths), rep(summary$ant2, lengths))
  print("building X positions")
  res$asLargeDF$X = runif(largeDFSize, 0, 1920)
  print("building Y positions")
  res$asLargeDF$Y = runif(largeDFSize, 0, 1080)
  print("building list 1")
  res$asDFList1 = lapply(1:nbInteractions, function(i) {
    res$asLargeDF[startIndexes[i]:endIndexes[i], c("X", "Y")]
  })
  print("building list 2")
  res$asDFList2 = lapply(1:nbInteractions, function(i) {
    res$asLargeDF[sum(lengths) + (startIndexes[i]:endIndexes[i]), c("X", "Y")]
  })
  return(res)
}
meanFromSeparateList <- function(list1, list2) {
  summaryFromList <- data.frame(do.call(rbind, lapply(list1, colMeans)),
                                do.call(rbind, lapply(list2, colMeans)))
  names(summaryFromList) <- c("mean_x_1", "mean_y_1", "mean_x_2", "mean_y_2")
  return(summaryFromList)
}
meanFromLargeDF <- function(largeDF) {
  aggregate(cbind(X, Y) ~ ID + ant, FUN = mean, data = largeDF)
}
data <- generateMockData(25000)
benchmark(meanFromLargeDF(data$asLargeDF),
          meanFromSeparateList(data$asDFList1, data$asDFList2),
          replications = 10)
Yields:
| id | test | replications | elapsed | relative | user.self | sys.self | user.child | sys.child |
|----|------|--------------|---------|----------|-----------|----------|------------|-----------|
| 1 | meanFromLargeDF(data$asLargeDF) | 10 | 696.276 | 30.176 | 649.610 | 46.332 | 0 | 0 |
| 2 | meanFromSeparateList(data$asDFList1, data$asDFList2) | 10 | 23.074 | 1.000 | 23.037 | 0.024 | 0 | 0 |
Let M be the total number of trajectory points across all interactions, and N the number of interactions. Note that M will be one to several orders of magnitude bigger than N with real data (in the mock data above, N = 25,000 and M ≈ 6.5 × 10^7).
By definition, aggregate's complexity is O(M·log(M)) + O(N): it first has to order the columns by antID and interaction ID, which is theoretically an O(M·log(M)) operation using quicksort. Furthermore, it has to sort on two keys and use the stable version, which still has O(M·log(M)) complexity but is almost an order of magnitude slower than the unstable version of quicksort. I do not know the algorithm they actually use, but I am talking about a 40-year-old standard which gives the best results (you can always do marginally better in particular cases, but never really outperform quicksort).
My solution is simply O(N), because each trajectory does not have many points: for large N we can assume M/N << N, so the per-trajectory work is bounded. So basically, your solution is impacted by an O(log(M)) factor, and on a large dataset that has in practice a very big impact on performance, as shown in this example.
My position remains that, for a general-purpose API that should fit all cases, it is better not to accumulate data that is already sorted into one big data frame that we then ask the computer to re-sort at every operation. Do not mix the data: it is often less computationally and memory intensive to keep it separate.
And that holds even though R's lapply has a completely laughable implementation in terms of performance.
Whelp - indeed that is very convincing. I never use do.call with lapply - do you think it makes any difference in processing time?
What about subsetting, though? If I want to select all interactions that satisfy a certain criterion based on the trajectories, using "which" on a data.frame is very quick - but I don't see a way to do this on a list without using a for loop or lapply, which are slow... do you have a suggestion? You said there was something very wrong with my example code, but it would do the job quickly - what is the alternative to "which" on a list?
Fixed in versions v0.5.0 and v0.6.0
This is just summarising what we discussed on Slack last night. I suggest the following:
fmQueryComputeAntInteractions returns a list of two objects, as currently implemented: a $trajectories object that contains the trajectories for the time window specified, and an $interactions object that contains the interactions for the same time window
the $interactions object itself would be a list of two objects: a data.frame $interaction_summary and a data.frame $interaction_detail:
$interaction_summary would be a data.frame containing the following columns:
$interaction_detail would be a data.frame with the following columns:
So, if an example interaction i lasts 10 frames - and assuming that both ants were detected continuously during these 10 frames - this would lead to a single line in interaction_summary, and 20 lines in interaction_detail (one line for each frame and for each ant), and all these lines would share the same interaction_index i.
If another interaction lasts 10 frames, but this time one ant was only detected for 5 frames during these 10, then this would lead to 15 lines in interaction_detail (10 frames for the first ant and only 5 for the second)
As both objects are data.frames, they would be easy to manipulate and summarise with R built-in functions - so a user could easily calculate the mean location, the majority zone, the start location (or anything else they might be interested in) for each interaction without iterating over a list with a for loop, which is very slow in R. Having all the capsules involved for each ant at each frame would also allow the user to get more detailed information if they are interested in the directionality of the interaction, or in the proportion of the interaction during which there was close contact.
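For example, a sketch assuming hypothetical interaction_index, antID, X, Y and zone columns in $interaction_detail:

# mean location of each ant in each interaction
mean_loc <- aggregate(cbind(X, Y) ~ interaction_index + antID,
                      FUN = mean, data = interaction_detail)
# majority zone of each ant in each interaction
maj_zone <- aggregate(zone ~ interaction_index + antID,
                      FUN = function(z) names(which.max(table(z))),
                      data = interaction_detail)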
I think this could solve both issues of (i) us wanting data.frames, and (ii) you not being too happy about including mean locations by default as the requirement would vary per user.
This would also be super useful to have sooner rather than later as we are going to delve into Adriano's data soon