Open chiraganand opened 1 year ago
DataFrames.jl takes advantage of data being sorted. The issue is as follows:
julia> @time outerjoin(ts1.coredata, ts2.coredata, on=:Index, makeunique=true);
0.269071 seconds (257 allocations: 411.998 MiB)
julia> @time join(ts1, ts2; jointype=:JoinAll);
0.587725 seconds (355 allocations: 1.310 GiB, 17.48% gc time)
So the major problem is that the total cost of the join is not in outerjoin, but in the TSFrame constructor invoked later.
Now your joinnew function should be:
function joinnew(index1, index2)
    i = 1
    j = 1
    last1 = lastindex(index1)
    last2 = lastindex(index2)
    # Walk both sorted indexes in step, advancing whichever side is behind.
    while (i <= last1 && j <= last2)
        if index1[i] < index2[j]
            # push!(result, (index1[i], missing))
            i += 1
        elseif index1[i] == index2[j]
            # push!(result, (index1[i], index2[j]))
            i += 1
            j += 1
        else
            # push!(result, (missing, index2[j]))
            j += 1
        end
    end
    # Drain whatever is left of index1.
    while (i <= last1)
        # push!(result, (index1[i], missing))
        i += 1
    end
    # Drain whatever is left of index2.
    while (j <= last2)
        # push!(result, (missing, index2[j]))
        j += 1
    end
end
and then the timing you get is:
julia> @time joinnew(index(ts1), index(ts2));
0.016491 seconds
which you probably wanted.
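The loop above only walks the two indexes; the push! calls are commented out. As a rough sketch of what storing the result could look like, the version below returns the merged index plus, for each output row, the source row in each input (or missing). The name merge_sorted_index is hypothetical and not part of TSFrames.jl or DataFrames.jl, and the sketch assumes both inputs are sorted, free of duplicates, and share an element type.

```julia
# Sketch only: two-pointer outer join over two sorted, duplicate-free indexes.
# `merge_sorted_index` is a hypothetical helper, not an existing API.
function merge_sorted_index(index1::AbstractVector{T}, index2::AbstractVector{T}) where {T}
    merged = T[]                    # union of both indexes, in sorted order
    rows1 = Union{Int,Missing}[]    # row of index1 feeding each output row
    rows2 = Union{Int,Missing}[]    # row of index2 feeding each output row
    upper = length(index1) + length(index2)
    sizehint!(merged, upper); sizehint!(rows1, upper); sizehint!(rows2, upper)

    i, j = firstindex(index1), firstindex(index2)
    last1, last2 = lastindex(index1), lastindex(index2)
    while i <= last1 && j <= last2
        a, b = index1[i], index2[j]
        if a < b                    # key only in index1
            push!(merged, a); push!(rows1, i); push!(rows2, missing)
            i += 1
        elseif a == b               # key in both
            push!(merged, a); push!(rows1, i); push!(rows2, j)
            i += 1; j += 1
        else                        # key only in index2
            push!(merged, b); push!(rows1, missing); push!(rows2, j)
            j += 1
        end
    end
    while i <= last1                # leftover keys of index1
        push!(merged, index1[i]); push!(rows1, i); push!(rows2, missing)
        i += 1
    end
    while j <= last2                # leftover keys of index2
        push!(merged, index2[j]); push!(rows1, missing); push!(rows2, j)
        j += 1
    end
    return merged, rows1, rows2
end

merged, rows1, rows2 = merge_sorted_index([1, 3, 5, 7], [2, 3, 6, 7, 9])
```

The row vectors could then be used to gather the data columns of each input, so the output comes out already ordered by index and no later sort is needed.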
What we could do:
- [...] outerjoin, leftjoin and rightjoin that keep sorting order (now they store non-matching rows at the end of the data frame, so you need to sort it later). This would improve things.
- [...] the :Index column? Because it would be another optimization.

The point is that TSFrames.jl has many constraints that DataFrames.jl cannot assume, and using knowledge of these constraints will speed up operations.

Is this clear? (And if yes, how would you want to move forward with the issue?)
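To make the sorting-order point concrete, here is a small sketch in plain Julia. It does not call DataFrames.jl; the vcat line is only a stand-in that mimics a join placing matched keys first and non-matching keys at the end, and the variable names are made up for illustration.

```julia
# Sketch only: why a join that appends non-matching rows needs a sort afterwards.
left  = [1, 3, 5, 7]
right = [2, 3, 6, 7, 9]

# Stand-in for the join output order: matched keys, then left-only, then right-only.
joined_keys = vcat(intersect(left, right), setdiff(left, right), setdiff(right, left))

# The result is no longer ordered by key, so an extra O(n log n) step is required.
perm = sortperm(joined_keys)
restored = joined_keys[perm]
```

The same perm would have to be applied to every data column, which is extra work that a join preserving the sorted order would avoid.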
In TSFrames.jl the join function looks like:
In the examples below the R function performs better than TSFrames.join() by a factor of almost 6x.
R
Julia
I tried writing a simple while loop for joining two TSFrame objects, but only going through that loop and doing all the if-else comparisons 10 million times takes more than 2 seconds without storing the results. I was assuming the algo below is taking advantage of the fact that the :Index column is already sorted, so you can go through both indexes sequentially, which should save a lot of computation.
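One general Julia point that may be relevant to the 2-second figure: a loop run at top level over non-constant global variables is usually much slower than the same loop inside a function, and the first call to a function also includes compilation time. The sketch below is an assumption about how one might measure the sequential pass, not a reproduction of the original experiment; count_matches and the generated test data are hypothetical.

```julia
# Sketch only: time the sequential pass inside a function, on plain index vectors.
# `count_matches` is a hypothetical helper; both inputs are assumed to be sorted.
function count_matches(index1, index2)
    i, j = firstindex(index1), firstindex(index2)
    last1, last2 = lastindex(index1), lastindex(index2)
    matches = 0
    while i <= last1 && j <= last2
        if index1[i] < index2[j]
            i += 1
        elseif index1[i] == index2[j]
            matches += 1
            i += 1
            j += 1
        else
            j += 1
        end
    end
    return matches
end

# Made-up sorted indexes, only for timing purposes.
index1 = collect(1:2:2_000_000)
index2 = collect(1:3:2_000_000)

count_matches(index1, index2)        # first call includes compilation
@time count_matches(index1, index2)  # second call measures the loop itself
```

Passing the index vectors as function arguments lets the compiler specialize the loop on their concrete types, which is typically what makes such a merge pass cheap.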