Open 2005m opened 3 years ago
It is a known bug. I suggest to convert to numeric. At least until the correction of this bug.
Thank you for the feedback. Any idea when these 2 bugs will be fixed?
Your example is not correct. R's order has as defualt for NAs to remove while nth not.
Rfast::nth(x0,10,descending = T,index.return = T,na.rm = TRUE)
order(x0, decreasing=TRUE)[10]
As you can see, the result is again false! This is because you have dublicate values in you vector. Rfast::nth
is not a stable algorithm while order is. This is not wrong because the index result will give you again correct values. You can test it using Rfast::Order
which lets you decide which algorithm to run, stable or not.
Rfast::Order(x0,stable = T,descending = T)[10]
Tip: Rfast::Order
seems to be slow compared to R's. Just test them both using a vector length of 10^7
. Rfast::Order
will run while R's will choose another algorithm leading to be slow…
It will take some time to correct the bug because I have found a quite faster implementation and I am planning to replace it. Be patient.
If you are interested, there is a faster implementation in my package kit
and it behaves exactly like base R order
with NA. You can check the tests folder.
In the last month I have heard it many times. I know about topn
function but it doesn't behave exactly like R's. It does what it says. Get the top elements and specifically get the 1000 top elements. So, your algorithm is a partial sorting algorithm and to be exactly fair you must compare kit::topn
with Rfast::Order
or Rfast::Sort
. It is unfair to use Rfast::nth
because it misses a step according to your algorithm so the results will not be the same.
x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 100)
milliseconds
min mean max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 12.870 18.62030 56.023
b <- kit::topn(x, k) 5.707 6.74043 11.686
> all.equal(a[1:k],b)
[1] TRUE
> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 100)
milliseconds
min mean max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 1.3546 1.91913 9.6779
b <- kit::topn(x, k) 15.5001 29.01403 124.9544
> all.equal(a[1:k],b)
[1] TRUE
> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 10)
milliseconds
min mean max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 109.5838 136.65396 152.9554
b <- kit::topn(x, k) 33.5409 41.97665 69.8838
> all.equal(a[1:k],b)
[1] TRUE
> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 10)
milliseconds
min mean max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 114.0500 154.9798 325.5320
b <- kit::topn(x, k) 85.4553 154.9910 258.6717
> all.equal(a[1:k],b)
[1] TRUE
Unfortunately I cannot compare with greater k. Your algorithm is optimized for the top elements of a vector, about 5-10% while Order not. Order and nth uses C++'s nth_element
. I tried to beat the algorithm but it is optmisized. I believe your algorithm is great but it needs more works to be fully compared with C++'s. My opinion, if you succeed beating nth_element
just write a paper and then add it to your package.
I would be glad to add your name and your algorithm in Rfast.
You can try with hasna=FALSE
. I find it a bit unfortunate that Rfast in general does not deal with NA. Of course I can understand that it would make it much slower. I will have a look at nth_element. I have a very fast sorting algorithm which works for integer, logical, double...deal with NA
and is fully parallelised. It is actually faster than data.table::fsort
https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf. I haven't found anything as fast so far. I don't think I will publish it for now...it is gold... but you probably know these stuff ;-)
Oh and by the way ...topn in dev defaults to order for high n but high n is not really the purpose...for high n a proper sorting algo need to be implemented.
We cannot deal with NAs. Although in some utilities I have added as an option. If it is so fast as you said I suggest to test it with c++'s sort which is the fastest in the world, unparallelised. There is a parallel version but is unsupported by R for now. So, test them in microsoft visual or even better in linux. You will see tremendous differents. I will be glad to see your algorithm once you are about to published it.
I know it is not the purpose that is why I told you how you could compare it.
Are you making reference to std::sort in c++? Please send me links and benchmark that you have so I can compare. Sorting algorithms performance vary a lot depending on the type of data, the length of the vector to sort, and how randomly distributed data are. Thank you.
Yes I am referring to std::sort
. I don't have links and benchmarks but I am sure you can find some on your own. The benchmarks depends on the kind o f data as you said mostly depends on the compiler!
I found that descending
is not working. For example, should this call return 90 instead?
Rfast::nth(1:100, 10, descending = TRUE)
#> [1] 9
Created on 2024-01-30 with reprex v2.1.0
Sorry, I know now that 1:100
will be treated as integer in R. For now, I should use this.
Rfast::nth(as.numeric(1:100), 10, descending = TRUE)
#> [1] 91
Created on 2024-01-30 with reprex v2.1.0
It seems
nth
does not work with integersand with numeric value it does not seem consistent with
base::order