RfastOfficial / Rfast

A collection of Rfast functions for data analysis. Note 1: The vast majority of the functions accept matrices only, not data.frames. Note 2: Do not have matrices or vectors with have missing data (i.e NAs). We do no check about them and C++ internally transforms them into zeros (0), so you may get wrong results. Note 3: In general, make sure you give the correct input, in order to get the correct output. We do no checks and this is one of the many reasons we are fast.
143 stars 19 forks source link

nth error/bug ? #43

Open 2005m opened 3 years ago

2005m commented 3 years ago

It seems nth does not work with integers

x0 = c(3L, 2L, 10L, NA_integer_, 1L, 1L, NA_integer_,  NA_integer_, 10L, 20L, 20L, 20L, 30L)
order(x0, decreasing=TRUE)[10]
# [1] 6
Rfast::nth(x0, 10L, descending = TRUE,index.return = TRUE)
# Error in Rfast::nth(x0, 10L, descending = TRUE, index.return = TRUE) : 
# vector

and with numeric value it does not seem consistent with base::order

x0 = as.numeric(c(3L, 2L, 10L, NA_integer_, 1L, 1L, NA_integer_,  NA_integer_, 10L, 20L, 20L, 20L, 30L))
order(x0, decreasing=TRUE)[10]
# [1] 6
Rfast::nth(x0, 10L, descending = TRUE, index.return = TRUE)
# [1] 10
ManosPapadakis95 commented 3 years ago

It is a known bug. I suggest to convert to numeric. At least until the correction of this bug.

2005m commented 3 years ago

Thank you for the feedback. Any idea when these 2 bugs will be fixed?

ManosPapadakis95 commented 3 years ago

Your example is not correct. R's order has as defualt for NAs to remove while nth not.

Rfast::nth(x0,10,descending = T,index.return = T,na.rm = TRUE)
order(x0, decreasing=TRUE)[10]

As you can see, the result is again false! This is because you have dublicate values in you vector. Rfast::nth is not a stable algorithm while order is. This is not wrong because the index result will give you again correct values. You can test it using Rfast::Order which lets you decide which algorithm to run, stable or not.

Rfast::Order(x0,stable = T,descending = T)[10]

Tip: Rfast::Order seems to be slow compared to R's. Just test them both using a vector length of 10^7. Rfast::Order will run while R's will choose another algorithm leading to be slow…

It will take some time to correct the bug because I have found a quite faster implementation and I am planning to replace it. Be patient.

2005m commented 3 years ago

If you are interested, there is a faster implementation in my package kit and it behaves exactly like base R order with NA. You can check the tests folder.

ManosPapadakis95 commented 3 years ago

In the last month I have heard it many times. I know about topn function but it doesn't behave exactly like R's. It does what it says. Get the top elements and specifically get the 1000 top elements. So, your algorithm is a partial sorting algorithm and to be exactly fair you must compare kit::topn with Rfast::Order or Rfast::Sort. It is unfair to use Rfast::nth because it misses a step according to your algorithm so the results will not be the same.

n = 10^5, k = 100

x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 100)
  milliseconds 
                                                                 min     mean    max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 12.870 18.62030 56.023
b <- kit::topn(x, k)                                           5.707  6.74043 11.686

> all.equal(a[1:k],b)
[1] TRUE

n = 10^7, k = 990

> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 100)
   milliseconds 
                                                                  min     mean      max
a <- Rfast::Order(x, descending = T, stable = T, partial = k)  1.3546  1.91913   9.6779
b <- kit::topn(x, k)                                          15.5001 29.01403 124.9544
> all.equal(a[1:k],b)
[1] TRUE

n = 10^7, k = 100

> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 10)
   milliseconds 
                                                                   min      mean      max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 109.5838 136.65396 152.9554
b <- kit::topn(x, k)                                           33.5409  41.97665  69.8838
> all.equal(a[1:k],b)
[1] TRUE

n = 10^7, k = 990

> x=rnorm(n)
> Rfast2::benchmark(a<-Rfast::Order(x,descending = T,stable = T,partial = k),b<-kit::topn(x,k),times = 10)
   milliseconds 
                                                                   min     mean      max
a <- Rfast::Order(x, descending = T, stable = T, partial = k) 114.0500 154.9798 325.5320
b <- kit::topn(x, k)                                           85.4553 154.9910 258.6717
> all.equal(a[1:k],b)
[1] TRUE

Unfortunately I cannot compare with greater k. Your algorithm is optimized for the top elements of a vector, about 5-10% while Order not. Order and nth uses C++'s nth_element. I tried to beat the algorithm but it is optmisized. I believe your algorithm is great but it needs more works to be fully compared with C++'s. My opinion, if you succeed beating nth_element just write a paper and then add it to your package.
I would be glad to add your name and your algorithm in Rfast.

2005m commented 3 years ago

You can try with hasna=FALSE. I find it a bit unfortunate that Rfast in general does not deal with NA. Of course I can understand that it would make it much slower. I will have a look at nth_element. I have a very fast sorting algorithm which works for integer, logical, double...deal with NA and is fully parallelised. It is actually faster than data.table::fsort https://www.r-project.org/dsc/2016/slides/ParallelSort.pdf. I haven't found anything as fast so far. I don't think I will publish it for now...it is gold... but you probably know these stuff ;-)

Oh and by the way ...topn in dev defaults to order for high n but high n is not really the purpose...for high n a proper sorting algo need to be implemented.

ManosPapadakis95 commented 3 years ago

We cannot deal with NAs. Although in some utilities I have added as an option. If it is so fast as you said I suggest to test it with c++'s sort which is the fastest in the world, unparallelised. There is a parallel version but is unsupported by R for now. So, test them in microsoft visual or even better in linux. You will see tremendous differents. I will be glad to see your algorithm once you are about to published it.

I know it is not the purpose that is why I told you how you could compare it.

2005m commented 3 years ago

Are you making reference to std::sort in c++? Please send me links and benchmark that you have so I can compare. Sorting algorithms performance vary a lot depending on the type of data, the length of the vector to sort, and how randomly distributed data are. Thank you.

ManosPapadakis95 commented 3 years ago

Yes I am referring to std::sort. I don't have links and benchmarks but I am sure you can find some on your own. The benchmarks depends on the kind o f data as you said mostly depends on the compiler!

psychelzh commented 8 months ago

I found that descending is not working. For example, should this call return 90 instead?

Rfast::nth(1:100, 10, descending = TRUE)
#> [1] 9

Created on 2024-01-30 with reprex v2.1.0

psychelzh commented 8 months ago

Sorry, I know now that 1:100 will be treated as integer in R. For now, I should use this.

Rfast::nth(as.numeric(1:100), 10, descending = TRUE)
#> [1] 91

Created on 2024-01-30 with reprex v2.1.0