Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Extend functionality of nafill to use 'fill' argument for all 'type's #3594

Closed ben519 closed 3 years ago

ben519 commented 5 years ago

Suppose I have this vector

x = c(NA,1,NA,NA,5,3,NA,NA)

and I want to fill NAs with the preceding non NA values. I can do this

nafill(x = c(NA,1,NA,NA,5,3,NA,NA), type = "locf")
[1] NA  1  1  1  5  3  3  3

Great, but sometimes I want to specify a fill value to catch the NA(s) at the front of the vector. I tried this which seemed obvious to me,

nafill(x = c(NA,1,NA,NA,5,3,NA,0), type = "locf", fill = -1)

but it didn't work and instead gave a warning, "argument 'fill' ignored, only make sense for type='const'".

My request is to extend the method so that 'fill' is applied to the front/back of the vector for types 'locf' and 'nocb' respectively. Thanks

Output of sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin18.5.0 (64-bit)
Running under: macOS Mojave 10.14.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.6/lib/libopenblasp-r0.3.6.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.3

loaded via a namespace (and not attached):
[1] compiler_3.6.0 tools_3.6.0

jangorecki commented 5 years ago

Thanks. For now you can use

nafill(nafill(x = c(NA,1,NA,NA,5,3,NA,0), type = "locf"), fill = -1)
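
For reuse, the two-step workaround above can be wrapped in a small helper. This is only a sketch: the function names `nafill_locf_const` and `nafill_nocb_const` are hypothetical and not part of data.table. It assumes a data.table version that ships `nafill`, and relies on `type = "const"` being the default for the outer call.

```r
library(data.table)

# Hypothetical helpers (not part of data.table): carry values forward or
# backward first, then replace whatever NAs are left with a constant.
nafill_locf_const = function(x, fill) {
  nafill(nafill(x, type = "locf"), type = "const", fill = fill)
}
nafill_nocb_const = function(x, fill) {
  nafill(nafill(x, type = "nocb"), type = "const", fill = fill)
}

x = c(NA, 1, NA, NA, 5, 3, NA, NA)

nafill_locf_const(x, fill = -1)
# [1] -1  1  1  1  5  3  3  3

nafill_nocb_const(x, fill = -1)
# [1]  1  1  5  5  5  3 -1 -1
```

With "locf" only leading NAs can survive the first pass, and with "nocb" only trailing ones, so the constant lands exactly where the request asks for it.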

saraswatmks commented 5 years ago

@jangorecki if no one else is working on this, can I take it up ?

MichaelChirico commented 5 years ago

@saraswatmks I just assigned you, go for it!

jangorecki commented 5 years ago

@saraswatmks note that nafill tests are in inst/tests/nafill.Rraw, so new tests should go there. Keeping them in that file helps avoid merge conflicts and lets you run only this script: test.data.table(script="inst/tests/nafill.Rraw")

saraswatmks commented 5 years ago

@MichaelChirico thanks! I see this function is still in dev. How do I reproduce it on my local machine ? My local master is up to date with remote. If I do Clean and Rebuild, I get this error (I am kind of stuck on this):

==> R CMD INSTALL --preclean --no-multiarch --with-keep.source data.table

* installing to library ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library’
* installing *source* package ‘data.table’ ...
** libs
/usr/local/opt/llvm/bin/clang -fopenmp -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/usr/local/opt/gettext/include -I/usr/local/opt/llvm/include  -fopenmp -fPIC  -g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe -c assign.c -o assign.o
In file included from assign.c:1:
In file included from ./data.table.h:1:
/Library/Frameworks/R.framework/Resources/include/R.h:55:11: fatal error: 'stdlib.h' file not found
# include <stdlib.h> /* Not used by R itself, but widely assumed in packages */
          ^~~~~~~~~~
1 error generated.
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/data.table’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/data.table’

Exited with status 1.

jangorecki commented 5 years ago

@saraswatmks I found that RStudio features like Clean and Rebuild actually cost me more time debugging issues than they saved, although for packages with no compiled code they were much more reliable. Anyway, I suggest using cc(), which is much faster and, AFAIR, has never introduced issues that wasted my time debugging. More info in https://github.com/Rdatatable/data.table/tree/master/.dev

MichaelChirico commented 5 years ago

@saraswatmks regarding your installation error:

/Library/Frameworks/R.framework/Resources/include/R.h:55:11: fatal error: 'stdlib.h' file not found
# include <stdlib.h> /* Not used by R itself, but widely assumed in packages */

I just came across the same issue on my new laptop. It appears something in the installation of Developer Tools was mixed up. Found this comment on another repo:

https://github.com/catboost/catboost/issues/137#issuecomment-424595790

And it worked on my machine. Hope it can help.

jangorecki commented 4 years ago

related issue https://github.com/Rdatatable/data.table/issues/3700

MichaelChirico commented 4 years ago

Just happened on a use case for this playing around with COVID data 😃

library(data.table)
URL = file.path(
  'https://raw.githubusercontent.com',
  'nytimes/covid-19-data/master/us-counties.csv'
)
covid = fread(URL, colClasses = c(date = 'IDate'), key = 'state,county,date')
covid[state == 'Pennsylvania', dcast(.SD, date ~ county, value.var = 'cases')][ , 1:5]
#           date Adams Allegheny Armstrong Beaver
#  1: 2020-03-06    NA        NA        NA     NA
#  2: 2020-03-07    NA        NA        NA     NA
#  3: 2020-03-08    NA        NA        NA     NA
#  4: 2020-03-09    NA        NA        NA     NA
#  5: 2020-03-10    NA        NA        NA     NA
#  6: 2020-03-11    NA        NA        NA     NA
#  7: 2020-03-12    NA        NA        NA     NA
#  8: 2020-03-13    NA        NA        NA     NA
#  9: 2020-03-14    NA         1        NA     NA
# 10: 2020-03-15    NA         3        NA     NA
# 11: 2020-03-16    NA         5        NA     NA
# 12: 2020-03-17    NA        10        NA      1
# 13: 2020-03-18     1        12        NA      2
# 14: 2020-03-19     2        18        NA      2
# 15: 2020-03-20     5        28        NA      3
# 16: 2020-03-21     5        31        NA      3
# 17: 2020-03-22     5        40        NA      3
# 18: 2020-03-23     6        48        NA      3
# 19: 2020-03-24     6        58         1      3
# 20: 2020-03-25     6        88         1      7
# 21: 2020-03-26     7       133         1     13
#           date Adams Allegheny Armstrong Beaver

I want to fill the initial missing values with 0 (which is correct), but use LOCF to carry forward the most recent data wherever a later value is missing (not observed here...)

lapply(.SD, nafill, type = 'locf', fill = 0) seems natural enough.

jangorecki commented 4 years ago

AFAIR nafill is vectorized over columns, so there is no need to lapply over .SD. nafill(.SD) should be enough, and it also runs in parallel.
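
Until `fill` is honoured for `type = "locf"`, the same effect on a wide table can be had by nesting two `nafill` calls on `.SD`. The sketch below uses a small made-up table in place of the COVID data. The column names and numbers are purely illustrative, and the columns are doubles so that `fill = 0` matches their type.

```r
library(data.table)

# Toy stand-in for the wide case-count table (made-up numbers).
dt = data.table(
  day = 1:6,
  a   = c(NA, NA, 1, NA, 3, NA),
  b   = c(NA, 2, NA, NA, 5, 6)
)

cols = c("a", "b")

# Inner call: LOCF across all selected columns at once (no lapply needed).
# Outer call: default type = "const" fills the leading NAs that remain.
dt[, (cols) := nafill(nafill(.SD, type = "locf"), fill = 0), .SDcols = cols]

dt$a
# [1] 0 0 1 1 3 3
dt$b
# [1] 0 2 2 2 5 6
```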