dgrtwo / fuzzyjoin

Join tables together on inexact matching

Error: vector memory exhausted (limit reached?) #51

Open intheravine opened 5 years ago

intheravine commented 5 years ago

Error: vector memory exhausted (limit reached?)

I’m getting the above error when trying to stringdist_left_join two tables: the left table has 185K rows and the right table has 4.37M rows. The R session never appears to use more than 6GB of memory (according to Activity Monitor), while I’m on a machine with 32GB of memory and roughly 10GB still available when the vector memory exhausted error arises. I’ve followed various recommendations to increase R_MAX_VSIZE to a large number - 700GB, as shown in the Sys.getenv() output below. All this to say, it appears that stringdist_left_join does not pay attention to R_MAX_VSIZE. Is there some other setting I can change to use more of the available memory on my machine?

Sys.getenv()

Apple_PubSub_Socket_Render          /private/tmp/com.apple.launchd.sSrL33I64Z/Render
COLUMNS                             80
COMMAND_MODE                        unix2003
DISPLAY                             /private/tmp/com.apple.launchd.tTt2eLd6xQ/org.macosforge.xquartz:0
DYLD_FALLBACK_LIBRARY_PATH          /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
DYLD_LIBRARY_PATH                   /Library/Java/JavaVirtualMachines/jdk1.8.0_91.jdk/Contents/Home/jre/lib/server
EDITOR                              vi
HOME                                /Users/geoffreysnyder
LD_LIBRARY_PATH                     :@JAVA_LD@
LINES                               24
LN_S                                ln -s
LOGNAME                             geoffreysnyder
MAKE                                make
PAGER                               /usr/bin/less
PATH                                /usr/local/bin:/usr/local/mysql/bin:/usr/bin:/bin:/usr/sbin:/sbin:/opt/X11/bin:~/Library/Python/3.7/bin
PWD                                 /Users/geoffreysnyder/repos/Data_Load/code
R_ARCH                              
R_BROWSER                           /usr/bin/open
R_BZIPCMD                           /usr/bin/bzip2
R_DOC_DIR                           /Library/Frameworks/R.framework/Resources/doc
R_GZIPCMD                           /usr/bin/gzip
R_HOME                              /Library/Frameworks/R.framework/Resources
R_INCLUDE_DIR                       /Library/Frameworks/R.framework/Resources/include
R_LIBS_SITE                         
R_LIBS_USER                         ~/Library/R/3.5/library
R_MAX_VSIZE                         700GB
R_PAPERSIZE                         a4
R_PDFVIEWER                         /usr/bin/open
R_PLATFORM                          x86_64-apple-darwin15.6.0
R_PRINTCMD                          lpr
R_QPDF                              /Library/Frameworks/R.framework/Resources/bin/qpdf
R_RD4PDF                            times,inconsolata,hyper
R_SESSION_TMPDIR                    /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T//RtmpJdct7Y
R_SHARE_DIR                         /Library/Frameworks/R.framework/Resources/share
R_SYSTEM_ABI                        osx,gcc,gxx,gfortran,?
R_TEXI2DVICMD                       /usr/local/bin/texi2dvi
R_UNZIPCMD                          /usr/bin/unzip
R_ZIPCMD                            /usr/bin/zip
SECURITYSESSIONID                   186a8
SED                                 /usr/bin/sed
SHELL                               /bin/zsh
SHLVL                               0
SSH_AUTH_SOCK                       /private/tmp/com.apple.launchd.UNOOV1wxev/Listeners
SUBLIMEREPL_AC_IP                   127.0.0.1
SUBLIMEREPL_AC_PORT                 None
TAR                                 /usr/bin/tar
TMPDIR                              /var/folders/xw/402kc2hc8xl82d008k8x64f00000gq/T/
TZ                                  America/Los_Angeles
USER                                geoffreysnyder
XPC_FLAGS                           0x0
XPC_SERVICE_NAME                    0
__CF_USER_TEXT_ENCODING             0x1F7:0x0:0x0
sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.14.2

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2  RJDBC_0.2-7.1   rJava_0.9-10    DBI_1.0.0       fuzzyjoin_0.1.4 readr_1.2.0     dplyr_0.7.8    
[8] lubridate_1.7.4 stringr_1.3.1  

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       tidyr_0.8.2      assertthat_0.2.0 R6_2.3.0         magrittr_1.5     pillar_1.2.3    
 [7] rlang_0.3.0.1    stringi_1.2.4    tools_3.5.1      glue_1.3.0       purrr_0.2.5      hms_0.4.2.9000  
[13] compiler_3.5.1   pkgconfig_2.0.2  bindr_0.1.1      tidyselect_0.2.5 tibble_1.4.2    
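For anyone hitting this: the likely cause is that a stringdist join compares every left row against every right row, so 185K × 4.37M pairs must be materialized at some point regardless of R_MAX_VSIZE. A workaround is to chunk the left table so the distance matrix never exceeds chunk_size × nrow(right) cells. Below is a minimal base-R sketch of that idea, using utils::adist as a stand-in for the stringdist package and toy data (fuzzyjoin's internals differ; the function and column names here are made up for illustration):

```r
# Toy data standing in for the real 185K- and 4.37M-row tables.
left  <- data.frame(name = c("appl", "banana", "cherry", "grap"),
                    stringsAsFactors = FALSE)
right <- data.frame(name = c("apple", "banana", "grape", "kiwi"),
                    stringsAsFactors = FALSE)

# Chunked left join: peak memory is one chunk_size x nrow(right)
# distance matrix instead of nrow(left) x nrow(right).
chunked_fuzzy_left_join <- function(left, right, by, max_dist = 1,
                                    chunk_size = 1000) {
  chunks <- split(left, ceiling(seq_len(nrow(left)) / chunk_size))
  out <- lapply(chunks, function(chunk) {
    d <- utils::adist(chunk[[by]], right[[by]])  # edit-distance matrix
    rows <- lapply(seq_len(nrow(chunk)), function(i) {
      j <- which(d[i, ] <= max_dist)
      if (length(j) == 0) {
        # Left-join semantics: keep unmatched rows with NA on the right.
        data.frame(left_val = chunk[[by]][i], right_val = NA_character_,
                   dist = NA_integer_, stringsAsFactors = FALSE)
      } else {
        data.frame(left_val = chunk[[by]][i], right_val = right[[by]][j],
                   dist = d[i, j], stringsAsFactors = FALSE)
      }
    })
    do.call(rbind, rows)
  })
  do.call(rbind, out)
}

result <- chunked_fuzzy_left_join(left, right, by = "name", chunk_size = 2)
print(result)
```

With chunk_size tuned to available RAM, this trades a single huge allocation for many small ones at the cost of some loop overhead.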
markbneal commented 4 years ago

An observation from my experience: I was doing a fuzzy join and ran out of RAM, even though the largest dataframe was only 200,000 rows. I subsetted the two dataframes by a common identifier, looped across the list of identifiers, and did the fuzzy join for each pair of subsets - this ran very quickly. Maybe someone could check the efficiency of the code on larger examples? I'm assuming making a reprex for big-data examples is a hassle.
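A base-R sketch of the blocking strategy described above, with a hypothetical shared "region" column as the common identifier and utils::adist standing in for stringdist (names and data are made up): fuzzy comparisons happen only within each block, so the pair count drops from nrow(a) * nrow(b) to the sum over blocks of nrow(a_k) * nrow(b_k).

```r
# Toy tables sharing an exact key ("region") plus a fuzzy key ("name").
a <- data.frame(region = c("N", "N", "S"),
                name   = c("appl", "pear", "grap"),
                stringsAsFactors = FALSE)
b <- data.frame(region = c("N", "S", "S"),
                name   = c("apple", "grape", "peach"),
                stringsAsFactors = FALSE)

# Fuzzy-match within each block of the exact key, then rbind the pieces.
blocked_match <- function(a, b, block, by, max_dist = 1) {
  keys <- intersect(unique(a[[block]]), unique(b[[block]]))
  out <- lapply(keys, function(k) {
    ak <- a[a[[block]] == k, , drop = FALSE]
    bk <- b[b[[block]] == k, , drop = FALSE]
    d  <- utils::adist(ak[[by]], bk[[by]])    # small per-block matrix
    hits <- which(d <= max_dist, arr.ind = TRUE)
    if (nrow(hits) == 0) return(NULL)
    data.frame(region = k,
               a_name = ak[[by]][hits[, 1]],
               b_name = bk[[by]][hits[, 2]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

res <- blocked_match(a, b, block = "region", by = "name")
print(res)
```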

aranryan commented 3 years ago

Similar to markbneal above, I was doing my first fuzzy join and ran into a vector memory exhausted error. I was doing it through a purrr::map step, joining a dataframe of about 50,000 rows onto individual rows of a 5,000-row dataframe. My solution was to rewrite it as a for loop.
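The row-at-a-time rewrite might look roughly like the base-R sketch below (again with utils::adist as a stand-in for stringdist and invented toy data): each iteration allocates only one distance vector of length nrow(right), so peak memory stays flat no matter how large the left table is.

```r
# Toy vectors standing in for the 5,000- and 50,000-row tables.
left_names  <- c("appl", "banana", "grap")
right_names <- c("apple", "banana", "grape", "kiwi")

# One left row per iteration: only a length-4 distance vector lives
# in memory at any time, never the full distance matrix.
results <- vector("list", length(left_names))
for (i in seq_along(left_names)) {
  d <- utils::adist(left_names[i], right_names)[1, ]
  j <- which(d <= 1)
  if (length(j) > 0) {
    results[[i]] <- data.frame(left  = left_names[i],
                               right = right_names[j],
                               dist  = d[j],
                               stringsAsFactors = FALSE)
  }
}
matched <- do.call(rbind, results)
print(matched)
```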

Erinaceida commented 3 years ago

Very similar here: I was doing a fuzzy_join of a 43MB file to a 68KB one, and at its peak R used 12GB of RAM (almost 300 times the size of the individual objects!).