intheravine opened 5 years ago
An observation from my experience: I was doing a fuzzy join and ran out of RAM, even though the largest dataframe was only 200,000 rows. I subsetted the two dataframes by a common identifier, did the fuzzy join for each subset, and looped across the list of identifiers - this ran very quickly (see the sketch below). Maybe someone could check the efficiency of the code on larger examples? I'm assuming making a reprex for big-data examples is a hassle.
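Roughly, the approach looked like this (a minimal sketch; the `id` and `name` column names and the `max_dist` value are placeholders, not my actual data):

```r
library(fuzzyjoin)
library(dplyr)

# Sketch of the split-by-identifier workaround: each fuzzy join only compares
# rows that share an identifier, so the set of candidate pairs stays small.
# Column names "id"/"name" and max_dist are placeholders.
join_by_block <- function(left, right, id_col = "id") {
  ids <- intersect(unique(left[[id_col]]), unique(right[[id_col]]))
  results <- vector("list", length(ids))
  for (i in seq_along(ids)) {
    left_sub  <- left[left[[id_col]] == ids[i], , drop = FALSE]
    right_sub <- right[right[[id_col]] == ids[i], , drop = FALSE]
    results[[i]] <- stringdist_left_join(left_sub, right_sub,
                                         by = "name", max_dist = 2)
  }
  bind_rows(results)
}
```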
Similar to markbneal above, I was doing my first fuzzy join and ran into a vector memory exhausted error. I was doing it through a purrr::map step, joining a dataframe of about 50,000 rows onto individual rows of a dataframe with 5,000 rows. My solution was to rewrite it as a for loop.
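The for-loop version looked roughly like this (a sketch; `small_df`, `large_df`, the `key` column, and `max_dist` are placeholders):

```r
library(fuzzyjoin)
library(dplyr)

# Loop over the rows of the small table and join the large table onto one row
# at a time, so only one intermediate join result is in flight per iteration.
# All object and column names here are placeholders.
out <- vector("list", nrow(small_df))
for (i in seq_len(nrow(small_df))) {
  out[[i]] <- stringdist_left_join(small_df[i, , drop = FALSE], large_df,
                                   by = "key", max_dist = 2)
}
result <- bind_rows(out)
```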
Very similar here: I was doing a fuzzy_join of a 43 MB file to a 68 KB one, and at its peak R used 12 GB of RAM (almost 300 times the size of the individual objects!).
Error: vector memory exhausted (limit reached?)
I'm getting the above error when trying to stringdist_left_join two tables - the left table has 185K rows and the right table has 4.37M rows. The R session never appears to use more than 6GB of memory (according to Activity Monitor), while I'm on a machine with 32GB of memory and roughly 10GB available when the vector memory exhausted error arises. I've followed various recommendations to increase R_MAX_VSIZE to a large number - 700GB, as shown in the Sys.getenv() output below. All this to say, it appears that stringdist_left_join does not pay attention to R_MAX_VSIZE. Is there some other setting I can change to use more of the available memory on my machine?
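One workaround in the spirit of the earlier comments would be to slice the left table and join each slice against the right table separately, so no single call has to hold all candidate pairs at once. This is only a sketch; `left_tbl`, `right_tbl`, the `name` column, the chunk size, and `max_dist` are placeholders, not the asker's actual data:

```r
library(fuzzyjoin)
library(dplyr)

# Process the 185K-row left table in slices so each stringdist_left_join call
# compares only a chunk of rows against the 4.37M-row right table.
# All names and tuning values below are placeholders.
chunk_size <- 5000
starts <- seq(1, nrow(left_tbl), by = chunk_size)
pieces <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(left_tbl))
  stringdist_left_join(left_tbl[rows, , drop = FALSE], right_tbl,
                       by = "name", max_dist = 1)
})
result <- bind_rows(pieces)
```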