[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist)

arunsrinivasan commented 10 years ago

Submitted by: Patrick Nicholson; Assigned to: Nobody; R-Forge link

Many large datasets are split into multiple tables, especially when they are release as flat files. Many datasets that track a lot of variables over time are released as separate files for separate periods. It is useful to write a quick wrapper to read these files into a list:

tabs <- lapply(dir(), function(file) as.data.table(read.csv(file)))

If we were interested in appending the tables in this list, data.table provides a very fast and useful function, rbindlist. Similar functionality exists in SAS DATA steps when you can list multiple datasets in the SET statement to append them. However, SAS also allows you to list many tables in a merge statement.* Without a BY variable, this amounts to do.call("cbind", ...) in R. But with a BY variable....

I am proposing a function that would merge data.tables contained within a list. This would So that the following is possible:

tabs <- lapply(dir(), function(file) data.table(read.csv(file), key="primary_key")) data <- do.call("[", tabs) or data <- mergelist(tabs)

This would be a killer feature. It does not exist elsewhere in R, as far as I can tell. It would allow data.table code to be more concise and require less updating. (Think about going from creating t2011, t2012, t2013... and merging them with t2011[t2012[t2013.... Now think about higher frequency data!) It would also take a bullet out of SAS's gun.

This is the comparable SAS approach if it helps:

%macro readfiles(number_of_files); %do i=1 %to &number_of_files; proc import out=imported&i. datafile="C:/file&number_of_files..csv" dbms=csv; run; %end; %mend;

data merged; set imported:; by primary_key; run;

proc datasets library=work nolist; delete imported:; run;

mattdowle commented 10 years ago

Also similar to #694 .

abnova commented 10 years ago

Hello, Matt! Thank you for inviting me here. First of all, I apologize for using an abbreviation without first defining it - it's not my style, I was just trying to save some space in the comment area. Having said that, I can tell that my abbreviation SEM refers to structural equation modeling, a very popular set of quantitative research methods, based on multivariate statistical analysis (http://en.wikipedia.org/wiki/Structural_equation_modeling).

As matrices represent a foundation of SEM models, functions, implementing SEM analysis methods, operate on matrices or corresponding data frames. These data structures often have to be constructed from several underlying matrices or data frames, which represent certain variables or indicators in a SEM model. For example, consider my very simple test module in R, which attempts to perform a PLS-PM analysis (a variant of SEM analysis) on a test dataset: https://github.com/abnova/diss-floss/blob/master/analysis/sem-models/floss-success-v2-flat/floss-success-plspm.R. You see that I have to merge several data frames, each representing a particular indicator in SEM model, into a summary data frame, which I then pass to the SEM analysis function plspm() on line 211.

I hope that my explanation is clear enough to get an idea about a scenario, that requires merging multiple data frames or data tables (the corresponding number of indicators in SEM models may be significant for large and complex models, making manual approach to merge not feasible). Your questions or comments are welcome!

mattdowle commented 10 years ago

@abnova That's very useful info, thanks. Does the joinbyv function in the pull request (#694) from @jangorecki work for this? And is that the same as the mergelist() from Patrick above as well? I mentioned in a comment on #694 that it'd be nice to tie this up with secondary keys as well to avoiding needing to setkey() each data.table input.

abnova commented 10 years ago

@mattdowle You're welcome! I will take a look at functions and issues you're referring to - can't say anything now, as I'm not familiar with those. It might take some time, especially considering my lack of knowledge of data.table codebase and even its user-level functionality. But, I will be in touch eventually.

geneorama commented 9 years ago

This would be awesome. I often am reconciling data across several data.frames.

jangorecki commented 9 years ago

Just to clarify, #694 code is slightly outdated. The latest version of joinbyv is available here: https://github.com/jangorecki/dwtools/blob/master/R/joinbyv.R It should not have any other depedency than data.table so you should be able just to run the joinbyv.R code and use function directly without the need of installing dwtools. In case if you want to read data from csv files or database it could be even combined with db function from dwtools.

MichaelChirico commented 5 years ago

I guess #2576 should be closed at the same time.

Like rbind(...) builds an rbindlist call, mergelist build a cbindlist call?

jangorecki commented 5 years ago

no, mergelist would iterate joins, cbind just stack datasets