Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.6k stars 979 forks source link

join vignette #2181

Open lilchow opened 7 years ago

lilchow commented 7 years ago

Several places in the available vignettes refer to this mysterious vignette about join and rolling join. When will it be up? Thanks

franknarf1 commented 7 years ago

See #944 . There are many ideas for vignettes there.

If you want some examples, you could see stackoverflow.com or maybe my notes.

arunsrinivasan commented 7 years ago

Yes, there's no time frame for this currently. See the #944 issue for progress.

batcheneden commented 7 years ago

Hi,

So..does this "joins & rolling joins" vignette exist..? since it's referred to in many of the existing (very helpful!) vignettes...but doesn't exist when searched for. need it! ...help! image

MichaelChirico commented 7 years ago

@batcheneden please read above and follow #944 for updates.

batcheneden commented 7 years ago

@MichaelChirico thank you- Just wanted to make sure indeed there isn't an existing one , despite the reference in existing vignettes.

thanks!

jangorecki commented 5 years ago

As #944 is pretty epic issue, I would like to re-open this one, asking for joins vignette only. I've seen people asking for it also on twitter, lets try to have it for 1.13.0 then.

MichaelChirico commented 5 years ago

Noting for reference: joins vignette would be a good place to have an example of replacing nested ifelse with a join.

jangorecki commented 4 years ago

summarizing the scope

zeomal commented 4 years ago

Copied from #944.

I have a question on joining by reference, while preparing the vignettes. The X[Y, new_col := old_col] performs something similar to a traditional left join on X. However, if there are multiple matches to Y's keys in X, only the last (or first?) matching value of the key is retained. ~Is this explicitly documented somewhere? I had tried searching for this back when I encountered it, but had to resort to my understanding of updating by reference for the reason~ Documented in set. For a reproducible example,

> X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
> Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))
> X[Y, new_col := i.n, on = "a == b"]
   a m new_col
1: 1 a       y
2: 2 b    <NA>
3: 3 c    <NA>

# an ideal left join - expected behaviour per a new user, given below
# not possible because updating row by reference isn't implemented
   a m new_col
1: 1 a       x
1: 1 a       y
2: 2 b    <NA>
3: 3 c    <NA>

This is expected behaviour, but isn't exactly straightforward for a new user. mult does not impact the output either. Any suggestions on how I mention this in the vignette? Add merge as a workaround for a proper left join?

zeomal commented 4 years ago

@jangorecki, given that #3453 is being prepared where a detailed overview of rolling joins is being covered by @Henrik-P, would it make sense to add separate vignettes for equi- and non equi- joins, as I believe the latter is far more relevant for time series analysis? The content of both vignettes at the moment will be significant given your scope above.

jangorecki commented 4 years ago

For Joins vignette:

https://github.com/Rdatatable/data.table/issues/2396

Originally posted by @MichaelChirico in https://github.com/Rdatatable/data.table/issues/944#issuecomment-521710383

jangorecki commented 4 years ago

@zeomal better to have 2 bigger vignettes, than 3 smaller IMO. We already have many vignettes.

zeomal commented 4 years ago

@jangorecki, I've created a draft pull request for this vignette. It's a first version, bound to have many changes, but covers the basics of equi-joins. This is my first pull request ever, so if I've done something wrong, please correct me.

jangorecki commented 4 years ago

mult usage can nicely present temporal joins, like:

zeomal commented 3 years ago

@jangorecki it's been a while since the mergelist and changes to nomatch have been there on the milestones, so I thought I might go ahead an complete the vignettes, at least from an equi-join standpoint. Currently, I'm using the nycflights13 package , but it looks like that isn't present on travis-ci and thus the build is failing. You also mentioned to use a known dataset - nycflights13 fits that criteria, yeah? Is there some alternate way I can link to the package or data that doesn't involve loading it (I'm not sure linking directly to the CSV files on Github is appropriate)?

jangorecki commented 3 years ago

@zeomal we have flights dataset in vignettes directory, see https://github.com/Rdatatable/data.table/blob/7aa22ee6b245b9308352acd66384373a99376c13/vignettes/datatable-intro.Rmd#L50-L55 for examples of usage.

zeomal commented 3 years ago

@jangorecki correct me if I'm wrong - I think that this includes only the flights dataset of the series. That is, airlines, planes, airports, weather are missing - which would be needed to demonstrate joins. I don't see any of the other files here either, so let me know if I'm missing something. The only other dataset I might be able to use is the Lahman one, but present as RData files in the same location (used in the .SD vignette).

jangorecki commented 3 years ago

@zeomal you are absolutely correct.

zeomal commented 3 years ago

So... my question is, what should I use as a data source for the vignette that will be feasible? @jangorecki

jangorecki commented 3 years ago

No worries, I am not replying because I don't have any valuable answer to questions asked, at the moment. Choosing datasets for presenting join cases is very complex task. I don't want to make recommendation without first properly investigating the problem. There are many factors to consider and it is not easy to construct dummy datasets that will cover them well, so choosing real datasets is even more tricky. Joining should present (of course) matches, some matches, zero matches, duplicated matches (many-to-many join), rolling matches, mult arg, overlapping, non-equi, multi col, matches on NA.... Complete list of features to cover could be really long.

Personally I think that using multiple different datasets (each 10-20 rows to be visually easier to grasp) in this single vignette could be more appropriate but do not take that as a suggestion to write this vignette now like this. I would like to hear from @mattdowle and @arunsrinivasan as they are authors of (AFAIR) all join scenarios in data.table.

GitHunter0 commented 3 years ago

Kind of funny this mysterious vignette, I'm really looking forward to it too

avimallu commented 3 years ago

@mattdowle, sorry about the second ping within two messages, wanted to get your thoughts on how I should proceed with the join vignettes. All the data.table vignettes helped me get up to speed quickly, except for joins (information for which is scattered across multiple Stackoverflow answers and FAQs).

The most recent PR (now closed) #4398 used nycflights data and requires installation on Travis and/or Appveyor, and also a conversion from tibble to data.table. Will you be able to comment on the kind of data to contain? I'm keen on preparing this vignette.

dvg-p4 commented 2 years ago

Are there any plans to actually write this vignette, or otherwise document how joins work in this package?

avimallu commented 2 years ago

Are there any plans to actually write this vignette, or otherwise document how joins work in this package?

Joins are well documented in the FAQ and in the documentation itself. It's not all in one place, however.

jangorecki commented 2 years ago

By "documentation" @avimallu meant package manual. Aside from what @avimallu mentioned I could add stackoverflow as well. I am pretty sure join vignette will be added at some point. We not only want to document joins there, there is a manual for that already, but we would like to have a nice guided story that goes through different join scenarios that still will be easy to grasp and not overwhelming. Writing such a good vignette is not that simple.

MichaelChirico commented 2 years ago

Agree with Jan -- it's one thing to have simple snippets in Rd pages, constructing a coherent narrative with compelling (and publicly available!) data is a different beast. Contributions definitely welcome here.

If there's any piece you think is missing (edge cases, certain parameters/their interaction, etc) from the Rd docs, please flag that and it'll be easier to fix.

JaimeArboleda commented 2 years ago

Hello!! I don't know if this is the most recent issue commenting this missing vignette. But I landed on data.table a couple of days ago and I wanted to learn the framework thoroughly following the vignettes in order. I was sad finding out that this vignette does not exist yet.

By the way, I wanted to thank you all for this package. I come from Python (I am very used to pandas), and lately I started using R and I was enjoying the tidyverse approach (which is the most usually taught). But when I started with this library I was sooooooo blown away by it. I mean, it's just amazingly good: even if it were not the fastest library ever, I would be using it just for the clever syntax. I cannot emphasize more how much I love it!!! So thanks a lot for this wonderful creation.

If I could be of any help with this vignette that would be great. Is there anyone working on it? Is there any branch with a draft of the vignette? I have no idea...

jangorecki commented 2 years ago

Thank you for warm words. My impression about DT was quite similar when I arrived to it :) top speed and low memory are just nice bonus to the best syntax.

As for learning joins, you can go through the list of join features mentioned in this issue, and look it up in ?data.table manual and stackoverflow. There was a draft of join vignette, or maybe even two, but they were far from complete, so I doubt if the one will succeed as vignette ultimately.

JaimeArboleda commented 2 years ago

Yeah, but to be honest it took me some time to decide to invest in it, because my wrong impression, created by many shared opinions in blogs and discussions forums, was that the syntax was ugly and difficult to understand. And things like the mere existence of tidytable reinforce this idea (somewhat recognizing that data.table can be improved wity tidy syntax).

I mean, I think it's good that both syntax approaches exist (specially, being so orthogonal), and that different people can use R they way they prefer. The only think that makes me sad is that I feel data.table is underpromoted and has an undeserved aura of obscurity. At least, that was my perception.

Thanks a lot for your suggestion. I will start with your approach and hopefully I will be able to understand it.

tdhock commented 1 year ago

+1 I found it confusing that this vignette is mentioned in datatable-intro, but can not be found/read. Is there another reference that we can use for teaching people how to do joins?

AngelFelizR commented 11 months ago

@jangorecki, Could I use the Taylor Swift Tidytuesday dataset to create the vignette?

I can explain what I learnt in the Joining Data with data.table in R Datacamp course

waynelapierre commented 11 months ago

Cannot believe that this issue has been around for so long. This is actually a bug. It is better not to mention it at all.

jangorecki commented 11 months ago

There are already drafts or work in progress of this vignette, IIRC 2 or even 3, so probably it will be good place to start from rather than adding the next one.

avimallu commented 11 months ago

@AngelFelizR, take a look at https://github.com/Rdatatable/data.table/pull/4398 for inspiration and which issues the join vignette could close.

AngelFelizR commented 11 months ago

Thanks @avimallu, I will work to have a first draft by 2023-11-27

dvg-p4 commented 11 months ago

I've found https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29 to be quite helpful as well. Though since that's on a personal blog you'll probably want to contact the author for permission if you wanted to copy from it for the vignette.

AngelFelizR commented 10 months ago

After reading all the comments related to this issue, I found out that the vignette must be created with simulated data. This approach will demonstrate how to use the package in various situations, from using a short data.table of 5 rows to avoiding unnecessary dependencies. It’s important to keep the story from becoming overwhelming.

Here is the basic structure that I will be creating:

  1. merge function
    • Inner join
    • Right join
    • Replacing nested ifelse with a join
    • Left join
    • Full join
    • Different col names
    • Many to many join (allow.cartesian)
  2. data.table syntax joins
    • Right join
    • Keyed joins
    • Natural join
    • Update x on join
    • Aggregate on join (by = .EACHI or by = x's columns)
    • Editing x based on i matching columns by x groups ()
    • i. and x. j's prefixes
    • Inner join
    • Not join
    • Many to many join
    • allow.cartesian
    • mult
    • Non-equi join (>=, >, <=, <)
    • Rolling join
    • Semi join
    • Cross join
  3. Merging many tables (Reduce(merge, list(DT1,DT2,DT3,...)))

Please let me know if I am missing something.

tdhock commented 10 months ago

I used the flights data to explain joins, in my slides for the data.table tutorial at the LatinR meeting last month, https://github.com/tdhock/2023-10-LatinR-data.table#english

jangorecki commented 10 months ago

mergelist PR is ready to merge so probably will land in master before the vignette, so should be included as well

foverlaps is missing

avimallu commented 10 months ago

I think we should avoid the merge function other than as a side note. One of data.table's strengths is its merge syntax, and that is what should be front, right and centre.

In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.

AngelFelizR commented 10 months ago

@avimallu

I think we should avoid the merge function other than as a side note. One of data.table's strengths is its merge syntax, and that is what should be front, right and centre.

I started the vignette with the merge function as is easier to understand for new users. In my case is normal to use many merge function in chain following the next syntax as is the only way to apply left join.

DtMerged <-
DT1[...
][, merge(.SD, DT2, by = "x", all.x = TRUE)
][, merge(.SD, DT3, by = "y", all.x = TRUE)]

What I could do is to move the mergelist from point 3 to point 2 to avoid the switch from function to syntax.

In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.

I thought that overlap join is an application of non-equi join.

jangorecki commented 10 months ago

overlapping join (foverlaps api) is a special case of non-equi join ([.data.table api).

I second @avimallu suggestion about dropping merge. Mentioning it and linking manual sounds good, but it is sub-optimal in performance and in features so we should try not to onboard users into it. mergelist is good substitute.

AngelFelizR commented 10 months ago

Here is the new structure

  1. data.table syntax joins
    • Right join
    • Left join (custom function)
    • Keyed joins
    • Natural join
    • Update x on join
    • Aggregate on join (by = .EACHI or by = x's columns)
    • Editing x based on i matching columns by x groups ()
    • i. and x. j's prefixes
    • Inner join
    • Not join
    • Many to many join
    • allow.cartesian
    • mult
    • Non-equi join (>=, >, <=, <)
    • Rolling join
    • Semi join
    • Cross join
  2. Joining a list of tables (mergelist)
  3. Full join (by using merge and link to documentation)
jangorecki commented 10 months ago

nb. mergelist supports full join as well, probably much more efficient than merge

AngelFelizR commented 10 months ago

nb. mergelist supports full join as well, probably much more efficient than merge

That's sounds really good.

So the will the vignette's structure:

  1. data.table syntax joins
    • Right join
    • Left join (custom function)
    • Keyed joins
    • Natural join
    • Update x on join
    • Aggregate on join (by = .EACHI or by = x's columns)
    • Editing x based on i matching columns by x groups ()
    • i. and x. j's prefixes
    • Inner join
    • Not join
    • Many to many join
    • allow.cartesian
    • mult
    • Non-equi join (>=, >, <=, <)
    • Rolling join
    • Semi join
    • Cross join
  2. Full join and joining a list of tables (mergelist)

Link to merge function documentation

jangorecki commented 10 months ago

Moreover, mergelist has more mult options as far as I recall.

Aggregate on join is not possible by x's column yet. Only by each I.