Open lilchow opened 7 years ago
See #944 . There are many ideas for vignettes there.
If you want some examples, you could see stackoverflow.com or maybe my notes.
Yes, there's no time frame for this currently. See the #944 issue for progress.
Hi,
So..does this "joins & rolling joins" vignette exist..? since it's referred to in many of the existing (very helpful!) vignettes...but doesn't exist when searched for. need it! ...help!
@batcheneden please read above and follow #944 for updates.
@MichaelChirico thank you- Just wanted to make sure indeed there isn't an existing one , despite the reference in existing vignettes.
thanks!
As #944 is pretty epic issue, I would like to re-open this one, asking for joins vignette only. I've seen people asking for it also on twitter, lets try to have it for 1.13.0 then.
Noting for reference: joins vignette would be a good place to have an example of replacing nested ifelse
with a join.
summarizing the scope
on
fields)[f]ifelse
and fcase
, and lazyness of the latter one)on
or with key, or pre-computing index - explain it is good to setkey/setindex if you are repeatedly joiningCopied from #944.
I have a question on joining by reference, while preparing the vignettes. The X[Y, new_col := old_col]
performs something similar to a traditional left join on X. However, if there are multiple matches to Y
's keys in X
, only the last (or first?) matching value of the key is retained. ~Is this explicitly documented somewhere? I had tried searching for this back when I encountered it, but had to resort to my understanding of updating by reference for the reason~ Documented in set
. For a reproducible example,
> X = data.table(a = c(1, 2, 3), m = c("a", "b", "c"))
> Y = data.table(b = c(1, 1, 4), n = c("x", "y", "z"))
> X[Y, new_col := i.n, on = "a == b"]
a m new_col
1: 1 a y
2: 2 b <NA>
3: 3 c <NA>
# an ideal left join - expected behaviour per a new user, given below
# not possible because updating row by reference isn't implemented
a m new_col
1: 1 a x
1: 1 a y
2: 2 b <NA>
3: 3 c <NA>
This is expected behaviour, but isn't exactly straightforward for a new user. mult
does not impact the output either. Any suggestions on how I mention this in the vignette? Add merge as a workaround for a proper left join?
@jangorecki, given that #3453 is being prepared where a detailed overview of rolling joins is being covered by @Henrik-P, would it make sense to add separate vignettes for equi- and non equi- joins, as I believe the latter is far more relevant for time series analysis? The content of both vignettes at the moment will be significant given your scope above.
For Joins vignette:
https://github.com/Rdatatable/data.table/issues/2396
Originally posted by @MichaelChirico in https://github.com/Rdatatable/data.table/issues/944#issuecomment-521710383
@zeomal better to have 2 bigger vignettes, than 3 smaller IMO. We already have many vignettes.
@jangorecki, I've created a draft pull request for this vignette. It's a first version, bound to have many changes, but covers the basics of equi-joins. This is my first pull request ever, so if I've done something wrong, please correct me.
mult
usage can nicely present temporal joins, like:
setkey(customer_addresses, customer_id, address_date)
customer_addresses[customer, mult="last", on="customer_id"]
customer_addresses[address_date<as.IDate("2020-01-01")
][customer, mult="last", on="customer_id"]
Probably better to use some know dataset, like airlines, whenever possible.
@jangorecki it's been a while since the mergelist
and changes to nomatch
have been there on the milestones, so I thought I might go ahead an complete the vignettes, at least from an equi-join standpoint. Currently, I'm using the nycflights13
package , but it looks like that isn't present on travis-ci and thus the build is failing. You also mentioned to use a known dataset - nycflights13
fits that criteria, yeah? Is there some alternate way I can link to the package or data that doesn't involve loading it (I'm not sure linking directly to the CSV files on Github is appropriate)?
@zeomal we have flights dataset in vignettes directory, see https://github.com/Rdatatable/data.table/blob/7aa22ee6b245b9308352acd66384373a99376c13/vignettes/datatable-intro.Rmd#L50-L55 for examples of usage.
@jangorecki correct me if I'm wrong - I think that this includes only the flights
dataset of the series. That is, airlines
, planes
, airports
, weather
are missing - which would be needed to demonstrate joins. I don't see any of the other files here either, so let me know if I'm missing something. The only other dataset I might be able to use is the Lahman one, but present as RData files in the same location (used in the .SD vignette).
@zeomal you are absolutely correct.
So... my question is, what should I use as a data source for the vignette that will be feasible? @jangorecki
No worries, I am not replying because I don't have any valuable answer to questions asked, at the moment. Choosing datasets for presenting join cases is very complex task. I don't want to make recommendation without first properly investigating the problem. There are many factors to consider and it is not easy to construct dummy datasets that will cover them well, so choosing real datasets is even more tricky. Joining should present (of course) matches, some matches, zero matches, duplicated matches (many-to-many join), rolling matches, mult arg, overlapping, non-equi, multi col, matches on NA.... Complete list of features to cover could be really long.
Personally I think that using multiple different datasets (each 10-20 rows to be visually easier to grasp) in this single vignette could be more appropriate but do not take that as a suggestion to write this vignette now like this. I would like to hear from @mattdowle and @arunsrinivasan as they are authors of (AFAIR) all join scenarios in data.table.
Kind of funny this mysterious vignette, I'm really looking forward to it too
@mattdowle, sorry about the second ping within two messages, wanted to get your thoughts on how I should proceed with the join vignettes. All the data.table
vignettes helped me get up to speed quickly, except for joins (information for which is scattered across multiple Stackoverflow answers and FAQs).
The most recent PR (now closed) #4398 used nycflights
data and requires installation on Travis and/or Appveyor, and also a conversion from tibble
to data.table
. Will you be able to comment on the kind of data to contain? I'm keen on preparing this vignette.
Are there any plans to actually write this vignette, or otherwise document how joins work in this package?
Are there any plans to actually write this vignette, or otherwise document how joins work in this package?
Joins are well documented in the FAQ and in the documentation itself. It's not all in one place, however.
By "documentation" @avimallu meant package manual. Aside from what @avimallu mentioned I could add stackoverflow as well. I am pretty sure join vignette will be added at some point. We not only want to document joins there, there is a manual for that already, but we would like to have a nice guided story that goes through different join scenarios that still will be easy to grasp and not overwhelming. Writing such a good vignette is not that simple.
Agree with Jan -- it's one thing to have simple snippets in Rd pages, constructing a coherent narrative with compelling (and publicly available!) data is a different beast. Contributions definitely welcome here.
If there's any piece you think is missing (edge cases, certain parameters/their interaction, etc) from the Rd docs, please flag that and it'll be easier to fix.
Hello!! I don't know if this is the most recent issue commenting this missing vignette. But I landed on data.table a couple of days ago and I wanted to learn the framework thoroughly following the vignettes in order. I was sad finding out that this vignette does not exist yet.
By the way, I wanted to thank you all for this package. I come from Python (I am very used to pandas), and lately I started using R and I was enjoying the tidyverse approach (which is the most usually taught). But when I started with this library I was sooooooo blown away by it. I mean, it's just amazingly good: even if it were not the fastest library ever, I would be using it just for the clever syntax. I cannot emphasize more how much I love it!!! So thanks a lot for this wonderful creation.
If I could be of any help with this vignette that would be great. Is there anyone working on it? Is there any branch with a draft of the vignette? I have no idea...
Thank you for warm words. My impression about DT was quite similar when I arrived to it :) top speed and low memory are just nice bonus to the best syntax.
As for learning joins, you can go through the list of join features mentioned in this issue, and look it up in ?data.table manual and stackoverflow. There was a draft of join vignette, or maybe even two, but they were far from complete, so I doubt if the one will succeed as vignette ultimately.
Yeah, but to be honest it took me some time to decide to invest in it, because my wrong impression, created by many shared opinions in blogs and discussions forums, was that the syntax was ugly and difficult to understand. And things like the mere existence of tidytable
reinforce this idea (somewhat recognizing that data.table can be improved wity tidy
syntax).
I mean, I think it's good that both syntax approaches exist (specially, being so orthogonal), and that different people can use R they way they prefer. The only think that makes me sad is that I feel data.table is underpromoted and has an undeserved aura of obscurity. At least, that was my perception.
Thanks a lot for your suggestion. I will start with your approach and hopefully I will be able to understand it.
+1 I found it confusing that this vignette is mentioned in datatable-intro, but can not be found/read. Is there another reference that we can use for teaching people how to do joins?
@jangorecki, Could I use the Taylor Swift Tidytuesday dataset to create the vignette?
I can explain what I learnt in the Joining Data with data.table in R Datacamp course
Cannot believe that this issue has been around for so long. This is actually a bug. It is better not to mention it at all.
There are already drafts or work in progress of this vignette, IIRC 2 or even 3, so probably it will be good place to start from rather than adding the next one.
@AngelFelizR, take a look at https://github.com/Rdatatable/data.table/pull/4398 for inspiration and which issues the join vignette could close.
Thanks @avimallu, I will work to have a first draft by 2023-11-27
I've found https://medium.com/analytics-vidhya/r-data-table-joins-48f00b46ce29 to be quite helpful as well. Though since that's on a personal blog you'll probably want to contact the author for permission if you wanted to copy from it for the vignette.
After reading all the comments related to this issue, I found out that the vignette must be created with simulated data. This approach will demonstrate how to use the package in various situations, from using a short data.table of 5 rows to avoiding unnecessary dependencies. It’s important to keep the story from becoming overwhelming.
Here is the basic structure that I will be creating:
merge
function
data.table
syntax joins
x
on joinx
based on i
matching columns by x
groups ()i.
and x.
j's prefixesPlease let me know if I am missing something.
I used the flights data to explain joins, in my slides for the data.table tutorial at the LatinR meeting last month, https://github.com/tdhock/2023-10-LatinR-data.table#english
mergelist PR is ready to merge so probably will land in master before the vignette, so should be included as well
foverlaps is missing
I think we should avoid the merge function other than as a side note. One of data.table
's strengths is its merge syntax, and that is what should be front, right and centre.
In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.
@avimallu
I think we should avoid the merge function other than as a side note. One of
data.table
's strengths is its merge syntax, and that is what should be front, right and centre.
I started the vignette with the merge function as is easier to understand for new users. In my case is normal to use many merge function in chain following the next syntax as is the only way to apply left join.
DtMerged <-
DT1[...
][, merge(.SD, DT2, by = "x", all.x = TRUE)
][, merge(.SD, DT3, by = "y", all.x = TRUE)]
What I could do is to move the mergelist from point 3 to point 2 to avoid the switch from function to syntax.
In addition, the overlap join functions have a separate syntax, it might be worth placing all syntactically similar joins together to have them all in one place.
I thought that overlap join is an application of non-equi join.
overlapping join (foverlaps
api) is a special case of non-equi join ([.data.table
api).
I second @avimallu suggestion about dropping merge
. Mentioning it and linking manual sounds good, but it is sub-optimal in performance and in features so we should try not to onboard users into it. mergelist
is good substitute.
Here is the new structure
data.table
syntax joins
x
on joinx
based on i
matching columns by x
groups ()i.
and x.
j's prefixesmergelist
)merge
and link to documentation)nb. mergelist supports full join as well, probably much more efficient than merge
nb. mergelist supports full join as well, probably much more efficient than merge
That's sounds really good.
So the will the vignette's structure:
data.table
syntax joins
x
on joinx
based on i
matching columns by x
groups ()i.
and x.
j's prefixesmergelist
)Link to merge
function documentation
Moreover, mergelist has more mult options as far as I recall.
Aggregate on join is not possible by x's column yet. Only by each I.
Several places in the available vignettes refer to this mysterious vignette about join and rolling join. When will it be up? Thanks