Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fread feather #2026

Open mattdowle opened 7 years ago

mattdowle commented 7 years ago

As suggested here, to avoid needing to use or wrap with setDT: https://twitter.com/bennetvoorhees/status/830070242659414016 (I guess that rio returns a data.frame or tibble, so making fread do it is perhaps clearer, as people use fread to return a data.table.)

MichaelChirico commented 5 years ago

I think a separate package like rio is better suited for this -- minimally, a package of simple wrappers to translate between data.table and other non-CSV data formats (e.g. Parquet, https://github.com/Rdatatable/data.table/issues/2505). It's worth considering building such a package, perhaps as an add-on within the Rdatatable org, but not in core data.table.
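
Minimally, such a package could just export thin wrappers along these lines (a sketch, not a committed API; fread_parquet is a hypothetical name and arrow is assumed to be installed):

fread_parquet <- function(file, ...) {
  df <- arrow::read_parquet(file, ...)  # arrow returns a tibble/data.frame
  data.table::setDT(df)[]               # convert by reference, return visibly
}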

Of course feel free to re-open if you think otherwise @mattdowle :)

mattdowle commented 5 years ago

Yes, I agree that in general rio is a better place for this. But the wrinkle is that rio returns a data.frame, so it's an inconvenience to users who like fread returning a data.table by default. I got the impression that was the gist motivating Bennet's tweet.

DrMaphuse commented 1 year ago

I would really love to see more support for this. We find ourselves increasingly moving away from R simply because our preferred dataframe package (data.table) needs a setDT call before it can work with parquet files.

The way things are now, freading .csv files is faster than reading and converting parquet files, despite the latter being the superior format in every respect. In the age of delta lakes and similar modern data-science infrastructure, this feels somewhat anachronistic, and it represents a significant bottleneck when working with large datasets.
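
To make the comparison concrete, this is the kind of timing we look at (hypothetical file names; the numbers obviously depend on the data):

library(data.table)
library(arrow)
system.time(dt1 <- fread("big.csv"))                    # straight to data.table
system.time(dt2 <- setDT(read_parquet("big.parquet")))  # read as tibble, then convert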

grantmcdermott commented 1 year ago

I'm in a similar boat to @DrMaphuse. Our entire stack is parquet+arrow based, and data.table sits a little awkwardly alongside it. Given the architectural differences and how the data are represented in memory, my guess is that some conversion penalty (whether setDT or otherwise) is unavoidable.

OTOH, it would be great to be able to use data.table's [i, j, by] syntax on arrow tables, i.e. as an alternative to the current dplyr frontend. This would let you keep arrow's out-of-memory features (efficient subsetting most obviously) and reduce the cognitive overhead of switching syntaxes once you do bring a dataset into memory. It probably requires a separate (arrow.table?) package, though.
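
For reference, the handoff we use today looks roughly like this (a sketch with made-up paths and columns; the collect() + setDT() step is where the conversion cost lands):

library(arrow)
library(dplyr)
library(data.table)

ds <- open_dataset("warehouse/sales")  # hypothetical parquet dataset; stays out of memory
dt <- ds |>
  filter(year == 2022L) |>             # filter is pushed down into arrow
  collect() |>                         # materialize in memory as a tibble
  setDT()                              # then convert to data.table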

shrektan commented 1 year ago

We find ourselves increasingly moving away from R simply because our preferred dataframe package (data.table) needs a setDT call before it can work with parquet files. [...] The way things are now, freading .csv files is faster than reading and converting parquet files.

I haven't used parquet before. When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming? Otherwise I'm confused: once the data has been read into memory, R has already converted it to SEXPs, and setDT() should be super fast, since it only bookkeeps some meta info.
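
For example, on a plain in-memory data.frame the cost looks negligible to me (an illustrative sketch; timings will vary):

> library(data.table)
> df <- as.data.frame(replicate(20, runif(1e6)))
> system.time(setDT(df))  # near-instant: just class + over-allocation bookkeeping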

DrMaphuse commented 1 year ago

When you mention that setDT() is slow, do you mean that setDT() has to read the data from disk into memory, which is time-consuming?

No, the data is already in a data.frame when setDT is applied. And setDT is indeed slow here, at least compared to fread or to just reading a parquet file into a data.frame.

BUT, setDT isn't always necessary, and this is where it gets interesting:

When a data.table is written to disk using write_parquet or write_feather, it can be read back in with read_parquet or read_feather and is instantly recognized as a data.table:

> library(data.table); library(arrow); library(magrittr)
> test_dt <- data.table(c(1, 2, 3, 4))
> test_df <- data.frame(c(1, 2, 3, 4))
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE
> test_dt %>% write_parquet(x = ., 'test_dt.parquet')
> test_df %>% write_parquet(x = ., 'test_df.parquet')
> test_dt <- read_parquet('test_dt.parquet')
> test_df <- read_parquet('test_df.parquet')
> is.data.table(test_dt)
[1] TRUE
> is.data.table(test_df)
[1] FALSE

However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference. It would be super cool if we could use this to generate data.table-parquet files with other tools.

ben-schwen commented 1 year ago

@DrMaphuse could you profile which part of setDT takes the time? I don't know the details of read_parquet, but it could use lazy indexing similar to vroom.

Not using setDT ends up being pretty much the same as just manually setting the class to data.table. You can see this by calling data.table:::truelength(test_dt) or data.table:::selfrefok(test_dt).
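
If I understand correctly, that should show up like this (a sketch; the exact values may differ across versions):

> test_dt <- arrow::read_parquet("test_dt.parquet")
> data.table::truelength(test_dt)   # expect 0: columns were never over-allocated
> data.table:::selfrefok(test_dt)   # expect a not-OK result: self-reference unset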

DrMaphuse commented 1 year ago

I tried profiling but I'm not too familiar with the profiler in RStudio. Does this answer your question? [screenshot of RStudio profiler output]

eitsupi commented 1 year ago

However, this doesn't work when reading files that were created with other tools. I've been trying to figure out how this works, but I couldn't find anything in the files' metadata that would indicate a difference.

This is simply the R attributes being written back when the arrow::Table is converted to a data.frame. The attributes are stored as metadata in the Parquet or Arrow file.
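
You can inspect this from R by reading the file back as an arrow Table instead of a data.frame (a sketch, using the test_dt.parquet file from above; how the metadata prints depends on the arrow version):

> tbl <- arrow::read_parquet("test_dt.parquet", as_data_frame = FALSE)
> tbl$metadata$r  # the serialized R attributes, including class = c("data.table", "data.frame")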

The same result can be reproduced by setting the metadata manually, as follows.

> tbl <- data.frame(x = c(1, 2)) |> arrow::arrow_table()
> tbl$metadata$r$attributes$class <- c("data.table", "data.frame")
> arrow::write_parquet(tbl, "test.parquet")
> library(data.table)
data.table 1.14.6 using 8 threads (see ?getDTthreads).  Latest news: r-datatable.com
> arrow::read_parquet("test.parquet")
   x
1: 1
2: 2

You can check the metadata of this file with pyarrow, for example.

>>> import pyarrow.parquet
>>> md = pyarrow.parquet.read_metadata("test.parquet")
>>> md.metadata
{b'ARROW:schema': b'/////4gBAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAABQBAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAADAAAAAEAAAByAAAA4gAAAEEKMwoyNjI2NTgKMTk3ODg4CjUKVVRGLTgKNTMxCjIKNTMxCjEKMTYKMgoyNjIxNTMKMTAKZGF0YS50YWJsZQoyNjIxNTMKMTAKZGF0YS5mcmFtZQoxMDI2CjEKMjYyMTUzCjUKbmFtZXMKMTYKMQoyNjIxNTMKNQpjbGFzcwoyNTQKNTMxCjEKMjU0CjEwMjYKNTExCjE2CjEKMjYyMTUzCjEKeAoyNTQKMTAyNgo1MTEKMTYKMgoyNjIxNTMKMTAKYXR0cmlidXRlcwoyNjIxNTMKNwpjb2x1bW5zCjI1NAoAAAEAAAAUAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAEDEAAAABgAAAAEAAAAAAAAAAEAAAB4AAYACAAGAAYAAAAAAAIAAAAAAA==', b'r': b'A\n3\n262658\n197888\n5\nUTF-8\n531\n2\n531\n1\n16\n2\n262153\n10\ndata.table\n262153\n10\ndata.frame\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n5\nclass\n254\n531\n1\n254\n1026\n511\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\n2\n262153\n10\nattributes\n262153\n7\ncolumns\n254\n'}