fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Documentation: add explicit comment to write_fst to say what it won't do? #120

Open rgayler opened 6 years ago

rgayler commented 6 years ago

Following on from #12: write_fst currently only supports classic data.frame and data.table objects with keys. Could you add a comment to the write_fst documentation to explicitly state that other data-frame-like objects are only supported if they can be demoted to a data.frame with atomic column types? So, list columns won't work at all, and writing then reading a tibble with atomic values will get you back a data.frame, which must be recast to a tibble if that's what you want.
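
A minimal sketch of the behaviour described above, assuming the fst and tibble packages are installed. The object names are illustrative only, and the list-column write is wrapped in tryCatch because the exact error is not quoted in this thread.

```r
library(fst)

path <- tempfile(fileext = ".fst")

# A tibble whose columns are all atomic can be written...
tbl <- tibble::tibble(id = 1:3, label = c("a", "b", "c"))
write_fst(tbl, path)

# ...but read_fst hands back a plain data.frame
out <- read_fst(path)
class(out)

# Recast explicitly if a tibble is what you want
out_tbl <- tibble::as_tibble(out)

# A list column cannot be demoted to atomic columns,
# so this write is expected to fail
df_list <- data.frame(id = 1:2)
df_list$payload <- list(1:3, "x")
result <- tryCatch(
  {
    write_fst(df_list, tempfile(fileext = ".fst"))
    "written"
  },
  error = function(e) "error"
)
result
```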

MarcusKlik commented 6 years ago

Hi @rgayler, thanks for your feedback. Indeed, the fst format is language independent, in the same way that a csv file is independent of the programming language you use to read or write it.

The specific wrapper that is used to hold the data-set is language dependent, so can't really be coded in the format. For R, we have the data.frame, data.table and tibble wrappers, which are basically the same structure but with some added syntactic sugar (although a data.table holds an extra internal self reference). But those structures have no meaning outside of the R language.
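
To illustrate the "same structure, different wrapper" point, here is a base-R sketch. It relabels a data.frame with tibble's class vector by hand purely for demonstration (no tibble package is loaded); a real tibble or data.table may carry extra attributes, such as the self reference mentioned above.

```r
df <- data.frame(id = 1:3, value = c(0.1, 0.2, 0.3))

# Same columns, different class label (hand-made for illustration)
tbl_like <- df
class(tbl_like) <- c("tbl_df", "tbl", "data.frame")

is.data.frame(tbl_like)
typeof(tbl_like)

# With the class stripped, both are the same list of columns
cols_df  <- unclass(df)
cols_tbl <- unclass(tbl_like)
identical(cols_df, cols_tbl)
```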

Therefore, the specific wrapper to use for holding the data is really up to the user. fst makes it easy to select a data.table by using as.data.table = TRUE during a read. The main reason for that is for keys to be handled correctly if they were stored in the file.
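
As a rough sketch of that read option, assuming the fst and data.table packages are installed (the column names and values are made up):

```r
library(fst)
library(data.table)

dt <- data.table(adm_id = c(3L, 1L, 2L), value = c(0.3, 0.1, 0.2))
setkey(dt, adm_id)  # sorts by adm_id and marks it as the key

path <- tempfile(fileext = ".fst")
write_fst(dt, path)

df  <- read_fst(path)                        # plain data.frame
dt2 <- read_fst(path, as.data.table = TRUE)  # data.table

class(df)
class(dt2)
key(dt2)  # shows the key stored in the file, if it was restored
```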

Do you think that users are expecting that the same data.frame-like object is returned on read? That would be convenient for local use, but perhaps restrictive if you get the fst file from an outside source. Let's say I downloaded the file from Kaggle. When I read it, I would like to be able to select the format to read to, and not be forced to read it as e.g. a tibble or data.frame. What do you think?

The list-column type is to be implemented soon. The idea is to store each element as raw data in the format. An identifier is added to signal the language and version that was used to serialize that element (that would be an ever-growing ID list). When reading from another language, those column elements could still be processed as black-box containers, but wouldn't have any meaning. So you can sort, subset, copy or delete them, but you can't deserialize them except when reading from that specific language. It's a challenge, but the list-column type would allow random access to specific elements, and that would be a very useful feature I think!
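
The idea can be sketched in base R as follows. This is only an illustration of the concept, not the fst format: pack_element, unpack_element and the field names are hypothetical.

```r
# Hypothetical container for one list-column element: the payload is
# opaque raw bytes, tagged with the language and version that wrote it.
pack_element <- function(x) {
  list(
    language = "R",
    version  = as.character(getRversion()),
    payload  = serialize(x, connection = NULL)  # raw vector
  )
}

unpack_element <- function(el) {
  if (!identical(el$language, "R")) {
    stop("element was serialized by ", el$language,
         "; cannot deserialize it in R")
  }
  unserialize(el$payload)
}

column <- lapply(list(1:3, letters[1:2], list(a = 1)), pack_element)

# Black-box operations need no knowledge of the payload
subset_col <- column[c(3, 1)]

# Deserialization only works for the matching language
unpack_element(subset_col[[2]])
```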

rgayler commented 6 years ago

Hi @MarcusKlik. Thanks for the great package and your rapid response.

> Do you think that users are expecting that the same data.frame-like object is returned on read?

I think "yes" for two reasons.

  1. I think most people have the default assumption with read/write or load/save pairs that what they get back should be exactly what they wrote out.

  2. I think if R users discover fst via CRAN or other R resources they are likely to assume that it is R-centric and if they notice the multi-platform part, still assume that every R-ish aspect will just work.

As far as documentation goes, I think just a sentence or two as a warning that because fst is platform-independent (like CSV) it won't necessarily handle everything that the user might think of as a data frame, and then give a link to what you wrote above (which is a really good explanation) as a vignette.

> When I read it, I would like to be able to select the format to read to and not to be forced to read it as e.g. a tibble or data.frame

I think that documentation-wise it's worth putting some more emphasis on the distinction you draw between the contents and the wrapper. That should help R users understand that fst is not simply a faster save/load that happens to be restricted to data frames.

Having the ability to read fst to a specific wrapper-type would probably be convenient to have in the language-specific APIs to fst and would also draw attention to the content/container distinction.

> The list-column type is to be implemented soon.

Yes - I think that will increase the user base because there are probably many more potential users of fst who are interested in it because of speed, compression, and random access than for the language-independent access. For those single-language users, not being able to use some favoured feature of frame-like native objects is probably an impediment.

MarcusKlik commented 6 years ago

Hi @rgayler, thanks for your reply and suggestions!

Perhaps combining both options would be the most user-friendly. So read_fst would return the same data-set container type (data.table, data.frame or tibble) that was used to write the data. But this default behavior could be overridden by a parameter in read_fst (e.g. _returntype = "tbl").

The extra metadata stored in the format wouldn't have any meaning outside R but it's just a few bytes so it won't hurt too much either :-)

It's a good suggestion to add some vignettes to the package. I'm preparing some blogs about fst and they could be suitable as vignettes as well. Then I could include the disclaimer in those vignettes, thanks!

rgayler commented 6 years ago

> Perhaps combining both options would be the most user-friendly.

Yes - I think that's a good idea.

MarcusKlik commented 6 years ago

Yes, I'll have to add the tibble package as an optional dependency (data.table already is). So when reading a fst file with a tibble marker, and no table-type override is specified, the tibble package will be loaded before returning the tibble. The same holds for data.table.
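
A sketch of how that restore-with-override logic might look. restore_table, marker and return_type are hypothetical names and not part of fst; marker stands in for the table-type field that would be read from the file.

```r
# Hypothetical helper: `marker` is the table type stored in the file,
# `return_type` is the optional override supplied by the user.
restore_table <- function(df, marker = "data.frame", return_type = NULL) {
  type <- if (is.null(return_type)) marker else return_type

  # Optional dependencies are only checked when they are actually needed
  need <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("package '", pkg, "' is required to return a ", type)
    }
  }

  switch(type,
    "data.frame" = as.data.frame(df),
    "data.table" = {
      need("data.table")
      data.table::as.data.table(df)
    },
    "tbl" = {
      need("tibble")
      tibble::as_tibble(df)
    },
    stop("unknown table type: ", type)
  )
}

df <- data.frame(x = 1:3)

# The file says "tbl", but the caller overrides it
class(restore_table(df, marker = "tbl", return_type = "data.frame"))
```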

thanks for your request and remarks!

noamross commented 6 years ago

Relatedly, I noticed that attributes (of the whole data.frame, rather than columns) are dropped when writing and reading FST. It makes sense to me that this is so, but perhaps it should be documented, or could this be included as black-box metadata like list elements?

MarcusKlik commented 6 years ago

Hi @noamross, thanks for your question. All the types defined in the fst format should be (programming-) language independent, so that's why attributes like tibble or data.table have no real meaning in the format. Also, in the R world, the alternative forms of a data.frame (tibble, data.table) are identical to a data.frame at their core; the difference is (almost) only in the API with which you can access the data.

But it would be very convenient to the user to restore the original form when a fst file is loaded as @rgayler points out, so I think it's a good idea to store a marker for the type of table.

Are there any other table attributes that you think are important to retrieve when the fst file is loaded?

thanks

noamross commented 6 years ago

I was actually thinking of arbitrary table attributes, as in R, object attributes can contain any R object. In my case I was using attributes to attach some lists of metadata to a table. If you chose to support this, I guess it could be implemented as a blob of data in .rds format. But I'd understand if it were out-of-scope.
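
Until something like that exists, one possible workaround is a sidecar file. The sketch below assumes the fst package is installed; write_fst_attrs, read_fst_attrs and the ".attrs.rds" suffix are hypothetical and not part of fst.

```r
library(fst)

# Hypothetical helpers: keep table-level attributes in an .rds file
# next to the .fst file, since fst itself does not store them.
write_fst_attrs <- function(x, path) {
  standard <- c("names", "row.names", "class")
  extra <- attributes(x)[setdiff(names(attributes(x)), standard)]
  write_fst(x, path)
  saveRDS(extra, paste0(path, ".attrs.rds"))
  invisible(x)
}

read_fst_attrs <- function(path) {
  x <- read_fst(path)
  extra <- readRDS(paste0(path, ".attrs.rds"))
  for (nm in names(extra)) attr(x, nm) <- extra[[nm]]
  x
}

df <- data.frame(x = 1:3)
attr(df, "notes") <- list(source = "survey", wave = 2L)

path <- tempfile(fileext = ".fst")
write_fst_attrs(df, path)

# Plain read: the custom attribute should be gone, as reported above
attr(read_fst(path), "notes")

# Read through the helper: the attribute is reattached
restored <- read_fst_attrs(path)
attr(restored, "notes")
```

Note that the restored object is a plain data.frame; only the extra attributes are carried over.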

MarcusKlik commented 6 years ago

Hi @noamross, thanks for sharing that. Being able to attach meta-data to the table would be a very useful feature indeed, for example for adding Markdown (or plain text) documentation about the table (see also #127). Complex (R) attributes are more difficult to process, however, because the rules needed for combining (or splitting) them are not known to fst.

For example, suppose you would have a min_date and max_date attribute to indicate the period of validity of the table. When you rbindlist() a second table to the first one (to be implemented), fst doesn't know how to add the attributes together. The same problem occurs when fst would serialize custom column attributes. When new data is added (or a subset is made), fst can't possibly know what to do with the attributes without prior knowledge of the meaning (think time-series for example).

So, when meta-data has a clear meaning and the rules for combining and subsetting are known, allowing for additional (column- or table-) attributes would be very nice. But raw blobs are more difficult I think.

Perhaps a solution would be to allow a list of custom attributes to be added (in write_fst()) without actually implementing any logic for combining them. So rbindlist() with a new table would just add more list elements to the meta-data. And subsetting would keep all the elements intact. That custom list won't have any meaning in other languages, so we would need to add a language marker to the list to avoid a consumer trying to deserialize it in Python for example.
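
In base R, that "append, never merge" idea could look roughly like this. All function and field names are hypothetical; it only illustrates the bookkeeping, not an fst API.

```r
# Hypothetical metadata wrapper: a language marker plus a list of
# entries that is never interpreted
new_custom_meta <- function(...) {
  list(language = "R", entries = list(list(...)))
}

# Binding two tables just concatenates the entries; nothing is merged
combine_meta <- function(a, b) {
  stopifnot(identical(a$language, b$language))
  list(language = a$language, entries = c(a$entries, b$entries))
}

# Subsetting a table leaves the metadata untouched
subset_meta <- function(meta) meta

m1 <- new_custom_meta(min_date = as.Date("2017-01-01"),
                      max_date = as.Date("2017-06-30"))
m2 <- new_custom_meta(min_date = as.Date("2017-07-01"),
                      max_date = as.Date("2017-12-31"))

m <- combine_meta(m1, m2)
length(m$entries)  # one entry per source table
```

A consumer in another language would check the language field first and skip the entries if it does not match.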

thanks!

HughParsonage commented 6 years ago

To add support for the specific case of retaining data.table: since fst is likely to be used for large data objects, it is more convenient to be able to read_fst without having to remember as.data.table = TRUE to avoid the console being flooded with output. I think adding the class of the data.frame to fst::fst.metadata rather than custom metadata would be a suitable solution (perhaps even under a 'language' element if there is no language-agnostic way to do it), e.g.

List of 7
 $ path           : chr "temp.fst"
 $ nrOfRows       : num 9808008
 $ keys           : chr "adm_id"
 $ columnNames    : chr [1:3] "adm_id" "variable" "value"
 $ columnBaseTypes: int [1:3] 4 3 5
 $ keyColIndex    : int 0
 $ columnTypes    : int [1:3] 5 3 10
 $ language_specific:List of 2
  ..$ language: chr "R"
  ..$ class   : chr [1:2] "data.table" "data.frame"
 - attr(*, "class")= chr "fstmetadata"

MarcusKlik commented 6 years ago

Hi @HughParsonage, thanks for your comment!

Indeed, the specific table type that was used to originally store the data is important enough to merit its own meta-data field in the format. The same holds true for the language, because the specific language is important for the future implementation of the list column-type (list elements will have to be (de-)serialized with language-specific serializers).

Thanks, I will make sure the correct table type is restored with read_fst() in the next release of fst.