fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0
619 stars 42 forks

FR: include table and column documentation in FST file #127

Open treysp opened 6 years ago

treysp commented 6 years ago

Thanks so much for all your work on fst - it's incredibly useful!

My question is about the possibility of storing arbitrary object-level attributes in fst files. The attributes would be additions to the list of object attributes fst already stores.

The use case is storing metadata about the file such that it is somewhat self-documenting (e.g., store parameters from function used to create it).

Example of adding attribute "other"; fst does not currently retain it when written/read:

mydata <- data.frame(a = 1)
attributes(mydata)
#> $names
#> [1] "a"
#> 
#> $row.names
#> [1] 1
#> 
#> $class
#> [1] "data.frame"

attr(mydata, "other") <- "this is some useful info"
attributes(mydata)
#> $names
#> [1] "a"
#> 
#> $row.names
#> [1] 1
#> 
#> $class
#> [1] "data.frame"
#> 
#> $other
#> [1] "this is some useful info"

temp <- tempfile(fileext = ".fst")
fst::write_fst(mydata, temp)

attributes(fst::read_fst(temp))
#> $names
#> [1] "a"
#> 
#> $row.names
#> [1] 1
#> 
#> $class
#> [1] "data.frame"

Edit: related to column-specific attributes discussed in #103

MarcusKlik commented 6 years ago

Hi @treysp, thanks for your feature request! Indeed, fst only stores three possible 'attributes' for each column: the base type, a type-specification attribute and a scale. Also, for some types there is an extra string attribute (such as for time zones).

The reason is that stored types could be read from languages other than R, so they can't be R-specific (and special attributes usually are :-)). It would be useful to store documentation in the fst file itself, however: for example, Markdown documentation per column and for the whole table. Would such a documentation feature be enough to store the information you want to add, or do you require specific attributes to store type information?

thanks

treysp commented 6 years ago

Hi @MarcusKlik, Markdown documentation per column and for the whole table would do exactly what I need. Great idea, and thanks so much for fst!

MarcusKlik commented 6 years ago

Nice! In RStudio, we could try to show the documentation in the viewer pane using these instructions: compile the markdown to a local HTML file and display that in the viewer.

That way, you can take a look at the table and column documentation in the viewer. I think that might be a very useful feature!
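The viewer idea above could be sketched roughly as follows. This is only an illustration, not anything fst provides: `render_doc_html()` and `show_doc()` are hypothetical names, the `commonmark` package is assumed as the Markdown converter (with a plain `<pre>` fallback when it is not installed), and the file is written under `tempdir()` because the RStudio viewer only displays local content from the session temporary directory.

```r
# Hypothetical helper: turn a Markdown string into a local HTML file.
render_doc_html <- function(md_text) {
  html_file <- tempfile(fileext = ".html")  # lives inside tempdir()
  body <- if (requireNamespace("commonmark", quietly = TRUE)) {
    commonmark::markdown_html(md_text)
  } else {
    # Fallback when commonmark is missing: show the escaped raw Markdown.
    escaped <- gsub("&", "&amp;", md_text, fixed = TRUE)
    escaped <- gsub("<", "&lt;", escaped, fixed = TRUE)
    paste0("<pre>", escaped, "</pre>")
  }
  writeLines(c("<!DOCTYPE html>", "<html><body>", body, "</body></html>"),
             html_file)
  html_file
}

# Hypothetical helper: use the RStudio viewer if present, else a browser.
show_doc <- function(html_file) {
  viewer <- getOption("viewer")
  if (!is.null(viewer)) viewer(html_file) else utils::browseURL(html_file)
}

# Only open a window in an interactive session.
if (interactive()) {
  show_doc(render_doc_html("# mycolumn\n\nDocumentation for `mycolumn`."))
}
```

Keeping rendering and display separate means the HTML step can be exercised outside RStudio as well.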

treysp commented 6 years ago

Agreed - thanks so much!

xiaodaigh commented 6 years ago

I plan to add a feature to store metadata about the column in the disk.frame package, so things like a data dictionary for each column can be stored there. So I think another way to extend functionality is to allow fst to stay lean and have an fstmeta package that wraps around all of these things.
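A wrapper along those lines might look like the sketch below. Everything here is an assumption for illustration: `write_with_attrs()` and `read_with_attrs()` are hypothetical names (no fstmeta package is implied to exist), and the custom attributes go into a sidecar file named `<path>.attrs.rds` rather than into the fst file. The writer and reader are parameters that default to `fst::write_fst` and `fst::read_fst`, so the same sketch works with any format that drops attributes.

```r
# Attributes every data.frame carries; anything else counts as "custom".
standard_attrs <- c("names", "row.names", "class")

# Hypothetical wrapper: write the table, then save custom attributes
# to a sidecar file next to it.
write_with_attrs <- function(x, path, write_fun = fst::write_fst) {
  extra <- attributes(x)[setdiff(names(attributes(x)), standard_attrs)]
  write_fun(x, path)
  saveRDS(extra, paste0(path, ".attrs.rds"))
  invisible(path)
}

# Hypothetical wrapper: read the table and re-attach whatever the
# sidecar file holds, if one exists.
read_with_attrs <- function(path, read_fun = fst::read_fst) {
  x <- read_fun(path)
  sidecar <- paste0(path, ".attrs.rds")
  if (file.exists(sidecar)) {
    extra <- readRDS(sidecar)
    for (nm in names(extra)) attr(x, nm) <- extra[[nm]]
  }
  x
}
```

The drawback of a sidecar is that it can be separated from the data file, which is the argument for keeping the information inside the fst file itself.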

MarcusKlik commented 6 years ago

Hi @xiaodaigh, thanks for your comment. I agree that a 'lean' fstlib would leave the most room for building custom implementations on top of it. Perhaps it would be enough to allow storage of a single custom string (or, even more generally, a raw vector) for each column. The user or a wrapper could use that string for documentation or to store other custom attributes. The nice thing about that is that the extra information is kept in the fst file itself and will always be available when the data is. A flag could be used to indicate the type of information, e.g. markdown, plain text, raw. For custom raw data, the information obviously wouldn't be portable to other languages or implementations...
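To make the "single payload plus a type flag" idea concrete, here is a minimal sketch in base R. The names `make_payload()` and `read_payload()` and the flag values are assumptions, not part of fst or fstlib. Text payloads are kept as UTF-8 strings that any language could read, while the `"raw"` flag wraps an arbitrary R object with `serialize()`, which is the non-portable case mentioned above.

```r
# Hypothetical constructor for a per-column payload with a type flag.
make_payload <- function(x, type = c("markdown", "text", "raw")) {
  type <- match.arg(type)
  data <- if (type == "raw") {
    serialize(x, connection = NULL)  # R-only: arbitrary object as raw bytes
  } else {
    stopifnot(is.character(x), length(x) == 1)
    enc2utf8(x)                      # portable: plain UTF-8 text
  }
  list(type = type, data = data)
}

# Hypothetical reader: the flag decides how the payload is decoded.
read_payload <- function(p) {
  if (p$type == "raw") unserialize(p$data) else p$data
}

doc  <- make_payload("# mycolumn\n\nWeight in kilograms.", "markdown")
meta <- make_payload(list(unit = "kg", source = "survey"), "raw")
```

A reader in another language could still use `doc` by checking the flag and skipping payloads marked `"raw"`.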

schelhorn commented 6 years ago

Hi @MarcusKlik, I'd be interested in using per-fst-file custom metadata as well. Are there any specific plans in that regard? Thanks!

MarcusKlik commented 6 years ago

Hi @schelhorn, thanks for your request. Stored R-specific metadata (special attributes or custom R objects) would not have any meaning in languages other than R. So that information would have to be marked with a specific language marker to signal the type of information stored. But textual information (markdown, plain text, etc.) can be stored and used by all languages, so that would definitely be a nice addition (especially markdown documentation).

What kind of metadata would be interesting for you to store in the fst file?

schelhorn commented 6 years ago

For column-metadata I'd store information on what the data in that column means and how it should be interpreted. For per-file metadata, I'd store information on the provenance of the whole data set (by which person/organization and method it was generated and at which date). Encoding this in filenames sucks...

MarcusKlik commented 6 years ago

Thanks, so your metadata could very well be stored in markdown text attributes attached to columns and the table as a whole. And when the user works with RStudio, it would be nice to be able to display the documentation in the viewer pane. We could show the documentation with:

doc_fst("mytable.fst", "mycolumn")

(for displaying the documentation for column mycolumn). It should be possible to add documentation when creating the table (a named-list argument doc?) and also afterwards:

doc_fst("mytable.fst", mycolumn =
   "# mycolumn \n\nThis is markdown documentation of column `mycolumn`.")

thanks for submitting the feature request!
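The getter/setter calling convention proposed above can be mimicked with a small sketch. Note the assumptions: `doc_sketch()` is a hypothetical stand-in for the proposed `doc_fst()`, and because the fst format discussed in this thread has no documentation storage, the text is kept in a sidecar file `<path>.doc.rds` instead of inside the fst file. Named arguments set documentation, a single unnamed argument looks up one column, and no extra arguments returns everything stored.

```r
# Hypothetical stand-in for the proposed doc_fst() interface.
doc_sketch <- function(path, ...) {
  sidecar <- paste0(path, ".doc.rds")
  docs <- if (file.exists(sidecar)) readRDS(sidecar) else list()

  args <- list(...)
  arg_names <- names(args)
  if (is.null(arg_names)) arg_names <- rep("", length(args))
  named <- nzchar(arg_names)

  if (any(named)) {
    # Setter: doc_sketch(path, mycolumn = "markdown text")
    docs[arg_names[named]] <- args[named]
    saveRDS(docs, sidecar)
    return(invisible(docs))
  }
  # Getter for one column: doc_sketch(path, "mycolumn")
  if (length(args) == 1) return(docs[[args[[1]]]])
  # No extra arguments: return all stored documentation.
  docs
}

path <- tempfile(fileext = ".fst")
doc_sketch(path, mycolumn =
  "# mycolumn \n\nThis is markdown documentation of column `mycolumn`.")
doc_sketch(path, "mycolumn")
```

Looking up a column without documentation returns `NULL`, and a column literally named `path` would clash with the first argument, so a real interface would need a more careful signature.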