ATFutures / calendar

R interface to iCal (.ics files)
https://atfutures.github.io/calendar/

Create ic_dataframe() #4

Closed Robinlovelace closed 5 years ago

Robinlovelace commented 5 years ago

I think going from list -> data frame makes sense, but I cannot be sure.

This answer outlines how to coerce that into a data frame. I suggest we do this after filtering out the properties we don't need, such as the VEVENT lines: we no longer need those because we already know they're events. We could have written ic_list() to omit those lines, but it is better to omit them explicitly in the next function, e.g. one called ic_dataframe() (benefit: more explicit than ic_df()): https://stackoverflow.com/questions/15201305/how-to-convert-a-list-consisting-of-vector-of-different-lengths-to-a-usable-data
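A minimal sketch of that kind of coercion in base R, under stated assumptions: the `events` list below is made-up data (one named character vector per event), not the actual output of ic_list(), and `events_to_df()` is a hypothetical helper name, not a function in this package.

```r
# Made-up input: one named character vector per event. The events carry
# different sets of properties, so the vectors differ in length.
events <- list(
  c(DTSTART = "20180809T160000Z", DTEND = "20180809T163000Z", SUMMARY = "Meeting"),
  c(DTSTART = "20180810T090000Z", SUMMARY = "Workshop", LOCATION = "Leeds")
)

# Hypothetical helper: pad every event to the union of property names
# (missing properties become NA), then bind the events together as rows.
events_to_df <- function(events) {
  properties <- unique(unlist(lapply(events, names)))
  rows <- lapply(events, function(e) unname(e[properties]))
  m <- do.call(rbind, rows)
  colnames(m) <- properties
  as.data.frame(m, stringsAsFactors = FALSE)
}

events_df <- events_to_df(events)
print(events_df)
```

Every column here is a plain character vector, with NA wherever an event lacks a property.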

Robinlovelace commented 5 years ago

Any input on this approach especially welcome, heads-up @mpadge and @layik

mpadge commented 5 years ago

Disclaimer: I don't yet have enough of an idea of what this package is doing to offer any kind of informed opinion here.

Nevertheless, my one thought would be to be careful doing this. It is surprisingly easy to coerce complex lists (of lists of lists) into simple(-looking) data frame columns, but this comes at the single and very important expense of computational efficiency. These only look simple because of the default print method, which compacts those list items (and/or the equivalent for tibble objects). However, any operations on those columns have to unpack the (potentially nested) lists, and that is still a very inefficient operation. This is what makes lots of sf operations quite slow, and is the reason why sf needed, and now has, a wealth of carefully hand-crafted geometric operations. These are all done in C++, in which this kind of unpacking, disassembling, and reassembling is reasonably efficient, but in R it remains strikingly inefficient.

Jim Hester gave a great talk about his glue package at useR, half of which was about efficiency and the ease of sticking glue operations in any pipeline. We should definitely have the same mindset here: giving users the impression that columns are neat and simple, when in reality they are nested lists, hides operations that are generally highly inefficient. The appropriate summary here is probably standard tidyverse evangelism: each column is one variable; each row is one observation. It is very easy to achieve optimal computational efficiency in that case.
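To make the distinction concrete, here is a small base R sketch with made-up data and column names. It only shows the difference in how the two kinds of column are accessed; it is not a benchmark and says nothing about how this package stores events.

```r
# Made-up data: the same start times stored two ways.

# 1. A plain (atomic) character column.
flat <- data.frame(
  summary = c("Meeting", "Workshop"),
  dtstart = c("20180809T160000Z", "20180810T090000Z"),
  stringsAsFactors = FALSE
)

# 2. A list-column: each cell is itself a list, although the printed
#    data frame can still look tidy.
nested <- data.frame(summary = c("Meeting", "Workshop"), stringsAsFactors = FALSE)
nested$dtstart <- list(
  list(value = "20180809T160000Z"),
  list(value = "20180810T090000Z")
)

# Atomic column: vectorised functions apply to the whole column directly.
flat_year <- substr(flat$dtstart, 1, 4)

# List-column: every cell has to be unpacked in an R-level loop first.
nested_year <- vapply(
  nested$dtstart,
  function(x) substr(x$value, 1, 4),
  character(1)
)

# Both routes give the same answer; only the amount of unpacking differs.
identical(flat_year, nested_year)
```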

Robinlovelace commented 5 years ago

Computational efficiency is not really a concern because iCal files tend to be tiny. And if we did want to optimise after finding some large iCal files, e.g. huge ones spewed out by GTFS feeds, then we should do that after the functionality is there, according to this quote from Donald Knuth:

“premature optimization is the root of all evil (or at least most of it) in programming” (Knuth 1974).

Robinlovelace commented 5 years ago

But any suggestions on how to get the functionality working v. welcome. I'm thinking one row per event, to coincide with the default output of ic_list().
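A rough sketch of what one row per event could look like, starting from raw lines. Everything here is an assumption for illustration: the `ics_lines` input is made up, `event_row()` is a hypothetical helper, every event is assumed to be non-empty and to carry the same properties, and property parameters (e.g. `DTSTART;TZID=...`) and folded lines are ignored.

```r
# Made-up iCal content, one element per line of the file.
ics_lines <- c(
  "BEGIN:VCALENDAR",
  "BEGIN:VEVENT",
  "DTSTART:20180809T160000Z",
  "SUMMARY:Meeting",
  "END:VEVENT",
  "BEGIN:VEVENT",
  "DTSTART:20180810T090000Z",
  "SUMMARY:Workshop",
  "END:VEVENT",
  "END:VCALENDAR"
)

# Positions of the lines that delimit each event.
starts <- which(ics_lines == "BEGIN:VEVENT")
ends <- which(ics_lines == "END:VEVENT")

# Hypothetical helper: take the lines strictly between one BEGIN/END pair
# (so the VEVENT lines themselves are dropped explicitly) and return a
# one-row data frame with one column per property.
event_row <- function(start, end) {
  body <- ics_lines[(start + 1):(end - 1)]
  key <- sub(":.*$", "", body)       # text before the first colon
  value <- sub("^[^:]*:", "", body)  # text after the first colon
  as.data.frame(as.list(setNames(value, key)), stringsAsFactors = FALSE)
}

# One row per event, one column per property.
events_by_row <- do.call(rbind, Map(event_row, starts, ends))
print(events_by_row)
```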

Robinlovelace commented 5 years ago

But I'm definitely in favour of making our data 'tidy', defined as data frames in which:

Each variable forms a column.

Each observation forms a row.

Each type of observational unit forms a table.

Source: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Robinlovelace commented 5 years ago

Starting with events as rows it would be hard not to follow this definition - eminently sensible to consider though and any other thoughts v. welcome.