Display-Lab / bit-stomach

Data ingest and performer annotation

Settle on information storage schema #7

Closed. ProfFancyPants closed this issue 5 years ago.

ProfFancyPants commented 6 years ago

So I wrote a fairly long message here about an hour ago, then walked away from the wireless connection and submitted it. Nothing could have saved it, not even the back button. hehe

So I looked at your basic and count examples, and I want to get a sense of how you would want that information stored in a PERFECT WORLD, assuming no R limitations, etc.

Since you mentioned people will be maintaining this information, it might be a good idea to have it in the easiest-to-manage form. If the file system you have currently is the best, then so be it.

Also, I want to get a sense of what objects inside the annotation.R files change. Do the function names change, etc.? I see where you might be having trouble regarding environments. It is advised to use tidy functions within functions. They are mainly for interactive scripting and don't work well non-interactively. Unfortunately, a lot of R people have lost their ability to program any other way, so they are kind of stuck like that.

We can still fix things so you can use tidy the way you want to, but it will be like adding a band-aid.

grosscol commented 6 years ago

Sorry it took me so long to respond to this.

Since you mentioned people will be maintaining this information, it might be a good idea to have it in the easiest-to-manage form. If the file system you have currently is the best, then so be it.

grosscol commented 6 years ago

sense of what objects inside the annotation.R files change.

The annotation functions are specific to each client or collaborator. What is a meaningful has_gap to one collaborator in their performance data is not expected to be applicable to the situation and performance data of any other collaborators.

That being the case, annotations are expected to be different for each collaborator. The function naming convention and parameter signature will be the same across different annotations, e.g. "annotate_some_attribute".
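For illustration only (the column name and the FALSE placeholder are assumptions, not the real schema), the shared shape might look roughly like:

```r
# Placeholder illustrating the shared convention: same "annotate_" prefix, same
# signature, same output shape; only the rule inside differs per collaborator.
annotate_some_attribute <- function(perf_data) {
  data.frame(
    id               = as.character(unique(perf_data$id)),
    some_attribute   = FALSE,  # the collaborator-specific rule would go here
    stringsAsFactors = FALSE
  )
}
```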

grosscol commented 6 years ago

It is advised to use tidy functions within functions. They are mainly for interactive scripting and don't work well non-interactively.

I assume you mean "inadvisable to use tidy functions within functions". Why is that the case?

ProfFancyPants commented 6 years ago

There is a single case where there is a mysql instance to query for performance data, but that is upstream of this project.

When it gets to R, is that information stored in memory?

Annotation functions that sort out an attribute of a performer are expected to live in a file usually named annotations.r

I am trying to get at the ought, not the "are." Avoiding convoluted initial data storage will save you a world of complexity in the long run if you can help it.

What is a meaningful has_gap to one collaborator in their performance data

So it sounds like has_gap is a boolean, and the size of the gap might be a number. When I ask what these functions are attempting to do, I am wondering whether they do something similar enough that the functions themselves would not have to change if you simply load them with any one doctor's info. For instance:

Dr1   gap      2
Dr2   no gap   0
Dr3   gap      3

Maintaining and updating the information in table form, and employing generalized functions to apply that information, is WORLDS better than digging through code when something needs to be changed.
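Something along these lines (everything here is made up for illustration; names, columns, and the threshold are hypothetical):

```r
# Hypothetical sketch: keep the per-doctor values in a table and drive a single
# generalized function from it, instead of editing a function body whenever a
# collaborator's numbers change.
attr_table <- data.frame(
  id       = c("Dr1", "Dr2", "Dr3"),
  gap_size = c(2, 0, 3),
  stringsAsFactors = FALSE
)

annotate_from_table <- function(tbl, column, threshold = 0) {
  data.frame(
    id               = tbl$id,
    has_gap          = tbl[[column]] > threshold,
    stringsAsFactors = FALSE
  )
}

annotate_from_table(attr_table, "gap_size")
#    id has_gap
# 1 Dr1    TRUE
# 2 Dr2   FALSE
# 3 Dr3    TRUE
```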

I assume you mean "inadvisable to use tidy functions within functions". Why is that the case?

Essentially, you have first-hand knowledge as to why that is the case. Also, there is a short discussion here: https://a2mads.slack.com/archives/C44LRR43A/p1535660848000100

In general, R is a high-level language. There are so many different ways of doing the same thing syntactically, and a large proportion of the code was written by non-programmers. These create a unique problem for R especially regarding its reputation for being slow -- which it doesn't have to be. It fights between two worlds: being a surface scripting language, and a programming language.

The issue is that, because of the diversity in R's syntax, many things can be done to make R scripting easier for a layman doing munging; the tradeoff is that high-level (usually tidy) functions are extremely slow and horrific to employ using "non-standard evaluation", i.e. computing on the language. Other scripting languages know that the vast majority of the time they are meant to call scripts/instructions but aren't meant to be the functions themselves (those would be written in C or something).

R has an identity crisis, and the vast majority of people who use R really have no sense of what I am talking about. They employ heavy-handed functions for very simple things and have no sense of how the four function environments work, which gets them into impossible-to-fix situations. They can get into deep trouble when they attempt to move from something that looks like script automation to actually programming functions.
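To make that concrete, here is a minimal sketch of the trap and the quosure-based workaround, assuming dplyr and rlang (the wrapper names are made up):

```r
library(dplyr)
library(rlang)

# Naive wrapper around a tidy verb: data masking first looks for a column
# literally named "col"; failing that, it forces the promise, which tries to
# find an object called mpg in the calling environment and errors.
mean_of <- function(df, col) {
  df %>% summarise(avg = mean(col))
}

# Tidy-eval wrapper: capture the argument as a quosure and splice it back in,
# so it is evaluated against the data frame the caller intended.
mean_of_quo <- function(df, col) {
  col <- enquo(col)
  df %>% summarise(avg = mean(!!col))
}

mean_of_quo(mtcars, mpg)  # works
mean_of(mtcars, mpg)      # Error: object 'mpg' not found
```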

grosscol commented 6 years ago

When it gets to R, is that information stored in memory?

The performance data in that case gets dumped as a csv either into a named pipe or a file on disk.
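For reference, pulling that dump into R is a single call either way; the path below is a placeholder, not the project's real file name:

```r
# Placeholder path: the dump is plain csv, so reading it back in R is one call.
perf_data <- read.csv("perf_dump.csv", stringsAsFactors = FALSE)

# If the upstream process writes to a named pipe instead, base R's fifo()
# connection can be passed to read.csv() in place of the path.
```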

I am trying to get at the ought, not the "are." Avoiding convoluted initial data storage will save you a world of complexity in the long run if you can help it

The annotations ought to live as functions in a file on disk. An annotation function takes performance data and returns a table with two columns (id, annotation_name) that are string and boolean respectively. Annotations are not expected to be reused between clients or situations. They are complex enough that storing and interpreting a simple config wasn't cutting it. For example, Client A 2017 defined has_gap as "when the most recent duration was more than 25 minutes", while Client B 2018 defined has_gap as "when the mean documentation rate from the most recent 5 timepoints is less than 90% of that of the other clinicians".
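A hedged sketch of what one of those might look like (the column names id, timepoint, and duration_minutes are assumptions, not the actual performance data schema):

```r
# Sketch of a Client A 2017 style rule: has_gap is TRUE when the performer's
# most recent duration exceeded 25 minutes.
annotate_has_gap <- function(perf_data) {
  # order by performer and time, then keep the most recent row per performer
  ordered <- perf_data[order(perf_data$id, perf_data$timepoint), ]
  latest  <- ordered[!duplicated(ordered$id, fromLast = TRUE), ]

  data.frame(
    id               = as.character(latest$id),
    has_gap          = latest$duration_minutes > 25,
    stringsAsFactors = FALSE
  )
}
```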

Essentially, you have first-hand knowledge as to why that is the case. Also, there is a short discussion here: https://a2mads.slack.com/archives/C44LRR43A/p1535660848000100

I could not follow that link in my browser. Was the point of that discussion that quosures make code more difficult to reason about?

In general, R is a high-level language. There are so many different ways of doing the same thing syntactically, and a large proportion of the code was written by non-programmers. These create a unique problem for R especially regarding its reputation for being slow -- which it doesn't have to be.

Speed is not an issue for this project until it becomes one. We'd like to avoid the pitfall of pre-optimization. If it does become an issue, writing benchmarks will be the first step in identifying and refactoring the slow code.
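If that day comes, the benchmark itself is cheap to write; a minimal sketch using the microbenchmark package (the two functions are stand-ins, not project code):

```r
# Hypothetical benchmark comparing two implementations of the same step; only
# worth writing once a real slowdown shows up.
library(microbenchmark)

slow_double <- function(x) sapply(x, function(v) v * 2)  # element-wise loop
fast_double <- function(x) x * 2                         # vectorized

microbenchmark(
  slow = slow_double(1:10000),
  fast = fast_double(1:10000),
  times = 100
)
```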

It fights between two worlds: being a surface scripting language, and a programming language

What is the difference between a scripting language and a programming language here, and why is it relevant?

ProfFancyPants commented 6 years ago

They are complex enough that storing and interpreting a simple config wasn't cutting it.

Well, as long as you have already considered the simpler route and have found it lacking, then I am convinced. :)

We'd like to avoid the pitfall of pre-optimization.

I am not really concerned about speed in this context either. Fault tracking, defensive programming, readability, and procedural contiguity are much more important, I think. Speed can sometimes be helpful for testing/development purposes, though: if the process takes a long time (like a minute or more) to get to the point where an issue happens, then tracking down what is going on gets a lot more annoying.

We'd like to avoid the pitfall of pre-optimization.

Pre-optimization like using C++ right off the bat? Eww, weird. No, I prefer modularizing/breaking the process down into chunks and getting each component at least semi-functional before I start expanding the scope/utility/robustness and adding all the details to each part of the process.

What is the difference between a scripting language and a programming language here, and why is it relevant?

The functions to accomplish your goal don't really exist, or they are hidden in some undocumented hidey-hole deep within GitHub. The function environment management you have described is very straightforward R, but not something that 95% of R users ever really need to do. So you will have to program functions that work well together and are not assumed to be used interactively.

ProfFancyPants commented 6 years ago

I had the thought last night that there might be some confusion. The functions/procedures employed within the annotation functions are mostly unrelated to this discussion. From what you described, others will be sorting out what goes into those functions, and regardless of what tools they use to do so, the point of the system is to apply those functions without issues.

Also, these unresolved magrittr issues are an example of one of the ways "tidy" implementation can cause the sort of issues you had to deal with:

https://github.com/tidyverse/magrittr/issues/171
https://github.com/tidyverse/magrittr/issues/38
https://github.com/tidyverse/magrittr/pull/70

I am hoping that one day they resolve it. Honestly, I know how they could do it, but it is weird to me that they haven't tried it.