Feature: Add data transformation language and UI

hhpmmd commented 3 years ago

Consider the following two scenarios:

1. You track your weight and want an overview of how much weight you gained or lost each week or month. This is not possible because you only have the absolute weight values. What you need is how much the weight changed each time you tracked it.
2. You track how long you work on a project (work one hour -> track one hour) and want a plot of how much time you've spent on the project in total over time. What you need is to add the tracked values together to get the running total over time.

Mathematically speaking, sometimes you want the derivative (1) and sometimes the antiderivative (2) (I hope I'm using those terms right).

Solution: Computing the data should be easy. For (1) subtract the previous tracked value from the new one, and for (2) add a sum of the previously tracked values to each value. Since the data can be generated on the fly from the original data, no additional data has to be saved in a database and there should be no backward compatibility issues.
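A minimal sketch of the two computations, in Python purely for illustration (the function names `delta` and `running_total` are made up here and are not part of the app):

```python
from itertools import accumulate


def delta(values):
    """(1) Difference between each value and the one tracked before it."""
    return [b - a for a, b in zip(values, values[1:])]


def running_total(values):
    """(2) Sum of each value and all values tracked before it."""
    return list(accumulate(values))


weights = [80.0, 79.5, 79.0, 80.0]  # absolute weight values
hours = [1, 2, 1, 3]                # hours tracked per session

print(delta(weights))        # [-0.5, -0.5, 1.0]
print(running_total(hours))  # [1, 3, 4, 7]
```

Both results are derived from the original values alone, which is why nothing extra would need to be stored.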

There are different ways to have a user enable this data. I thought about having a check mark for numerical / time tracker:

[Screenshot: added_checkboxes]

Checking a mark would then lead to an additional entry when selecting data to plot, similar to how time-duration data has multiple entries for time-duration, hours, minutes and seconds.

I'm open to feedback, especially when it comes to wording (I'm not a native English speaker).

I think I can probably implement this myself, though some hints as to where I would likely have to change things would speed this up.

SamAmco commented 3 years ago

You are absolutely right that this is a weakness of the app as it stands. I would like to broaden the set of data attributes that it is possible to graph. Differentials and integrals are a good start, but there are also things like tracking frequency or number of data points tracked (which could be useful for naturally un-ordered data like: 0:Oranges, 1:Apples, 2:Pears etc.), and then there's the time of day something's tracked or the time between tracking. I'm not saying you need to implement these but whatever solution is used here should scale well for other data attributes.

However I think I would prefer that this was not added to the tracking side because:

I think it is confusing for first time users. Bear in mind that this is one of the first screens a new user sees and too much detail might scare them off.
If I want to draw a graph of the differential of my data, I would need to edit the tracker to allow differentials first and then go back to create my graph of the differential. This kind of goes against the ethos of Track & Graph. If the data can be calculated for anything, why should I have to enable it when setting up the tracker? I want to be able to iterate quickly when setting up graphs and trying to gain insight into my data.
There is a history view for the tracked data. If the user checks that they care about the differential, do we need to add this data to the history too? If not I think users might be confused, but if we do we are reporting data insights in the tracking side which we really want in the graphing side.

Since all of these transformations can be computed on the fly without needing to change the database structure at all, we should be able to put these options almost anywhere.

My first idea was to add this option to the graph side. So when a user selects which data set they want to graph, instead of bringing up a big list of data sets we bring up a UI that lets you select both a data set and a data attribute. But over time I have gone off this idea. I think this approach creates some level of technical debt because ultimately there are lots of similar problems I would like to solve. Fundamentally there are lots of different ways that people want their data to be transformed before it is graphed, so combining the transformation and graphing stages gets rapidly more complicated as you try to support more and more users' requirements. How do you combine totals and averaging and differentials and offsets and scalars? What happens when someone wants an option to tweak the order of operations? Trying to support too many things in the graphing interface will require countless new check-boxes and drop-downs that would just make the app unusably messy for most users.

My current best idea is more involved but it is something I have been thinking about for a while which has the potential to solve all sorts of problems at once. Here are some examples of people asking for new ways to transform/combine their data:

https://github.com/SamAmco/track-and-graph/issues/7
https://github.com/SamAmco/track-and-graph/issues/49
https://github.com/SamAmco/track-and-graph/issues/61

I have many more in emails. So for example a request I get a lot is that users want to be able to see data combined from various sources. e.g. If I track wine, beer and whiskey separately how do I draw a graph of alcohol in total. What needs to happen is that data transformation should be separate from visualisation (so technically it should be Track&Transform&Graph). So what I would really like to do is to add another top level menu screen called "Functions" (alongside Home, Reminders, Notes, etc.. ). The functions screen would again be a list view with a plus button in the top right which would create a new function. The function would take a name (the name of the output data) and a text input that represents a transformation of one or many input data sets. I haven't thought through how this would work yet but it could allow you to create entries something like:

Alcohol = daily_sum((whiskey * 3), (wine * 2), beer)
Weight changes = differential(weight)
Weight changes weekly = differential(total(weight, WEEKLY))

etc. etc.
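To make the first entry concrete, here is a rough Python sketch of what a weighted daily sum over several trackers could compute. The `daily_sum` signature and the (date, value) tuple representation are assumptions made for this illustration only:

```python
from collections import defaultdict
from datetime import date


def daily_sum(*weighted_sources):
    """Each argument is (points, weight); points are (date, value) pairs."""
    totals = defaultdict(float)
    for points, weight in weighted_sources:
        for day, value in points:
            totals[day] += value * weight
    return sorted(totals.items())


whiskey = [(date(2024, 1, 1), 1)]
wine = [(date(2024, 1, 1), 2), (date(2024, 1, 2), 1)]
beer = [(date(2024, 1, 2), 3)]

alcohol = daily_sum((whiskey, 3), (wine, 2), (beer, 1))
print(alcohol)
# [(datetime.date(2024, 1, 1), 7.0), (datetime.date(2024, 1, 2), 5.0)]
```

The output has the same shape as the inputs (a timestamp and a value), so it could be graphed like any other tracked data.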

I'm sure you can appreciate the power here but a few of the advantages I see are:

Once the base implementation is there it would allow Track & Graph to expand in functionality quite quickly and easily as it wouldn't require lots of new UI and design work for every new specific request.
It keeps advanced functionality away from basic functionality which will hopefully avoid overwhelming new users and also allow the app to support more advanced users at the same time.
It allows fine control over order of operations.

You could then select any of the data created by a function in the data set drop down when creating a graph/visualisation. So each function must simply output 2 dimensional data with a value and a timestamp as if it were any other tracked data.

This approach is obviously much more involved and requires the development of some composable expandable language interpreter. I have never personally written anything like this so I'm not entirely sure how much work would be required. On the plus side it's much more powerful and kills many birds with one stone.

One final note is that either way the documentation will need updating. There is an FAQ in the app that a lot of new users find helpful so it needs to be kept up to date and it must document well exactly how everything works.

Perhaps this has put you off but I was also hoping maybe it would inspire you :P .. I have no real time for this project now unfortunately so I will not be implementing this any time soon, but if you're interested I will support you in any way I can.

hhpmmd commented 3 years ago

Yeah, I also had a basically very similar idea of having these programmable trackers and I think you are right in the conclusion that in the end this might be the only logical next step in regards to these issues.

I did some minimal research and found that having an interpreter in an Android app has been done before, so it would be possible to do, and I also think writing the interpreter itself (parsing and computing the input from the user) is an interesting challenge that I could probably motivate myself to do.

However I think there is probably also a lot of UI and some internal stuff to be done which a) I'm not sure how much work it is, b) might be harder for me to get into/I need more understanding of the underlying architecture and c) I'm historically less motivated to do.

Can you give an estimate of how much work you think adding the new UI stuff / integrating everything into the system might be? And optimally if you could see yourself doing some ui/integration work in the near future, if I were to implement an interpreter?

SamAmco commented 3 years ago

Well if you have the motivation that would be awesome. I think it really comes down to the specifics of the proposal.

Wrt integration: if all we have is a new table in the db for functions that just contains ID, name and functionText (or something simple like this) and then from the graphing side we access the data the exact same way as we do any other feature (i.e. there is a layer of abstraction that returns the data for any feature or function) then most of the work should be in defining/writing/documenting the language/parser. I can try and give you pointers on where to look for stuff architecture wise.
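As a rough illustration of how small that table could be, here is a sketch using Python's built-in `sqlite3` module. The table name, column names and the `load_function` helper are hypothetical; the app's real schema is not described in this thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table: just an ID, a name and the function's source text.
conn.execute(
    "CREATE TABLE functions ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " function_text TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO functions (name, function_text) VALUES (?, ?)",
    ("Weight changes", "Delta(.weight)"),
)


def load_function(connection, function_id):
    """Return (name, function_text) for a stored function, or None."""
    return connection.execute(
        "SELECT name, function_text FROM functions WHERE id = ?",
        (function_id,),
    ).fetchone()


print(load_function(conn, 1))  # ('Weight changes', 'Delta(.weight)')
```

Everything else (parsing and evaluating `function_text`) would live behind whatever layer already hands data points to the graphing side.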

I'm sure the UI wouldn't be too much trouble for me to do at that point. I can only really find a few hours in a week for this but even still I envision UI to be quite minimal.

I would be interested in seeing what you've found re interpreters in other apps?

hhpmmd commented 3 years ago

Okay so I found this blog post ( https://tomassetti.me/jariko-an-rpg-interpreter-in-kotlin/ ) which uses https://www.antlr.org/ which appears to be a tool that can parse 'any' grammar you design. Since the one needed here should be pretty simple (function name and arguments in brackets for most if not everything) there probably is already some grammar which only needs minor changes. Since it's so general this would be my first thing to try to make work.

There are also some projects in which the grammars are hard coded:

https://github.com/Javier-Barrio/klisp - a lisp interpreter written in kotlin, nice bc there's not much boilerplate code, not sure how easy it is to switch from the lisp grammar to something else manually
https://codeberg.org/kollo/X11-Basic - a basic interpreter already in an app, but honestly looks way too complicated to adapt.

So I think the blogpost and the first project are two good starting points.

I haven't checked (and it isn't really my expertise) regarding the licenses and if they are compatible with this project. It would be nice if you could give me a heads up whether either source is ok to use.

SamAmco commented 3 years ago

Nice work, I will try and take a look at them all soon and get back to you.

SamAmco commented 3 years ago

Antlr looks like it's probably a good idea. The license: https://github.com/antlr/antlr4/blob/master/LICENSE.txt is pretty permissive so that shouldn't be an issue as long as we include it in the project.

Before we get into implementation I think we need to find good examples of simple data transformation languages to draw on. We want something very simple and elegant that most users can learn easily but with enough flexibility to allow us to expand on it. I will try and think about some of the requirements and desired functions and collect a list in this thread probably this weekend. After that I will try to collect some good examples of simple languages/grammars that might be best in this thread also.

SamAmco commented 3 years ago

Data definition

First I would like to define the data we expect to be working with in any/all functions. All data sets are a list of data points ordered by time from oldest to newest where a data point is an object containing:

A timestamp
A value (Always numerical although the number can represent different things such as time)
A label (Optional text information)
A note (Optional text information)

Any input or output data set will be either:

Regular: data that is sampled periodically with the exact same amount of time between each data point
Irregular: data with varying amounts of time between the data points.

(Note that no data can be considered regular unless it has been transformed by a function that declares its output as regular. Note also that regular data must be associated with a period and that not all functions can take multiple operands that are regular but with different periods.)

In addition any input/output data will have one of the following types:

Time duration data
Numerical data
Text data: e.g. notes
Labelled data i.e. multiple choice. This inherits both numerical and text.
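One way to picture this data model is the following Python sketch. The class and field names are assumptions for illustration; regularity is modelled as an optional declared period rather than something detected from the timestamps, following the note above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional, Tuple


class DataType(Enum):
    TIME_DURATION = "time_duration"
    NUMERICAL = "numerical"
    TEXT = "text"
    LABELLED = "labelled"


@dataclass(frozen=True)
class DataPoint:
    timestamp: datetime
    value: float                 # always numerical, may represent a duration
    label: Optional[str] = None  # optional text
    note: Optional[str] = None   # optional text


@dataclass(frozen=True)
class DataSet:
    points: Tuple[DataPoint, ...]       # ordered oldest to newest
    data_type: DataType
    period: Optional[timedelta] = None  # None means irregular

    @property
    def is_regular(self) -> bool:
        return self.period is not None


raw = DataSet((DataPoint(datetime(2024, 1, 1), 80.0),), DataType.NUMERICAL)
daily = DataSet(raw.points, DataType.NUMERICAL, period=timedelta(days=1))
print(raw.is_regular, daily.is_regular)  # False True
```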

Functions

I will list here some of the most commonly requested data transformations or tools to facilitate users' common requests in the most versatile way:

Addition/Subtraction/Multiplication/Division

Input- data: [Irregular|Regular]&[Numerical], n: Number
Output- [Irregular|Regular]&[Numerical]
Description- The same operation applied to each data point e.g. adding 1 to all data points

Convert time to numerical

Input- data: [Irregular|Regular]&[Time], t: TimeUnit
Output- [Irregular|Regular]&[Numerical]
Description- Returns the number of time units each data point represents e.g. if the data point has value 00:03:00 and the time unit t represents minutes then the output would be 3

Convert numerical to time

Input- data: [Irregular|Regular]&[Numerical], t: TimeUnit
Output- [Irregular|Regular]&[Time]
Description- Returns the amount of time each data point represents e.g. if the data point has value 3 and the time unit t represents minutes then the output would be 00:03:00

PeriodicTotal

Input- data: [Irregular]&[Time|Numerical], p: Period
Output- [Regular<with period p>]&[Time|Numerical]
Description- Calculate the total of all data points per period p and return a regular form of the input data with period p.
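A possible reading of PeriodicTotal for a daily period, sketched in Python. The (datetime, value) tuples and the choice to emit a total of 0 for days without data (so that the output really is regular) are assumptions of this sketch, not something decided in this thread:

```python
from datetime import datetime, timedelta


def daily_total(points):
    """points: (datetime, value) pairs, oldest first.

    Returns one (date, total) pair per calendar day between the first and
    last data point, using 0 for days with no data.
    """
    if not points:
        return []
    totals = {}
    for ts, value in points:
        totals[ts.date()] = totals.get(ts.date(), 0) + value
    first, last = points[0][0].date(), points[-1][0].date()
    result = []
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        result.append((day, totals.get(day, 0)))
    return result


runs = [
    (datetime(2024, 1, 1, 8), 2),
    (datetime(2024, 1, 1, 20), 3),
    (datetime(2024, 1, 3, 12), 5),
]
print([total for _, total in daily_total(runs)])  # [5, 0, 5]
```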

Moving Average

Input- data: [Irregular|Regular]&[Time|Numerical], p: Period
Output- [Regular|Irregular]&[Time|Numerical]
Description- For each data point calculate the average of it and all data points prior to it that fall within the time period p.
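A straightforward (not optimised) Python sketch of that description. Whether a point exactly p old still counts is not specified above; this sketch assumes it does:

```python
from datetime import datetime, timedelta


def moving_average(points, period):
    """points: (datetime, value) pairs, oldest first."""
    result = []
    for i, (ts, _) in enumerate(points):
        # This point plus every earlier point no older than `period`.
        window = [v for t, v in points[: i + 1] if ts - t <= period]
        result.append((ts, sum(window) / len(window)))
    return result


data = [
    (datetime(2024, 1, 1), 2),
    (datetime(2024, 1, 2), 4),
    (datetime(2024, 1, 5), 9),
]
averaged = moving_average(data, timedelta(days=2))
print([value for _, value in averaged])  # [2.0, 3.0, 9.0]
```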

Time since

Input- data: [Irregular]&[Time|Numerical|Text], from: Optional<List<[Time|Numerical|Text]>>, to: Optional<List<[Time|Numerical|Text]>>
Output- [Irregular]&[Time]
Description- For every data point output a data point that represents the time since the last data point tracked. If from and to are not defined we simply find the time between each pair of data points. However the function can take two lists of operands (from and to) that allow you to determine the time between given values e.g. for every data point with label "lunch" or "dinner" get the time since the last data point marked "breakfast" or "lunch". If from is defined but to is not then you calculate for each data point that matches a value in from the time since the last data point of any value. If to is defined but not from then you calculate for each data point the time since the last data point that matches any value in to.
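The from/to rules are easier to check against a small sketch. This Python version works on (datetime, label) pairs and renames the operands to `from_labels` and `to_labels` (since `from` is a Python keyword); both choices are assumptions of this illustration:

```python
from datetime import datetime, timedelta


def time_since(points, from_labels=None, to_labels=None):
    """points: (datetime, label) pairs, oldest first.

    from_labels selects which points produce an output;
    to_labels selects which earlier points count as a reference.
    None means "any value".
    """
    result = []
    last_reference = None
    for ts, label in points:
        emits = from_labels is None or label in from_labels
        if emits and last_reference is not None:
            result.append((ts, ts - last_reference))
        if to_labels is None or label in to_labels:
            last_reference = ts
    return result


meals = [
    (datetime(2024, 1, 1, 8), "breakfast"),
    (datetime(2024, 1, 1, 12), "lunch"),
    (datetime(2024, 1, 1, 19), "dinner"),
]

# Time between each pair of consecutive data points: 4 hours, then 7 hours.
print([gap for _, gap in time_since(meals)])

# Time since the last breakfast, for lunches and dinners: 4 hours, then 11 hours.
print([gap for _, gap in time_since(meals, {"lunch", "dinner"}, {"breakfast"})])
```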

Delta

Input- data: [Irregular|Regular]&[Time|Numerical]
Output- [Irregular|Regular]&[Time|Numerical]
Description- For every data point get the difference in value between it and the last data point tracked

Accumulate

Input- data: [Irregular|Regular]&[Time|Numerical], p: Optional<Period>
Output- [Irregular|Regular]&[Time|Numerical]
Description- For every data point output a data point that represents the accumulated sum of this data point and all data points prior within the given time period p. If no period p is given then the period is regarded as infinite.
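A small Python sketch of Accumulate with the optional period. The tuple representation and the inclusive window boundary are assumptions of this illustration:

```python
from datetime import datetime, timedelta


def accumulate(points, period=None):
    """points: (datetime, value) pairs, oldest first.

    With period=None the window is unbounded (a plain running total).
    """
    result = []
    for i, (ts, _) in enumerate(points):
        total = sum(
            v for t, v in points[: i + 1]
            if period is None or ts - t <= period
        )
        result.append((ts, total))
    return result


work = [
    (datetime(2024, 1, 1), 1),
    (datetime(2024, 1, 2), 2),
    (datetime(2024, 1, 5), 3),
]
print([v for _, v in accumulate(work)])                     # [1, 3, 6]
print([v for _, v in accumulate(work, timedelta(days=2))])  # [1, 3, 3]
```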

Addition/Subtraction/Multiplication/Division

Input- data1: [Regular<p1>]&[t1: Numerical|Time], data2: [Regular<p2>]&[t2: Numerical|Time] where t1==t2 and p1==p2
Output- [Regular<p1>]&[t1]
Description- For each data point a in data1, find the data point b in data2 with the same time stamp and output a data point that is the result of the operation on a and b.
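Sketched in Python, the pairwise version could look like this. The name `combine`, the (date, value) tuples, and silently skipping timestamps that only appear in one data set are all assumptions of this illustration:

```python
import operator
from datetime import date


def combine(data1, data2, op):
    """Apply op to the values of points that share a timestamp."""
    values_by_time = dict(data2)
    return [
        (ts, op(a, values_by_time[ts]))
        for ts, a in data1
        if ts in values_by_time
    ]


beer_daily = [(date(2024, 1, 1), 2), (date(2024, 1, 2), 1)]
wine_daily = [(date(2024, 1, 1), 1), (date(2024, 1, 2), 3)]

print(combine(beer_daily, wine_daily, operator.add))
# [(datetime.date(2024, 1, 1), 3), (datetime.date(2024, 1, 2), 4)]
```

Note that with `operator.truediv` this sketch would raise `ZeroDivisionError` whenever a value in the second data set is 0, so a real implementation would need a policy for that case.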

Filter

Filter out

Open questions:

Data points may have some metadata. Right now I would say that is just the notes field but it is possible that this could be expanded on in the future, for example users have asked to add locations and images to their data points. We may need functions or grammar to allow us to define what specific information about a data set we are interested in when we pass it to a function. For example if the function can take text input are we interested in the multiple choice labels or the notes. Furthermore we want to make sure that we proliferate any metadata to the best of our ability through any function. For example if you just add 1 to all the data points then keeping the notes isn't an issue, but if you calculate daily totals it becomes more difficult. I don't think we need to confront this question at this stage though.

Right now the only function I have defined that converts irregular data to regular data is the Periodic Total. There are probably other desirable functions to convert irregular data to regular data like "last value per period" or "most common value per period" or "mean value per period"

This is obviously written in pseudocode that I made up as I went along so please let me know if this is not clear enough. It would be good to align with you on these functions. Are there any I have missed? Are there any that can be broken down into better fundamental functions? I will await your thoughts on this.

hhpmmd commented 3 years ago

Here are some of my thoughts in basically random order:

At first I wasn't sure why there is the distinction between regular and irregular data, but it became clear with the addition and so on between two datasets.
The filter functions may need to be adapted, since right now it is only possible to filter by value and not by timestamp or other things, right? Ideally a filter would take data as you defined it plus a condition, which is a function that takes a single data point and returns a boolean. We could offer factories for these functions like filter(my_data, timeIsBetween("Friday", "17:00", "Sunday", "23:00") ) where timeIsBetween could return such a filter function.
For the conversion to regular data something similar probably makes sense, something like aggregate(my_data, Week, median) and have functions for median, average, sum, min, max, latest, earliest and so on.
I think there is a union operation missing, for when I want to join reading_book_a and reading_book_b into reading.
Regarding the proliferation of metadata, maybe it makes sense for each new datapoint (coming out of a transformation function) to keep track of who its parents are (so when aggregating all the entries that were aggregated in that specific datapoint). If only one of the parents has a specific metadata, it can be adopted; if more than one have it there could be a separate view where you can view them all.
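A rough Python sketch of the filter factory, aggregate and union ideas from the list above. All names are hypothetical, and `time_is_between` is simplified to a time-of-day range rather than the weekday span in the example:

```python
import heapq
from datetime import datetime, time
from statistics import median


def time_is_between(start, end):
    """Factory: returns a condition that takes one data point."""
    def condition(point):
        ts, _ = point
        return start <= ts.time() <= end
    return condition


def filter_points(points, condition):
    return [p for p in points if condition(p)]


def union(*data_sets):
    """Merge time-ordered data sets into one time-ordered data set."""
    return list(heapq.merge(*data_sets, key=lambda p: p[0]))


def aggregate(points, bucket_of, reducer):
    """Group values by bucket_of(timestamp) and reduce each group."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(bucket_of(ts), []).append(value)
    return [(bucket, reducer(values)) for bucket, values in sorted(buckets.items())]


book_a = [(datetime(2024, 1, 1, 18), 1), (datetime(2024, 1, 10, 9), 10)]
book_b = [(datetime(2024, 1, 3, 21), 3)]

reading = union(book_a, book_b)
evenings = filter_points(reading, time_is_between(time(17), time(23)))
weekly = aggregate(reading, lambda ts: tuple(ts.isocalendar()[:2]), median)

print([v for _, v in reading])   # [1, 3, 10]
print([v for _, v in evenings])  # [1, 3]
print(weekly)                    # [((2024, 1), 2.0), ((2024, 2), 10)]
```

Passing the reducer (`median`, `sum`, `min`, `max`, ...) as an argument is what keeps `aggregate` a single function instead of one function per statistic.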

SamAmco commented 3 years ago

These are all good ideas. I will try to get back to you in more depth soon.

SamAmco commented 3 years ago

In terms of finding other languages to draw on, I'm not sure that what we are trying to achieve here is quite close enough to any existing language to warrant using it. Some of our constraints are as follows:

We want a language that is simple and concise. It is only likely to be used to perform a few small tasks so we don't need lots of complex syntax that might enable other features. For this reason taking other languages may come with added weight that we don't need and confuse the user. I suspect we need a way to declare a data set, reference a data set, call a function, reference an internal constant and accept a hard-coded parameter.
Users are on mobile devices so we want to avoid using any symbols or syntax that is irritating to write on a mobile keyboard as much as possible
Users are most likely to have met some simple syntax like basic excel or google sheets functions but less likely to have met any real programming language. The less programming experience we can assume the better.
The type of data we are working with is quite specific so we can sacrifice some language flexibility for readability.

With these things in mind I have one idea. Suppose a function was composed of lines of the form:

<variable_name> = Function( .variable1, .variable2, CONSTANT)

Where the . symbol is used to reference a variable which may be a variable previously declared in this function or a data set. Constants (like WEEK or MONTH for example) are in caps. My hope here is that we can avoid any use of nested functions. The last line of the function could be simply of the form:

Function( .variable3, ... )

or

.variable3

But in any case the last line represents the final output of the function.

The . and , keys are normally easy to reach on a mobile keyboard which makes this form of referencing easier on a mobile. I also think there will be some UI work to do. For example it is probably necessary to suggest a list of available data sets when the user types a . and I think some syntax highlighting would go a long way. We will also need good error reporting etc which means the interpreter will need a good way of reporting back to the caller.

Some examples:

Union(.reading_book_a, .reading_book_b)

total_per_month = PeriodicTotal(.distance_run_per_day, MONTHLY)
Delta(.total_per_month)

filter_lunches = valueIs("Lunch")
filter_free_lunches = valueIs("Free lunch")
lunches = Filter(.meals, .filter_lunches)
free_lunches = Filter(.meals, .filter_free_lunches)
lunches_daily = PeriodicTotal(.lunches, DAILY)
free_lunches_daily = PeriodicTotal(.free_lunches, DAILY)
free_lunch_ratio_daily = Divide(.free_lunches_daily, .lunches_daily)
.free_lunch_ratio_daily

And so on. Does that make sense? Do you see any issues with this syntax?
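To get a feel for how little machinery this line-based form needs, here is a toy evaluator in Python. It is only a sketch under several assumptions: arguments are split naively on commas (so strings containing commas would break), constants are passed through as their names, and the `Scale` and `Delta` functions are stand-ins operating on plain lists of numbers:

```python
import re

# Optional "<name> = " prefix, then either Function(args) or a bare .reference
LINE = re.compile(r"^(?:(\w+)\s*=\s*)?(?:(\w+)\((.*)\)|\.(\w+))$")


def parse_arg(text):
    text = text.strip()
    if text.startswith("."):
        return ("ref", text[1:])
    if text.startswith('"') and text.endswith('"'):
        return ("str", text[1:-1])
    if text.isupper():
        return ("const", text)
    return ("num", float(text))


def run(source, data_sets, functions):
    """Evaluate each line in order; the last line's value is the output."""
    env = dict(data_sets)
    result = None
    for raw in source.strip().splitlines():
        match = LINE.match(raw.strip())
        if match is None:
            raise SyntaxError(f"cannot parse line: {raw!r}")
        target, func, arg_text, bare_ref = match.groups()
        if bare_ref is not None:
            result = env[bare_ref]
        else:
            args = [parse_arg(a) for a in arg_text.split(",")] if arg_text.strip() else []
            values = [env[v] if kind == "ref" else v for kind, v in args]
            result = functions[func](*values)
        if target:
            env[target] = result
    return result


functions = {
    "Scale": lambda xs, n: [x * n for x in xs],
    "Delta": lambda xs: [b - a for a, b in zip(xs, xs[1:])],
}
program = """
doubled = Scale(.weight, 2)
Delta(.doubled)
"""
print(run(program, {"weight": [80, 79, 81]}, functions))  # [-2.0, 4.0]
```

Because there is no nesting, every line is either an assignment or a final expression, which also makes it easy to report an error against a specific line.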

hhpmmd commented 3 years ago

Regarding the use of a custom language: the language you describe is very simple and probably not that difficult to port to ANTLR. The other approach would be to take a language with way more features and just not handle the things we don't want. So, just for example, if we ignore the . notation and just pass datasets by name only, I don't see why we can't use (again for example) a Python parser/lexer and just use a subset of its functionality and report errors if the entered code is outside that subset. We don't even have to tell the users that we are using the Python parser/lexer so they don't get false expectations, and we don't have to write our own grammar in the end. However, extensions to the language would then be either very easy or very hard, depending on whether they can be parsed by the (e.g.) Python parser or not. I'll probably do some experiments regarding this during the weekend.
So the dots represent references, but I'm not sure if there are cases where we pass un-referenced data to functions. It is nice for the autocompletion though.
How do we deal with division by zero? NaN/infinity? How do we plot such values?
Datasets don't have a fixed name right now. I guess we would have a UI element mapping datasets to variable names above the editor or something like that?

SamAmco commented 3 years ago

Since you're implementing the language I don't mind too much if you want to take it in your own direction with the given considerations in mind, the above was just an idea. However the advantage of having some kind of prefix character to the data set name (known in the code as a feature) is that feature names can be any kind of nonsense that might mess up the parsing, so you will no doubt need to have some kind of escape sequences to allow users to specify features by name. In contrast if feature names are always prefaced by a . for example then as soon as the user types the . we can present them with a list of features to select from and probably even behind the scenes we can use the feature ID in the function's text rather than its name (and just display its name to the user). Does that make sense?
Can you give an example of un-referenced data? I'm not sure what you mean exactly?
Good question. This is not really an issue I've had to deal with in the code so far so I'm not sure, but that's probably a detail we can solve when we get there. Probably we just return an error in the function UI so the user doesn't have to create a graph to realise it's broken.
The only thing unique about a feature is its ID so like I say maybe we use IDs under the hood and just display the names (with some kind of highlighting) in the UI.
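A small Python sketch of the "IDs under the hood, names in the UI" idea. The stored form `.#<id>` is invented for this example, as is the helper name:

```python
import re


def to_display_text(stored_text, names_by_id):
    """Replace each stored '.#<id>' reference with '.<current feature name>'."""
    return re.sub(
        r"\.#(\d+)",
        lambda m: "." + names_by_id[int(m.group(1))],
        stored_text,
    )


names_by_id = {7: "weight", 12: "distance run per day"}
stored = "PeriodicTotal(.#12, MONTHLY)"

print(to_display_text(stored, names_by_id))
# PeriodicTotal(.distance run per day, MONTHLY)
```

Storing the ID means renaming a feature would not break existing functions, and names containing spaces or symbols never reach the parser.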

amiguet commented 3 years ago

Hi there,

I just discovered track-and-graph and I like it very much! However, I quickly looked for a way of processing the data before graphing it and I ended up here. I think that would be a great enhancement of this (already great) app.

I just wanted to suggest an alternative approach to this problem. Instead of a textual language like suggested above, it could be possible to use a graphical language in the spirit of scratch or puredata.

So something like

Union(.reading_book_a, .reading_book_b)

could look like

[Image: union]

And constants like MONTHLY above could simply become a dropdown menu of the PeriodicTotal box.

I'm a big fan of textual languages and usually don't like graphical ones... on the desktop. However, typing code on a mobile device is really cumbersome.

In addition to saving a lot of typing, such an approach would avoid syntax errors and would prevent most possible errors in formulas (by only allowing boxes to be connected when that makes sense). It might also be a bit easier for non-programmers to use. And some graphical languages have boxes with several outputs to solve the "what specific information about a data set we are interested in when we pass it to a function" question.

It would probably require a little bit more code to make it work, though, although there might be libraries out there that could provide a significant part of what's needed.

Anyway, I don't have the time or the skills to implement it myself, so that's just a suggestion. I would appreciate a data transformation language, whatever form it takes. But as there is a discussion about the best way to do it, I thought I might add my 2¢...

SamAmco commented 3 years ago

Hi @amiguet .. I appreciate that this could be a superior user experience however it's simply not really feasible with the time I have to dedicate to this right now. The advantage of a language is that it can be developed, iterated and modified much more quickly. I suspect it's likely that most people won't need to set up too many functions so it probably won't be too much of a burden to the user. In any case I don't really have time to work on this at all, I am relying on @hhpmmd who said above that they are less comfortable with the UI side of things, so a UI heavy solution is not likely for now.

amiguet commented 3 years ago

I completely understand. If I were to develop such a feature, I would definitely select the language version for the same reasons. I just wanted to make sure that this is a conscious choice and not a default solution. Anyway, many thanks to @hhpmmd for considering implementing this, that would make a great feature!

hhpmmd commented 3 years ago

I agree with the points mentioned. I am vaguely familiar with Scratch and looked at some images of Pure Data. To me they look like they would be hard to manage without a mouse. @amiguet are you aware of any graphical programming interfaces that are designed for mobile/phone usage? I guess you could do some recursive UI where there is only ever one node displayed and you have ways to navigate through the node tree. Anyway in the end I feel like it's not really an either/or, but rather you would have a textual language and then a graphical one that builds upon the textual one (or at least the grammatical structure of it). I also hope that it's possible to do a lot with suggestions / auto-complete to guide the user.

hhpmmd commented 3 years ago

As an update from my side: I did some experiments and I think it is definitely necessary to write a custom ANTLR grammar, but that also seems to be a lot easier than I thought. I'm also busier than I thought atm but I should be able to finish this by mid-April. I also found this project https://github.com/massivemadness/Brackeys-IDE which supplies an editor with suggestions and highlighting etc. The info from the readme looks good, haven't tested it yet though.

SamAmco commented 3 years ago

You make a great point that we would need a language anyway under the hood. Brackeys IDE looks like an excellent find. Sounds like you're doing good work, don't stress too much about how much time you have, it will probably take me a while to get the UI and docs side done too. If you want to develop this together incrementally then what probably makes the most sense is for me to create a branch for this feature and we both create PRs for that branch and review/merge when we have time. I think we probably want to break the work up into chunks but also don't want to have long running development on master for this.

Let me know when you're ready for that, but again no rush.

hhpmmd commented 3 years ago

Yeah, having a separate branch that we both work on sounds like the best way to move forward. When you find the time it would probably make sense for this branch to have a different starting activity with basically just a text input and output field, which then allows for some easy prototyping in regards to adding more UI elements and so on. I guess that would be the way with the least upfront UI work.

SamAmco commented 3 years ago

Ok I will take a look at some point.

hhpmmd commented 2 years ago

Hi @SamAmco uhm :point_right: :point_left: if you were to create a branch called feature/data_transformation, I think I have some code I would like to push on there :flushed:

SamAmco commented 2 years ago

Hey :) created that branch for you

dummifiedme commented 2 years ago

Does this issue concern data operations like

Plotting: Tracker1 - tracker2

Or is it something else?

I was wondering if we could plot basic operations between the variables (trackers) in a plot.

SamAmco commented 2 years ago

@dummifiedme yes that is the idea

zanovis commented 1 year ago

Is this feature in development currently?

SamAmco commented 1 year ago

Not just yet. Honestly I want to do it but I want to do it well. There's a huge amount to consider here, it's a very complex feature. Currently I'm working on other features and bug fixes trying to get the basics good before working on this feature as I expect most people won't even use it. When I do implement it though I'm actually thinking that going with a UI driven approach rather than a text driven one might be better. Since I'm now using Compose for UI I think this might be do-able.

I expect this feature is at least a year if not more away so please don't hold your breath. If you want more powerful analytic abilities in the meantime you might want to set up a system where you export a CSV to Google Drive and have a spreadsheet that runs statistics, for example. Depending on how technically capable you are there's really nothing stopping you. When you do a backup you are just exporting an SQLite database. You can write SQL queries around that data or get clever with Python and scikit-learn or even go full on machine learning with PyTorch or something.
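As a very rough illustration of the SQL route, here is a Python sketch using the built-in `sqlite3` module. The `data_points` table and its columns are hypothetical stand-ins, not the real backup schema (inspect the exported file to find the actual table names), and an in-memory database is used in place of the backup file so the example runs on its own:

```python
import sqlite3

# Stand-in for sqlite3.connect("<path to your exported backup>")
conn = sqlite3.connect(":memory:")

# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE data_points (feature_id INTEGER, timestamp TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO data_points VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T08:00:00", 2.0),
        (1, "2024-01-01T20:00:00", 3.0),
        (1, "2024-01-02T09:00:00", 1.5),
    ],
)

# Daily totals for one tracker.
rows = conn.execute(
    "SELECT date(timestamp) AS day, SUM(value) "
    "FROM data_points WHERE feature_id = ? "
    "GROUP BY day ORDER BY day",
    (1,),
).fetchall()
print(rows)  # [('2024-01-01', 5.0), ('2024-01-02', 1.5)]
```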

For these reasons I would say something as simple as automatic backup is a higher priority than this. There are many features which I don't feel I can let people wait for while I work on this, but I will get there eventually.

SamAmco / track-and-graph