KarthikRIyer / swiftplot

Swift library for Data Visualization :bar_chart:
Apache License 2.0
400 stars 39 forks source link

Add a heatmap #116

Closed karwa closed 4 years ago

karwa commented 4 years ago

This still needs some cleaning-up. It introduces a lot of new things:

Some things I know I need to work on before this is ready to merge:

karwa commented 4 years ago

Test failures are expected. The plots are now slightly larger because of changes to the way the area is calculated.

odmir commented 4 years ago

I have to say I took a peek last year at your work on this in your fork and I got pretty excited about this!

I think I like LayoutElement better, maybe another alternative could be LayoutComponent? What do you think about Normaliser instead of Interpolator, would it make sense? Since usually when you say you are normalising some data, what you do is map it to values between 0 and 1 or some other definition of normalisation. I like the functional approach, I think though we should eventually simplify the setting of labels and stuff by making things like plotTitle and plotLabel unnecessary to the end user, by using nice setters directly on the graphs.

Collection slicing by filter closure?

What do you mean by this?

Overall I really like it! 🎉

karwa commented 4 years ago

I think I like LayoutElement better, maybe another alternative could be LayoutComponent?

LayoutComponent is nice.

What do you think about Normaliser instead of Interpolator, would it make sense? Since usually when you say you are normalising some data, what you do is map it to values between 0 and 1 or some other definition of normalisation.

I also have a prototype of a bar-graph which is generic to any data. For example, if you have a collection of Department objects, you can make a BarGraph<[Department]> directly and set an adapter for the property you want to graph via key-paths (e.g. \.employees.count).

Removing the constraint that the data is a number and replacing it with an abstract way to get a number, we can allow the graphs to use higher-level data types. This has some really nice side-effects for things like data coordinates and formatting axes markers (because we're now a graph of Departments instead of a graph of Ints).

In order to make this work, every plot that is generic like this will need an adapter. Because of that, I made an Adapters namespace and renamed the type Adapters.Heatmap.

One downside of this generic-ness is that to do things like stacked datasets and multiple series, we need to create enormous chains of generic types. For example:

StackedBarGraph<StackedBarGraph<StackedBarGraph<BarGraph<[Int]>, Data>, [Float]>, LazyMapSequence<(ClosedRange<Int>), MyStruct>>

Nobody wants to write that, so it basically forces us to use the functional-style API. It's worth pointing about that SwiftUI also has this issue, but uses the non-standard "function builders" DSL instead. Maybe that would also be something for us one day.

I like the functional approach, I think though we should eventually simplify the setting of labels and stuff by making things like plotTitle and plotLabel unnecessary to the end user, by using nice setters directly on the graphs.

Definitely. That will be source-breaking though, so maybe it's good if we tag a stable release before doing that.

Collection slicing by filter closure?

This is about the heatmap's functional API. We have one for 2D collections, and one for any collection, which slices rows of a fixed width. This refers to adding another variant, which slices rows according to some line-breaking closure.

I've since dropped this idea. It's good to keep the list of overloads down, so users are not overwhelmed. I'm not even 100% sure about the fixed-width slicing version; it's cool, but maybe it would be better as a general-purpose extension on Collection instead.

KarthikRIyer commented 4 years ago

I like LayoutComponent.

I do like the functional API, but I think that there will be people who aren't comfortable with the functional approach(including me 😕 ). Is there a way to support both kinds of API?

.heatmap(interpolator: .linear) { I find interpolator here to be more intuitive, so I think as long as the argument label remains the same we can use HeatmapAdapter.

KarthikRIyer commented 4 years ago

Ohh... just saw the comments in the code. Would mapping be better as an argument label?

karwa commented 4 years ago

I do like the functional API, but I think that there will be people who aren't comfortable with the functional approach(including me 😕 ). Is there a way to support both kinds of API?

Sure. It would be more code, but I think it's possible. I'm not entirely sure how it would work with plots that compose multiple datasets, like BarGraph, but I'm starting to think it may be doable.

One benefit of the initialiser approach is better integration with SwiftUI. SwiftUI has these protocols NSViewRepresentable and UIViewRepresentable, which allow ordinary objects to become part of the view hierarchy. This means you can do something like:

            HStack {
                Text("SwiftPlot meets SwiftUI").font(.headline)
                Heatmap(values) {
                    $0.plotTitle.title = "\(values.count) values"
                    $0.colorMap = ColorMap.fiveColorHeatMap.lightened(by: 0.35)
                    $0.showGrid = true
                }.frame(height: 600)
            }.padding(10)

Would mapping be better as an argument label?

mapping makes a lot of sense, sure.

odmir commented 4 years ago

This refers to adding another variant, which slices rows according to some line-breaking closure.

That idea is interesting, but could you explain more about the use cases? Like I imagine it would be relatively easy to to inadvertently slice a collection where the resulting rows are of different lengths, what would happen then, can we still make a HeatMap out of that? or am I misinterpreting this?

karwa commented 4 years ago

@odmir rows of different lengths are fine. The use-cases aren't that convincing though, and the standard library already provides this through a .split function, so I removed it.

odmir commented 4 years ago

Ok, cool, lately I've been playing around with your branch because I had an idea I wanted to try out, so now I know in much more detail how the Heatmap works and understand what you mean much better!

odmir commented 4 years ago

Ok so now I'd like to tell you about that idea I had.

I thought the Heatmap handling of different types of data could be improved by instead of depending on SeriesType being two nested sequences, it could be any type conforming to a protocol that I'll call Heatmappable. This protocol would require of conforming types that they provide a width, height and a method that constructs a type conforming to another protocol which I'll call HeatmappableIterator (it does not conform to the stdlib IteratorProtocol). This "iterator" protocol only has one required method that returns the row, column and the value at those coordinates when called. The expected behaviour is that upon instantiation of a type that conforms to this protocol, we can repeatedly request new values and their coordinates until every value is exhausted. There is no requirement on the order the values must be returned, although it should be reasonable that no two values for the same coordinates should be given, but the other way around is ok: for this type to never return values for some coordinates. We would require, though, that every conforming type returns a Float value between 0 and 1.

After having this in place we could create types conforming to these protocols that provide the functionality of "iterating" over 1D data or 2D data. One step further is to move the Mapping related logic from the Heatmap to the types that conform to these protocols that need to interpolate values or get specific properties out of them with KeyPaths.

This approach would have several advantages:

Maybe there are other advantages I didn't think about... I've already said these types must return Float values between 0 and 1, but unless their objective is solely to plug some data type into the Heatmap, most of these types should accept other conforming types as input and expect the received values to also be Floats ranging from 0 to 1.

This opens up some really interesting possibilities and gives us a lot of flexibility:

If there is data that is very expensive to get or if we have a big tree of these types connected and the performance is starting to suffer, we could even do such things as create a conforming type that caches requested squares and that can be plugged anywhere that would make sense into this "tree".

The most commonly used types or "trees" of types could be created by static methods on the data directly like in extensions to the SequencePlots type so that the user has a nice experience using these for the common cases.

Ok I'm sorry if you've gone through all that and got tired of all these shenanigans. I really think this approach opens up a lot of possibilities and I'd like to hear your opinions on this 😁.

As I said before, I wanted to see how this would look like implemented, so I branched from the last commit in this draft and started implementing it just to get an idea. I checked and it draws the same graphs as the tests before, including the SVG's. If you'd like to see what it looks like, it is in my branch: odmir:heatmap_alternate. If you feel like following this approach, feel free to just copy it over to your branch!

@karwa @KarthikRIyer

karwa commented 4 years ago

It's an interesting idea. The thing is, if we reduce the Heatmap down to the most basic things that it needs, it needs a 2D dataset. Even if it used a HeatMappable abstraction instead, it would still ask for 2 dimensions; you would just be using an adapter.

So it turns out these kinds of data transformations aren't really related to the Heatmap itself - they are generic Collection-y things. It's similar to the idea of the filter closure - in theory, you could use that to slice any kind of data in any way... and that's why it's in the standard library. I even think the fixed-width slicing could maybe be promoted to a generic collection algorithm. This is the thing - do we really want to implement all that logic again and again for every 2D plot? Not really.

I'm not sure that the HeatMappable abstraction really gains us much in practice. You can build mosaics and aggregates and everything else using generic collection algorithms:

    let hm = (-20..<20).lazy.map { x in
        (-20..<20).lazy.map { y -> Int in
            return (x * x) + (y * y)
        }
    }.plots.heatmap()
    try renderAndVerify(hm)

At the same time, I do think it's important that we use standard protocols like Sequence and Collection. That means users can directly graph things like Array, but also important types like Tensor (which can also be 2D), and slices of the above types. All of them will already conform to Sequence and Collection. That's not a data storage requirement, really - you can make a trivial wrapper around almost anything and allow it to conform to Sequence. And if there are cool data-layout features that you like in 3rd-party libraries, like mosaics or random distribution generators or whatever, they'll almost certainly also be using the standard protocols.

Plots are not usually drawn repeatedly without the data having changed. Even in SwiftUI, it only triggers when a dataset stored as a @State property changes. I've been thinking about ways to isolate size-independent data, but that's an internal optimisation between the plots and the GraphLayout.

odmir commented 4 years ago

Thank you for opening my eyes... 😅 I was quite blinded thinking that the approach in this draft did not allow for the same things as my approach, I missed that it is trivial to make other types conform to Sequence and thus my performance arguments around reallocating memory and copying over data don't hold true. Actually, after submitting the last comment I rethought about the applications I wrote about and it sounded very similar to stuff you would want to do to things like tensors and similar to operations that CNN's and such already use today! In my branch I created two implementing types that take generic 1D sequences and 2D sequences so any data type that meet those requirements would be usable as is, but given the above realisation, there would be no advantages to my approach. I can see why all these applications actually make much more sense in its own library geared towards data treatment like TensorFlow, they should not at all be in Heatmap itself.

I do think though we should eventually figure out a way to not do unnecessary work in layout every time we want to draw, for applications where the data never changes but the graph is resized.

Anyways, I think this was a good exercise for me to tinker more with generics, PAT's and with general API design and underlying architecture. Thanks for the time and sorry if I made you waste a lot of time reading through my last comment and/or code in the branch!

odmir commented 4 years ago

Designing API's and architecting code is really new to me and I can see that I really need more experience. Thank you for helping me through that, it has been a really interesting experience!

odmir commented 4 years ago

By the way, do you think all the extensions on SequencePlots should be moved to their own file? Right now I think it can get a bit messy with extensions on Heatmap itself on the same long file.

karwa commented 4 years ago

Ok, I think this is ready.

The changes to reference files are because of the new LayoutComponents elements. The plot size is calculated slightly differently, and the markers are calculated more consistently. Also, I added a small amount of padding underneath the label on the x-axis.

Next steps are to refine the API to make certain things easier. For example, consider a model which looks like:

enum Produce {
        case lettuce
        case cucumbers
        case tomatoes
    }

    struct Farmer {
        var name: String = ""
        var harvests: [Produce: Int] = [:]
    }

let data: [Farmer] = // ...

Trying to heatmap this is awkward, because Farmer isn't a Collection (and shouldn't be), but it contains a Collection. You can use a .lazy.map { $0.harvests.values } transform from the standard library, but then you have no convenient way to use the Farmer's name field when formatting axis labels.

Once we figure out a solution for that, we can add formatters for the heatmap's plot markers.

odmir commented 4 years ago

I noticed the x tick labels don't seem to be correctly aligned for the AGG renderer (some issue with the text height calculation in AGG renderer maybe?):

testHeatmap

odmir commented 4 years ago

Or maybe in the future they should be aligned by their baseline?

karwa commented 4 years ago

I noticed the x tick labels don't seem to be correctly aligned for the AGG renderer (some issue with the text height calculation in AGG renderer maybe?):

testHeatmap

That's intentional. See: Heatmap.swift, line 32. That's what you want for some heatmaps (example).

It should be made public so the user can decide, but that can happen in a follow-up PR.

odmir commented 4 years ago

Sorry, I noticed that is intentional, but that's not what I meant. I'm talking about the vertical alignment, I drew the red lines so it is easier to see, this does not happen with either Quartz or the SVG renderer as far as I can tell.

odmir commented 4 years ago

For example, 1 and 2 are one pixel lower than 0 and 3, as well as most of the other numbers: some are one pixel lower than the others.

karwa commented 4 years ago

hmm, good catch. I'll look in to that.

odmir commented 4 years ago

Next steps are to refine the API to make certain things easier. For example, consider a model which looks like:

enum Produce {
        case lettuce
        case cucumbers
        case tomatoes
    }

    struct Farmer {
        var name: String = ""
        var harvests: [Produce: Int] = [:]
    }

let data: [Farmer] = // ...

Trying to heatmap this is awkward, because Farmer isn't a Collection (and shouldn't be), but it contains a Collection. You can use a .lazy.map { $0.harvests.values } transform from the standard library, but then you have no convenient way to use the Farmer's name field when formatting axis labels.

Once we figure out a solution for that, we can add formatters for the heatmap's plot markers.

For this I think as a user you should easily be able to make heat maps in any of the following ways:

// 1. Puts each farmer in their own column and `Farmer.harvests.values` in each row.
Heatmap(data, rows: \.harvests.values, mapping: .linear)
// 2. Puts each farmer in their own row and `Farmer.harvests.values` in each column.
Heatmap(data, columns: \.harvests.values, mapping: .linear)

// 3. Same as 1 and 2 but specifying optional row and column labels.
Heatmap(data,
        rows:         \.harvests.values,
        rowLabels:    \.harvests.keys,
        columnLabels: \.harvests.keys,
        mapping: .linear)

// We extend `Farmer` to conform to a certain protocol `Heatmappable`. This would be useful if we are using `Sequence`s of `Farmer`s a lot to produce `Heatmap`s.
extension Farmer: Heatmappable {
        static var heatmapRows: KeyPath<Self, [Produce: Int]> { \.harvests }
        static var heatmapRowLabels: KeyPath<Self, Dictionary<Produce, Int>.Keys> { \.harvests.keys } // this could be optional
        static var heatmapColumnName: KeyPath<Self, String> { \.name } // as well as this
}

// 4. Since `Farmer` conforms to the special protocol, we know where the data is, along with the labels.
data.plots.heatmap()
// 5. Should do the same as above but swap the rows with the columns.
data.plots.heatmap(swapped: true)
// 6. This should work, overriding the defaults possibly specified by the `Farmer` conformance to `Heatmappable`.
data.plots.heatmap(columnNames: [...], rowNames: [...])
// 7. This would also be possible after conforming to the protocol.
Heatmap(data)

Maybe if we can we should also have default conformances for Dictionary so that we can automatically get to the underlying data.

Maybe I'm missing more ways to use it. What do you think?

karwa commented 4 years ago
// 1. Puts each farmer in their own column and `Farmer.harvests.values` in each row.
Heatmap(data, rows: \.harvests.values, mapping: .linear)
// 2. Puts each farmer in their own row and `Farmer.harvests.values` in each column.
Heatmap(data, columns: \.harvests.values, mapping: .linear)

While I think it is important both row and column-based data can be charted, we need to consider how this would be implemented:

The first option is a generic data-formatting utility, which I would expect from TensorFlow or some Swift version of Pandas, but is likely out-of-scope for Swiftplot. We can leave space for the second option if it appears to be a strong use-case.

// We extend `Farmer` to conform to a certain protocol `Heatmappable`.
// This would be useful if we are using `Sequence`s of `Farmer`s a lot to produce `Heatmap`s.
extension Farmer: Heatmappable {
        static var heatmapRows: KeyPath<Self, [Produce: Int]> { \.harvests }
        static var heatmapRowLabels: KeyPath<Self, Dictionary<Produce, Int>.Keys> { \.harvests.keys } // this could be optional
        static var heatmapColumnName: KeyPath<Self, String> { \.name } // as well as this
}

I don't like this approach because it locks Farmer to only be heat-mapped in one way. What about if there were lots of collection-typed fields on Farmer, and I wanted to make graphs of all of them? Also, it's not convenient to have to conform all your types to Heatmappable, Histogrammable, LineChartGraphable, etc.

Maybe if we can we should also have default conformances for Dictionary so that we can automatically get to the underlying data.

Dictionary is actually really awkward. One thing I hit during my testing: it makes no guarantees about iteration order for its key-value-pairs. Two equal dictionaries might even store their contents in different orders. So it's really bad to try and graph a 2D collection when the last dimension is a dictionary.

karwa commented 4 years ago

I have the sketch of a solution to the "farmer problem", but it needs some work. It shouldn't hold up this PR, though.

I'll look in to the drawing issues.

odmir commented 4 years ago

Yes, I agree, we should move the discussion to an issue or something.

karwa commented 4 years ago

OK, so I've found the source of the weird rendering with the heatmap: sometimes AGG considers the markers to have size 9, and sometimes size 10. This is because some of them descend slightly below the baseline. I believe this doesn't affect the other examples because their markers happen to use longer strings.

We should actually align all the markers by baseline, but we don't calculate or keep that information today. Doing that would improve the quality of our plots substantially, but it would be a lot of work and this patch is already big enough. That can be left as an improvement for the future - actually, Heatmap is doing nothing wrong. It just exposed a bug somewhere else.

By the way, the .swift-format file is from TensorFlow. It happens to match the style for most of this project (4 space indenting), so I've included it and run the formatting tool (only on files created in this PR).

So I think that's all the feedback addressed. @KarthikRIyer, anything to add?

odmir commented 4 years ago

Nice, thank you for the format file as well!

KarthikRIyer commented 4 years ago
var cgColor: CGColor {
            if #available(OSX 10.15, iOS 13.0, *) {
                return CGColor(
                    srgbRed: CGFloat(r),
                    green: CGFloat(g),
                    blue: CGFloat(b),
                    alpha: CGFloat(a))
            } else {
                var tuple = (CGFloat(r), CGFloat(g), CGFloat(b), CGFloat(a))
                return withUnsafePointer(to: &tuple) { tupPtr in
                    return tupPtr.withMemoryRebound(to: CGFloat.self, capacity: 4) { floatPtr in
                        return CGColor(colorSpace: RGBColorSpace, components: floatPtr)!
                    }
                }
            }
        }

Why do we have two methods to get CGColor?

karwa commented 4 years ago

I've started going through the changes and it'll take a little more time for me to complete that. The changes I've gone through till now look really good and interesting! There are probably some more things I mightn't understand because I'm just a basic swift user.

Feel free to ask if there's anything that's unclear.

Also, how do you use the swift-format file? I've used this before https://github.com/apple/swift-format

Yes, it's the same tool; the file is just stores the configuration for the project (so you can just run swift-format -i ./Sources/SwiftPlot/xxx.swift from the project root, and it detects the settings). It's really helpful for me, because I usually use 2-space indentation, but this project uses 4-space. As a result, it's kind of mixed over the project. I think that and the line-length are the only configuration settings that actually change anything for us.

Why do we have two methods to get CGColor?

There's only one computed property, but it branches depending on the SDK version, The first one (with SRGB values) is the only one available on some platforms (e.g. tvOS). But it's also quite new, so older platforms don't have it and have to use the colorspace/components initialiser as a fallback.

KarthikRIyer commented 4 years ago

There's only one computed property, but it branches depending on the SDK version, The first one (with SRGB values) is the only one available on some platforms (e.g. tvOS). But it's also quite new, so older platforms don't have it and have to use the colorspace/components initialiser as a fallback.

@karwa I understood this but was wondering why we needed withMemoryRebound one place and not the other.

karwa commented 4 years ago

@KarthikRIyer The fallback initialiser requires a C-array of components (i.e. a pointer to the first element). See: https://developer.apple.com/documentation/coregraphics/cgcolor/1455927-init

KarthikRIyer commented 4 years ago

Ahh I see. I was thinking about https://developer.apple.com/documentation/coregraphics/cgcolor/1455631-init which is available 10.5+ but didn't notice that the other init was available 10.3+.

karwa commented 4 years ago

Yes, but the main motivation is for iOS, tvOS, etc. The 10.5+ initialiser isn't available at all on, say, iOS 12.

KarthikRIyer commented 4 years ago

Understood @karwa .

I went through the rest of the code and have got a gist of what's going on. Thanks a lot! Merging.

Also do you think it's a good time to mark a stable release (after merging the bracket annotation of course).

karwa commented 4 years ago

@KarthikRIyer yes, I think it would be a great idea to mark a stable release!

There are some areas where we could make the API simpler to use or add features, but not in a compatible way. Having a stable version means existing code can keep working, even if later releases modify the API.

For example, we use Label to represent the axes labels for layout, and they each have an independent size/color, because it gets used in multiple places and they all measure and draw independently from each other. The existing API combines the size and color of the axis labels, so users couldn't make a red x-axis label and a blue y-axis label, or have them each be different sizes if they wanted to, even though we can support that.

Similarly, ScatterPlot could use the new ColorMap, but not in a way that preserves its current API.

KarthikRIyer commented 4 years ago

Great! Let's release once @WilliamHYZhang's PR is merged.

KarthikRIyer commented 4 years ago

https://gist.github.com/KarthikRIyer/72074b291ccf83c07e0ce9579041e477

Here's a draft of the release notes for our next release. Could you'll review it?

Also @karwa @boronhub @Qwerty71 could you please let me know your full names so that I can add them to the release notes? Also if it's fine with you could you share your twitter handles? I'd like to mention you'll in the post for the release.

@anigasan @boronhub @karwa @odmir @Qwerty71 @WilliamHYZhang

KarthikRIyer commented 4 years ago

Also @karwa for the iOS/tvOS/watchOS support, would it be possible for you to provide an image with the plot on all three platforms? The current image just shows iOS and watchOS.

WilliamHYZhang commented 4 years ago

The link seems to be broken on my end.

KarthikRIyer commented 4 years ago

Link updated @WilliamHYZhang.

Is there a way to share draft releases?

WilliamHYZhang commented 4 years ago

Thanks @KarthikRIyer, draft release looks awesome!

areebg9 commented 4 years ago

Hi! The changelogs look brilliant! The project as a whole looks very promising!

My name is Areeb Gani and my twitter is @areebg9.

I'm also planning on finishing the polar plot and other github-related activities soon; I've been busy with many math competitions recently so I haven't found time to do much programming in the last 2 weeks.

boronhub commented 4 years ago

@KarthikRIyer this looks great, I'm really interested to see how this project develops! My nameis Dhruv Baronia.

karwa commented 4 years ago

@KarthikRIyer My full name is Karl Wagner

I'll make a image showing tvOS support later today. Also I could clean up and release a version of the AppKit/UIKit/SwiftUI bindings.

I think it's better to keep those bindings separate for now, because they're still kind-of experimental (haven't been tested thoroughly) and there are a number of things I would like to change with the interface.

KarthikRIyer commented 4 years ago

This seems good. I could add a link to the bindings release in the release notes as well as the docs, for now.

KarthikRIyer commented 4 years ago

I'll make a image showing tvOS support later today. Also I could clean up and release a version of the AppKit/UIKit/SwiftUI bindings.

@karwa any updates on this?