fslaborg / Deedle

Easy to use .NET library for data and time series manipulation and for scientific programming
http://fslab.org/Deedle/
BSD 2-Clause "Simplified" License
933 stars 195 forks source link

Abstract address #283

Closed buybackoff closed 9 years ago

buybackoff commented 9 years ago

This is to make Address abstract

Currently, Address is defined as type Address = int64 and in many places (not only linear implementation) int64 is used where Address is implied. Aliasing makes it possible. Current definition of address itself implies that it starts with zero, is linear and continuous - effectively the same int32/64. However, an abstract address is just a pointer to get a value from a vector.

A better definition of address is anything with the following properties:

An abstract address implements the following interface:

type IAddress<'T when 'T :> IAddress<'T>
              and 'T : comparison 
              and 'T : struct 
              and 'T : (new : unit -> 'T)
            >= 
interface
  /// Next possible address (not next address with a value)
  abstract Increment : unit -> 'T
  /// Previous possible address (not previous address with a value)
  abstract Decrement : unit -> 'T
end

A helper class adds op_Explicit constraints and defines the only possible operations on an address:

[<AbstractClass;Sealed>]
type private AddressHelper<'T when 'T :> IAddress<'T>
                    and 'T : comparison 
                    and 'T : struct 
                    and 'T : (new : unit -> 'T)
                    and 'T : (static member op_Explicit: int64 -> 'T)
                    and 'T : (static member op_Explicit: 'T -> int64)
                >() =
static let zero = new 'T()
static let invalid = zero.Decrement()
static member inline Increment(address : 'T) : 'T = address.Increment()
static member inline Decrement(address : 'T) : 'T = address.Decrement()
/// Invalid address in defined as zero address decremented
static member inline Invalid with get() : 'T = invalid
static member inline AsInt64<'T>(addr : 'T) : int64 = int64 addr
static member inline OfInt64<'T>(value : int64) : 'T = 
  ( ^T : (static member op_Explicit : int64 -> 'T) value) 

There is no method generateRange (lo:Address, hi:Address) from the current implementation, because an abstract address cannot know all used addresses between lo and hi, but only all possible addresses, which could be a very huge number in a general case.

Such definition of address makes a clear distinction between an address, a key and an index of a value. An address is something that corresponds 1-to-1 to a key (and derived from it) and something that could be used to get a value from a vector. An index (lower case) is a position of a key or a value from the beginning of an Index (capitalized, IIndex<'K>) or a vector (possibly the word position is better to avoid confusion with the IIndex).

The default linear address is implemented as:

/// Represents a type used for addressing values in a linear vector/index
[<CustomEquality; CustomComparison>]
type Address = // linear address
struct
  val private value : int64
  new(index:int64) = {value = index}
end
override x.Equals(yobj) =
  match yobj with
  | :? Address as y -> (x.value = y.value)
  | _ -> false
override x.GetHashCode() = x.value.GetHashCode()
override x.ToString() = x.value.ToString()
interface System.IComparable<Address> with
  member x.CompareTo y = x.value.CompareTo(y.value)
interface System.IComparable with
  member x.CompareTo other = 
    match other with
    | :? Address as y -> x.value.CompareTo(y.value)
    | _ -> invalidArg "other" "Cannot compare values of different types"

static member op_Explicit(addr: Address) : int64 = addr.value
static member op_Explicit(value: int64) : Address = Address(value)

interface IAddress<Address> with
  member x.Decrement() = Address(x.value - 1L)
  member x.Increment() = Address(x.value + 1L)

type AddressHelper = AddressHelper<Address>

The address module is redefined as:

  [<CompilationRepresentation(CompilationRepresentationFlags.ModuleSuffix)>]
  module Address =
    let invalid : Address = AddressHelper.Invalid
    let inline asInt64 (x:Address) : int64 = AddressHelper.AsInt64(x)
    let inline asInt (x:Address) : int = int <| AddressHelper.AsInt64(x)
    let inline ofInt64 (x:int64) : Address = AddressHelper.OfInt64(x)
    let inline ofInt (x:int) : Address = AddressHelper.OfInt64(int64 x)
    let inline increment (x:Address) = AddressHelper.Increment(x)
    let inline decrement (x:Address) = AddressHelper.Decrement(x)

Both Index and Vector implement additional methods:

/// Return address at a specified index
abstract AddressAt : int64 -> Address
/// Return index at a specified address
abstract IndexAt : Address -> int64

These methods are used everywhere instead of Address.zero, Address.asInt64, Address.ofInt64. Even though an address type is required to have op_Explicit to and from int64, such conversion doesn't convert an address to an index and vice versa. The int64 conversion is used for implementation: any practical address space is less than 2^64 and could be encoded as int64 (e.g. number of nanoseconds since 1/1/1900 will exceed 2^62 only in 2046).

Linear Index implements these properties as:

member inline private x.AddressAt(index) = Address.ofInt64 index
member inline private x.IndexAt(address) = Address.asInt64 address

All methods are inlined, therefore these methods directly call op_Explicit methods from an IAddress implementation. A quick performance tests shows that calling Address.ofInt64 is as fast as calling a constrcutor or just using int64 directly.

// In fsi a, b and c show the same timing (40-50 msec after fsi reset), d is more than 20 times slower (c.1100 msec after fsi reset)
let a = Array.init 10000000 (fun i -> Address.ofInt64(int64 i)) |> Array.length
let b = Array.init 10000000 (fun i -> Address(int64 i)) |> Array.length
let c = Array.init 10000000 (fun i -> (int64 i)) |> Array.length
let d = Array.init 10000000 (fun i -> Activator.CreateInstance<Address>()) |> Array.length

The abstract address implementation decouples Deedle's data structures from int64 and allows to reuse Deedle's interfaces; yet it keeps existing API unchanged and passes all tests (on my Win machine and on Travis's mono on Mac).

hmansell commented 9 years ago

Thanks very much, Victor!

Hopefully @tpetricek can review and merge this. I know he did some work on indexing as part of the bigdeedle work.

buybackoff commented 9 years ago

What is BigDeedle? What is in a roadmap for Deedle? Is this just #247, or something bigger?

hmansell commented 9 years ago

The relevant change to Deedle is #247. The idea is to support frames that are a virtual view on a potentially massive data store that would not fit in memory. Change #247 includes basic support for this, but there is more to do to make sure that operations don't needlessly materialize the whole frame, and so that they work in a way that the relevant parts of the frame are paged into memory without undue thrashing.

The unfortunate thing is that the only real implementation we have is against an internal proprietary market data store we have, so it is hard to demonstrate this stuff.

adamklein commented 9 years ago

This is a nice fulfillment of the abstraction that Tomas originally implemented. I'm glad you did some timings to see there is no drawback to converting int64 to Address for element dereferencing in the general case. That was always my concern. As for terminology - I think the terms "position" or "index" are ok, but personally, I like "offset", as we're describing the relative offset into something enumerable (eg: keys / values).

tpetricek commented 9 years ago

The change looks good to me - I have not tested it yet, but I think it should work fine.

That said, I would be quite curious if you have any use case in mind that requires this abstraction? I originally tried to make Address abstract but then realized that I cannot think of any reasonable situation where it would not be possible to just use int64 - and so I got rid of the abstraction.

I was thinking about, for example, a situation when given a key we cannot directly calculate the offset (or maybe we want to avoid doing binary search that is used in BigDeedle) - but I'm not sure that would work.

I'd be happy to merge this - because it does not complicate the implementation that much - but I think it would be good to have some evidence that the abstraction is actually needed for something!

EDIT: I suspect this might give us a better way of doing something like lazy loading. I'll need to add some documentation for BigDeedle, but there, we assume that we can always find int64 location corresponding to a given key (and we also need the total size of the series/frame).

buybackoff commented 9 years ago

@tpetricek I have a use case in mind when missing values are not stored and there could be holes in the key space (in general, the keys could be non-linear or even 2D/lexicographic...). I am still not sure that this could work. But without this change it will definitely not work.

I tried to make this change so that it doesn't affect the performance and doesn't change any API. I hope you could validate the statement that in this implementation we are calling op_Explicit directly on struct, because I only did a very quick test in FSI.

adamklein commented 9 years ago

I can imagine a complex mapping between address and offset/pointer, in which it helps to write code and think in address space, having abstracted over offsets. But like Tomas, I have never found this practically to be an issue. In theory I can see the utility.

buybackoff commented 9 years ago

Another option is to remove Address completely and live with Int64, but not keep the current confusing state. Address abstraction is not something that one truly needs. When I did the changes, I was just revisiting again Deedle's internals and tried to understand if I could extend it for a different addressing scheme while keeping most of the stuff unchanged (kind of virtual storage). Int64 is very good for what it does now and RAM is so cheap that in-memory processing will do the job for 99% cases.

hmansell commented 9 years ago

We already use BigDeedle with data that is not in-memory. Though an Int64 address works fine in that particular case because the rows are still ordinally addressed.

buybackoff commented 9 years ago

@hmansell How to you deal with issues like #219? Do you still load data in chunks and every time your current working set is in memory?

hmansell commented 9 years ago

A CSV file is not ordinally indexable unless you first scan the file to determine where each row starts. In our case, our data is truly ordinally indexable, and loaded into memory into chunks that are kept in an LRU cache. There are a few other characteristics of the storage mechanism that make this faster, too.

tpetricek commented 9 years ago

BTW - I was chatting with @hmansell about some new requirements for "Big Deedle" and we might actually need this in cases where we cannot (efficiently) calculate the number of total elements in a series or rows in a frame. So once I'll start looking into that, I'll come back to this PR!

tpetricek commented 9 years ago

I think we have a scenario where this would be useful, so I'll see if I can get this merged and use this!

The scenario is when data is stored in multiple partitions and so the address is essentially an index of the partition together with actual offset in the partition.

@buybackoff Did you ever write any alternative implementation of Address to test those abstractions? As far as I can see, there is still a number of places where we convert the address to int64 and then iterate over addresses, so I think that even with this change, there are still things that won't work correctly if the address is not essentially a linear index. I'm looking into this now, so I'll see what else needs to be changed...

buybackoff commented 9 years ago

@tpetricek No, I haven't implemented an alternative address. In rare cases when continuous arrays do not work well, I use a series abstraction similar to .NET's SortedList<K, SortedList<K,V>> with optimizations and persistence similar to bcolz. This commit was my second attempt to merge that abstraction with Deedle. As the first step, I isolated all usage of Address to see where the address type, not an offset, is actually used. I thought about implementing an index as an identity that returns a key as an address value, and then to use my series structure as a vector. But this required too much effort for vector/index builders and was impractical. At the same time, I had a chance to test Deedle on a real task that in less than an hour of work reduced 4+ hours of calculations in a legacy Excel/VBA file to less that 10 minutes. This is how I learned to stop worrying about imaginary use cases and love Deedle as it is :) It is more than good for data exploration, and I stopped trying to add complex event processing where it doesn't fit. Instead, I implemented bcolz-like storage, and currently I am investigating Nessos.Streams and Hopac's lazy streams approaches for a 'universal' streaming series abstraction.

I didn't try at all to change int64 to Address inside LinearVector and LinearIndex. 'Linear' is in their name and purity inside this implementation classes was unnecessary, given that explicit int64 conversions are a part of the new Address interface. To find inconsistencies in other places, I was changing the alias type to uint64 and checked where type inference was failing outside the 'Linear' files. Then tests and reviewing. I could have missed some other places.

tpetricek commented 9 years ago

@buybackoff Thanks for the reply! This will be useful for what I'm doing.

I think I'll still keep Address = int64, but I'm trying to remove the assumption that you can increment address to get the next element (at least, for some special kinds of indices). The idea is to use int64 as an encoding of two int values that represent two-level index (partition index with an offset in the partition).

tpetricek commented 9 years ago

I used some of the ideas from this in #302, so thanks a lot for doing this work!

Basically, we still have Address = int64, but the individual vectors/indices can carry "address operations" that use the value a bit differently. So, as I mentioned above, this now makes it possible to use partition/offset scheme (for example). So, maybe this is good enough to let us experiment with alternative schemes! (I'm still curious about moving away from the int64 requirement too, but it needs to be done in a way where one index/vector can still use it and another can use another scheme - which was not really possible in this PR.)

There is some more information about this in the design notes comment about BigDeedle.