MLton / mlton

The MLton repository
http://mlton.org
Other
951 stars 126 forks source link

Factor out primitive sequences #68

Closed robsimmons closed 10 years ago

robsimmons commented 10 years ago

The numeric part of the basis build seems designed with the idea in mind that IntInf should be a perfectly reasonable default integer type. However, that means that vectors are subscripted by integers, and the goal of #60 is to make integers out of vectors. I wanted to discuss how to proceed.

I think the thing to do is to factor out the SeqIndex.int parts of the SEQUENCE signature into a PrimSequence functor and PRIM_SEQUENCE signature, and then add primitive arrays and vectors to the build sequence right after num0.sml. There's the additional problem that we don't have maxLen, geu, or gtu around already at this point; as I currently see it that may require some duplicated code.

While thinking about implementing this I noticed that a lot of sequence operations are defined in terms of slice operations (sub and unsafeSub were the two that surprised me) - is the optimizer able to factor out the apparent cost of pushing those operations through Slice?

MatthewFluet commented 10 years ago

I think a functor PrimSequence ( ... ) : PRIM_SEQUENCE sounds like a fine way to make a significant portion of the array/vector functionality available earlier. I think that you do have SeqIndex.maxInt, SeqIndex.geu and SeqIndex.gtu around after num0.sml, so you may not need any duplication.

As for defining full operations in terms of slice operations, most of the array/vector operations are small enough to be inlined and slice data structure is eliminated.

robsimmons commented 10 years ago

Thanks for your response! I was including the new sequence0 after num0 but before SeqIndex got rebound; doing it just a little later gives me what I need, thank you.

robsimmons commented 10 years ago

So here's something that is a bit confounding for this approach: maxLen' is calculated in sequence.fun to be the minimum of Int.maxInt (if it's a value) and SeqIndex.maxInt'. This is basically reasonable, but now maxLen is being defined before we know Int.maxInt, in a kind of important way (IntInf hasn't been defined yet).

I can't imagine what use having a SeqIndex that was larger than Int would actually serve, so it seems like the right solution is just to say that the maximum length of a sequence is just SeqIndex.maxInt, full stop. However, because the correctness of sequence.fun pervasively depends on this assumption, if I'm going to refactor the code to assume that, it seems like there needs to be a check somewhere that this is internally consistent, and I'm not sure how to go about that. One option is the following, which will cause Overflow to be raised in the basis if the necessary condition isn't satisfied.

      local
         structure S =
            Int_ChooseInt
            (type 'a t = 'a
             val fInt8 = Int8.fromLarge (SeqIndex.toLarge Prim.maxLen)
             val fInt16 = Int16.fromLarge (SeqIndex.toLarge Prim.maxLen)
             val fInt32 = Int32.fromLarge (SeqIndex.toLarge Prim.maxLen)
             val fInt64 = Int64.fromLarge (SeqIndex.toLarge Prim.maxLen)
             val fIntInf = SeqIndex.toLarge Prim.maxLen)
      in
         val maxLen = S.f
      end

Is that a reasonable way to proceed, or are exceptions in the basis due to integer size misconfiguration a no-no?

robsimmons commented 10 years ago

The current calculation takes place here, by the way. As an aside, Line 104 there should only typecheck if Int = Int32... has the basis actually been compiled with Int as anything else?

MatthewFluet commented 10 years ago

Regarding arrays-and-vectors/sequence.fun#L104: At build/sources.mlb#L88, the structure Int does not match the SML Basis Library's signature INTEGER. At this point, the precision value is a Int32.int option.

The type-check-all target of basis-library/Makefile type checks the Basis Library implementation under all possible configurations. I've done self-compiles with -default-type int64 and -default-type intinf.

MatthewFluet commented 10 years ago

I think the maxLen' concern is a non-issue.

One thing to keep in mind is that the Basis Library implementation pervasively uses the following idiom:

structure Foo =
struct
   open Foo
   (* shadow previous Foo declarations and/or add new declarations *)
end

While this is somewhat of an anti-pattern in SML programming, because it obscures what structure Foo is at any given point in the program elaboration, it is particularly helpful in the Basis Library implementation because so many of the structures are interdependent upon one another (and it is better than the alternative of one massive implementation structure which defines all of the declarations in use-def order, but without any module-level structuring). Consequently, in the Basis Library implementation, structure Int is not necessarily the Basis Library specification's structure Int : INTEGER. Indeed, the Basis Library specification's signature INTEGER is only defined at build/sources.mlb:L115 and only at build/sources.mlb:L138 is the "default" structure Int bound with declarations compatible with signature INTEGER.

Regarding the relationship between structure SeqIndex and the "default" structure Int: the SeqIndex.int type is meant to be a primitive type for indexing into an array or vector. Ultimately, array and vector indexing will be compiled to address arithmetic, so SeqIndex.int is bound to a type that is efficient for the target platform's address arithmetic; in particular, we want something like Array.foldl, which has a tight arithmetic loop behind the scenes, to do all of the arithmetic at the target's natural word size. Thus, on a 32-bit platform, structure SeqIndex follows structure Int32, while on a 64-bit platform, structure SeqIndex follows structure Int64. Consequently, on a 64-bit platform with the default -default-type int32, structure SeqIndex is larger than structure Int. On the other hand, we know that structure SeqIndex will always be a fixed-sized integer.

In the refactored sequence0.sml implementation, you shouldn't need anything from structure IntInf to implement the sequence operations that are all in terms of SeqIndex.int. Note that when you slice the computing of maxLen' out of the current calculation, you will have:

      local
         fun doit (precision, fromInt, maxInt') =
            if Primitive.Int32.>= (valOf SeqIndex.precision, precision)
               then fromInt maxInt'
            else SeqIndex.maxLen'
         structure S =
            Int_ChooseInt
            (type 'a t = SeqIndex.int
             val fInt8 = doit (valOf Primitive.Int8.precision,
                               SeqIndex.schckFromInt8,
                               Primitive.Int8.maxInt')
             val fInt16 = doit (valOf Primitive.Int16.precision,
                                SeqIndex.schckFromInt16,
                                Primitive.Int16.maxInt')
             val fInt32 = doit (valOf Primitive.Int32.precision,
                                SeqIndex.schckFromInt32,
                                Primitive.Int32.maxInt')
             val fInt64 = doit (valOf Primitive.Int64.precision,
                                SeqIndex.schckFromInt64,
                                Primitive.Int64.maxInt')
             val fIntInf = SeqIndex.maxLen')
      in
         val maxLen' = S.f
      end

All of these values/operations will be available after elaborating integer/iwconv0.sml at build/sources.mlb:L22.

Then, the SML implementation of structure IntInf will be built using some structure Vector0 = Sequence0(...) and you will need to use structure SeqIndex operations for computing/comparing lengths, computing indices, etc.

robsimmons commented 10 years ago

Thank you for the type-chec-all pointer, I needed that. And of course, a 32-bit int and 64-bit size_t is entirely reasonable, as I should have seen.

In trying to take this approach, I run into the problem that it's the Int_ChooseInt functor itself that's not yet available after integer/iwconv0.sml. I can add config/default/default-$(DEFAULT_INT).sml to build/sources.mlb#L30-L33, but because IntInf.int is not defined yet (even in the current setup it's only defined as Primitive.IntInf.int), compilation then fails when trying to choose integers to be infinite-precision integers.

MatthewFluet commented 10 years ago

You should be able to do something like:

   local
      local
         ../config/bind/int-prim.sml
         ../config/bind/int-inf-prim.sml
      in  ann "forceUsed" in
         ../config/default/default-$(DEFAULT_INT).sml
      end end
   in
      ../arrays-and-vectors/sequence0.sml
   end

All of the code in config/bind and config/default is essentially "free" -- they are just conveniences for doing "type = type" and "structure = structure" declarations, which is just namespace management.

We could probably perform fewer rebindings, but you need to carefully use "where type" clauses to propagate type equalities through opaque signature matches. And, with MLton's defunctorization, there is no advantage to having fewer functor declarations or fewer rebindings.