edouarda / brigand

Instant compile time C++ 11 metaprogramming library
Boost Software License 1.0
574 stars 48 forks source link

brigand::sort uses a lot of RAM #118

Closed odinthenerd closed 8 years ago

odinthenerd commented 8 years ago

Using brigand::sort with a roughly 100 element list on the VisualGDB IDE I get cc1plus.exe: out of memory allocating 135167 bytes error. I understand that this is probably due to the fact that the provided GCC compiler can only access 2 GB or RAM for some reason. Despite the fact that this VisualGDB limitation is stupid there may also be some optimization potential in brigand. Using my cheap old insertion sort I used before discovering brigand (see below) I get to about 300 elements before I run out of memory.

Has there been any testing done to determine the current sort algorithm superior?

    namespace Detail {
        template<typename T_Out, template<typename, typename > class T_Pred,
                typename T_Insert, bool B_Tag, typename ... Ts>
        struct SortInsert;
        //next is less than insert, next is not end
        template<typename ... Os, template<typename, typename > class T_Pred,
                typename T_Insert, typename T1, typename T2, typename ... Ts>
        struct SortInsert<brigand::list<Os...>, T_Pred, T_Insert, true, T1, T2, Ts...> : SortInsert<
            brigand::list<Os..., T1>, T_Pred, T_Insert, T_Pred<T2, T_Insert>::value, T2,
                Ts...> {};
        //next is less than insert, next is end, terminate
        template<typename ... Os, template<typename, typename > class T_Pred,
                typename T_Insert, typename ... Ts>
        struct SortInsert<brigand::list<Os...>, T_Pred, T_Insert, true, Ts...> {
            using type = brigand::list<Os..., Ts..., T_Insert>;
        };
        //next is not less than insert, terminate
        template<typename ... Os, template<typename, typename > class T_Pred,
                typename T_Insert, typename ... Ts>
        struct SortInsert<brigand::list<Os...>, T_Pred, T_Insert, false, Ts...> {
            using type = brigand::list<Os..., T_Insert, Ts...>;
        };

        template<typename TOut, template<typename, typename > class P, typename ... Ts>
        struct Sort{
            static_assert(AlwaysFalse<TOut>::value,"implausible parameters");
        };
        //out and in are not empty
        template<typename O, typename ... Os,
                template<typename, typename > class TPred, typename TInsert,
                typename ... Ts>
        struct Sort<brigand::list<O, Os...>, TPred, TInsert, Ts...> : Sort<
                typename Detail::SortInsert<brigand::list<>, TPred, TInsert,
                        TPred<O, TInsert>::value, O, Os...>::type, TPred, Ts...>
        {};
        //out is empty, in is not empty
        template<typename ... Os, template<typename, typename > class TPred,
                typename TInsert, typename ... Ts>
        struct Sort<brigand::list<Os...>, TPred, TInsert, Ts...> : Sort<
                typename Detail::SortInsert<brigand::list<>, TPred, TInsert, false, Os...>::type,
                TPred, Ts...>
        {};
        //in is empty
        template<typename ... Os, template<typename, typename > class P, typename ... Ts>
        struct Sort<brigand::list<Os...>, P, Ts...> {
            using type = brigand::list<Os...>;
        };
    }

    //Sort
    template<typename TList, typename TPred = LessP>
    struct Sort{
        static_assert(AlwaysFalse<TList>::value,"implausible type");
    };

    //empty input case
    template<template<typename, typename > class TPred>
    struct Sort<brigand::list<>, Template<TPred>> {
        using type = brigand::list<>;
    };

    //one element case
    template<typename T, template<typename, typename > class TPred>
    struct Sort<brigand::list<T>, Template<TPred>> {
        using type = brigand::list<T>;
    };

    //two or more elements case
    template<typename ... Ts, template<typename, typename > class TPred>
    struct Sort<brigand::list<Ts...>, Template<TPred>> :
        Detail::Sort<brigand::list<>, TPred,    Ts...> {};

    //alias
    template<typename TList, typename TPred = LessP>
    using SortT = typename Sort<TList,TPred>::type;
edouarda commented 8 years ago

Sort isn't optimized yet, that could be a task for the next milestone. Problem is, we should probably check the behavior on several compilers.

What is the difference in terms of compilation speed?

odinthenerd commented 8 years ago

On GCC my sort is also faster, but not by much. I think we could optimize it though since TOut is sorted we could look ahead 10 and if that item is smaller peel off 10 at a time. That would make it less quadratic. Nearing O(n*(n/20 + 5)) on big list. Big O notation is so clumsy in this context, is there a better nomenclature?. I have not checked on anything but GCC 4.8, I am at FOSDEM this weekend but if I find free time next week I will brigandize it, add the short circuiting and get it working on MSVC2013 so we have some real comparison.

edouarda commented 8 years ago

It would also be cool to see on clang, that's even more important as clang is gaining market shares even on Windows.

odinthenerd commented 8 years ago

I personally need it super optimized on gcc only but I will make several versions, it think peeling off several from the input, sorting them and then merging that with the output may also be faster.

edouarda commented 8 years ago

If you see a significant gain on gcc and there is no degradation on other compilers, that's good enough for me.

odinthenerd commented 8 years ago

I have realized bench marking compile time for different use cases is not a lot of fun ;)

Compiling on my GCC 4.8 the following implementation is about 5x faster when sorting 280 std::integral_constants and uses about 2.5x less RAM:

    namespace Detail {
        template<template<typename, typename> class Pred>
        struct Sort
        {
            template<typename ...Ts>
            using l = void(*)(Ts*...);
            template<typename In, typename ... Os, typename T>
                static l<Os..., T, In> sort_insert(In*, std::true_type*, void(*)(Os*...), T*) { return {} ; }

            template<typename In, typename ... Os, typename ... Ts>
                static l<Os..., In, Ts...> sort_insert(In*, std::false_type*, void(*)(Os*...), Ts*...) { return {}; }

            template<typename In, typename ... Os, typename T1, typename T2, typename ... Ts>
                static  auto sort_insert(In*, std::true_type*, void(*)(Os*...), T1*, T2*, Ts*...)->decltype(
                        sort_insert(
                        (In*)0, 
                        (typename Pred<T2,In>::type *)0,
                        (void(*)(Os*..., T1*))0,
                        (T2*)0,
                        ((Ts*)0)...)) 
                        { return { }; }

            template<typename ... Ts>
            static list<Ts...> sort(void(*)(Ts*...)) //in is empty
            {
                return { };
            };
                //in is not empty
            template<typename O, typename ... Os, typename In, typename ... Ts>
            static auto sort(void(*)(O*, Os*...), In*, Ts*...)->decltype(
                        sort(
                            sort_insert(
                                (In*)0,
                                (typename Pred<O, In>::type *)0,
                                (void(*)())0,
                                (O*)0,
                                ((Os*)0)...),
                            ((Ts*)0)...))
                {
                    return { };
                }
        };
    }

    //Sort
    template<typename TList, typename TPred>
        struct sort_impl;
    //empty input case
    template<template<typename, typename > class TPred>
        struct sort_impl<list<>, brigand::quote<TPred>> {
            using type = list<>;
        };
        //one element case
    template<typename T, template<typename, typename > class TPred>
        struct sort_impl<list<T>, brigand::quote<TPred>> {
            using type = list<T>;
        };
        //two or more elements case
    template<typename T1, typename T2, typename ... Ts, template<typename, typename > class TPred>
        struct sort_impl<list<T1, T2, Ts...>, brigand::quote<TPred>> {
            using type = decltype(Detail::Sort<TPred>::sort((void(*)(T1*))0, (T2*)0, ((Ts*)0)...));
            };
    template<typename TList, typename TPred>
        using sort = typename sort_impl<TList, TPred>::type;

It works on MSCV2015 too although its only about 5% more efficient than the current implementation there.

It seems using builtin types rather than class templates like using function pointers to represent lists yields a considerable advantage on GCC, I have not yet tried clang.

Fast tracking the insertion and ripping off more than one element at a time are still optimization potential I am working on.

Ripping off more than one element per recursion seems unavoidable since I'm hitting recursion limits now before I run out of RAM at about 500 elements.

ldionne commented 8 years ago

I'm very surprised that using overloading is faster than partial specialization. I would have expected your initial implementation using structs to be faster. Is this not the case?

odinthenerd commented 8 years ago

No functions are faster. My first implementation with 300 elements takes about 26 seconds and peaks 1.3 GB ram, the new one using functions takes about 5 seconds and peaks at 380 MB ram! I was surprised too which Is why I posted the intermediate results here.

odinthenerd commented 8 years ago

Actually a hybrid might be interesting, using function pointers instead of list but using partial specialization instead of overloading.

ldionne commented 8 years ago

Can you post the benchmark in a Gist? I'd like to try it with Clang.

edouarda commented 8 years ago

Yes more that one compiler would be nice, but that's some interesting data though.

odinthenerd commented 8 years ago

It seems my code does not work on clang, not sure if its a clang bug: see stackoverflow question. I will update.

odinthenerd commented 8 years ago

Looks like MSVC and GCC are wrong, does work with C++14 with a slight modification though.

I also made a hybrid of both sorts for comparison:

    namespace Detail {
        template<template<typename, typename> class Pred>
            struct Sort
            {

            template<typename In, bool B, typename L, typename... Ts>
                    struct si;
            template<typename In, typename...Os, typename T>
                struct si<In, true, void(*)(Os...), T>
                        {
                            using type = void(*)(Os...,
                                T,
                                In);
                        };
            template<typename In, typename...Os, typename...Ts>
                struct si<In, false, void(*)(Os...), Ts...>
                        {
                            using type = void(*)(Os...,
                                In,
                                Ts...); 
                        };

            template<typename In, typename...Os, typename T1, typename T2, typename...Ts>
                struct si<In, true, void(*)(Os...), T1, T2, Ts...> : si<In, Pred<T2, In>::value, void(*)(Os..., T1), T2, Ts...>
                        {
                        };

            template<typename L, typename ... Ts>
                        struct sort;
            template<typename ... Os>
                struct sort<void(*)(Os...)>
                        {
                            using type = brigand::list<Os...>;
                        };
                            //in is not empty
            template<typename O, typename ... Os, typename In, typename ... Ts>
                struct sort<void(*)(O, Os...), In, Ts...> : sort<typename si<In, Pred<O, In>::value, void(*)(), O, Os...>::type, Ts...> {};
            };
    }

    //Sort
    template<typename TList, typename TPred>
        struct sort_impl;
    //empty input case
    template<template<typename, typename > class TPred>
        struct sort_impl<list<>, brigand::quote<TPred>> {
            using type = list<>;
        };
        //one element case
    template<typename T, template<typename, typename > class TPred>
        struct sort_impl<list<T>, brigand::quote<TPred>> {
            using type = list<T>;
        };
        //two or more elements case
    template<typename T1, typename T2, typename ... Ts, template<typename, typename > class TPred>
        struct sort_impl<list<T1, T2, Ts...>, brigand::quote<TPred>> : Detail::Sort<TPred>::template sort<void(*)(T1), T2, Ts...> {};
    template<typename TList, typename TPred>
        using sort = typename sort_impl<TList, TPred>::type;

I am not seeing any advantage when using the hybrid vs the plain vanilla insertion sort (first one I posted in this thread) and the function sort (second one) is about 20% slower on clang.

On GCC the hybrid is slightly faster than the plain vanilla and the function version is considerably faster as mentioned earlier.

In conclusion the function trick is awesome on GCC but is technically enabled by a bug in the compiler in C++11, is implementable in C++14 and doesn't help at all (makes things worse) on other compilers.

Another thing I noticed with clang in Visual Studio 15 is that -ftemplate-depth=256, im not sure if that is the default elsewhere but it would be nice to sort big lists without hitting that barrier. I think we can decrease the algorithmic complexity further.

odinthenerd commented 8 years ago

Fast tracking the insert (only skipping 4) looks about 3x better on both clang and GCC


    namespace Detail {
        template<template<typename, typename> class Pred>
        struct S
        {
            template<typename Out, typename In,
                bool Tag, bool FTag, typename ... Ts>
            struct SortInsert;

            //next is not less than insert, terminate
            template<typename ... Os, typename In, typename ... Ts>
            struct SortInsert<list<Os...>, In, false, false, Ts...> {
                using type = list<Os..., In, Ts...>;
            };
            //next is less than insert, next is end, terminate
            template<typename ... Os, typename In, typename T>
            struct SortInsert<list<Os...>, In, true, false, T> {
                using type = list<Os..., T, In>;
            };
            //next is less than insert, next is not end
            template<typename ... Os, typename In,
                typename T1, typename T2, typename ... Ts>
            struct SortInsert<list<Os...>, In, true, false, T1, T2, Ts...> :
                SortInsert<list<Os...,  T1>,
                In,
                Pred<T2, In>::value,
                false,
                T2, Ts...> {};
            //fast track is less than insert, no more
            template<typename ... Os, typename In, typename T1, typename T2, typename T3, typename T4>
            struct SortInsert<list<Os...>, In, true, true, T1, T2, T3, T4> {
                using type = list<Os..., T1, T2, T3, T4, In>;
            };
            //fast track is less than insert, not enough to fast track again
            template<typename ... Os, typename In, typename T1, typename T2, typename T3, typename T4, typename ... Ts>
            struct SortInsert<list<Os...>, In, true, true, T1, T2, T3, T4, Ts...> :
                SortInsert<list<Os...,  T1, T2, T3, T4>,
                In,
                Pred<T2,In>::value,
                false,
                Ts...> {};
            //fast track is less than insert, can fast track again
            template<typename ... Os, typename In, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename ... Ts>
            struct SortInsert<list<Os...>, In, true, true, T1, T2, T3, T4, T5, T6, T7, T8, Ts...> :
                SortInsert<list<Os...,T1,T2,T3,T4>,
                In,
                Pred<T5, In>::value,
                Pred<T8, In>::value,
                T5, T6, T7, T8,
                Ts...> {};

            template<typename Out, typename ... Ts>
            struct Sort {
                using type = Out;  //in is empty
            };
            //out cant fast track
            template<typename O, typename ... Os, typename In, typename ... Ts>
            struct Sort<list<O, Os...>, In, Ts...> :
                Sort< typename SortInsert<list<>,
                In,
                Pred<O,
                In>::value,
                false,
                O,
                Os...>::type,
                Ts...> {};
            //out can fast track 
            template<typename O1, typename O2, typename O3, typename O4, typename ... Os, typename In, typename ... Ts>
            struct Sort<list<O1,O2,O3,O4, Os...>, In, Ts...> :
                Sort< typename SortInsert<list<>,
                In,
                Pred<O1,In>::value,
                Pred<O4,In>::value,
                O1, O2, O3, O4,
                Os...>::type,
                Ts...> {};
        };
    }

    //Sort
    template<typename TList, typename TPred>
    struct sort_impl;
    //empty input case
    template<template<typename, typename > class TPred>
    struct sort_impl<list<>, brigand::quote<TPred>> {
        using type = list<>;
    };
    //one element case
    template<typename T, template<typename, typename > class TPred>
    struct sort_impl<list<T>, brigand::quote<TPred>> {
        using type = list<T>;
    };
    //two or more elements case
    template<typename T, typename U, typename ... Ts, template<typename, typename > class TPred>
    struct sort_impl<list<T,U,Ts...>, brigand::quote<TPred>> : Detail::S<TPred>::template Sort<list<T>, U, Ts...> {};
    //alias
    template<typename TList, typename TPred>
    using sort = typename sort_impl<TList, TPred>::type;
odinthenerd commented 8 years ago

After making the fast track skip 8 rather than 4 and ripping off 4 elements at a time in sort and feeding them to SortInsert (in order to get around max recursion of 250 default in clang) I get to about 650 elements in clang and 550 in GCC before I hit 2 gigs of memory use. I think merging a list of presorted elements into SortInsert would make it even faster.

odinthenerd commented 8 years ago

    namespace Detail {
        template<template<typename, typename> class Pred>
        struct S
        {
            template<typename Out, typename In,
                bool Tag, bool FTag, typename ... Ts>
            struct SortInsert;

            //next is not less than insert, no more ins, terminate
            template<typename ... Os, typename In, typename ... Ts>
            struct SortInsert<list<Os...>, list<In>, false, false, Ts...> {
                using type = list<Os..., In, Ts...>;
            };
            //next is less than insert, next is end, terminate
            template<typename ... Os, typename... Ins, typename T>
            struct SortInsert<list<Os...>, list<Ins...>, true, false, T> {
                using type = list<Os..., T, Ins...>;
            };
            //next is not less than insert, have more next and have more ins, cant fast track
            template<typename ... Os, typename In1, typename In2, typename...Ins, typename T, typename ... Ts>
            struct SortInsert<list<Os...>, list<In1, In2, Ins...>, false, false, T, Ts...> :
                SortInsert<list<Os..., In1>,
                list<In2, Ins...>,
                Pred<T, In2>::value,
                false,
                T, Ts...> {};
            //next is not less than insert, have more next and have more ins, can fast track
            template<typename ... Os, typename In1, typename In2, typename...Ins, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename ... Ts>
            struct SortInsert<list<Os...>, list<In1, In2, Ins...>, false, false, T1, T2, T3, T4, T5, T6, T7, T8, Ts...> :
                SortInsert<list<Os..., In1>,
                list<In2, Ins...>,
                Pred<T1, In2>::value,
                Pred<T8, In2>::value,
                T1, T2, T3, T4, T5, T6, T7, T8, Ts...> {};
            //next is less than insert, next is not end
            template<typename ... Os, typename In, typename... Ins,
                typename T1, typename T2, typename ... Ts>
            struct SortInsert<list<Os...>, list<In,Ins...>, true, false, T1, T2, Ts...> :
                SortInsert<list<Os..., T1>,
                list<In,Ins...>,
                Pred<T2, In>::value,
                false,
                T2, Ts...> {};
            //fast track is less than insert, no more
            template<typename ... Os, typename In, typename...Ins, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8>
            struct SortInsert<list<Os...>, list<In,Ins...>, true, true, T1, T2, T3, T4, T5, T6, T7, T8> {
                using type = list<Os..., T1, T2, T3, T4, T5, T6, T7, T8, In, Ins...>;
            };
            //fast track is less than insert, not enough to fast track again
            template<typename ... Os, typename In, typename...Ins, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename ... Ts>
            struct SortInsert<list<Os...>, list<In, Ins...>, true, true, T1, T2, T3, T4, T5, T6, T7, T8, Ts...> :
                SortInsert<list<Os..., T1, T2, T3, T4, T5, T6, T7, T8>,
                list<In, Ins...>,
                Pred<T2, In>::value,
                false,
                Ts...> {};
            //fast track is less than insert, can fast track again
            template<typename ... Os, typename In, typename...Ins, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9, typename T10, typename T11, typename T12, typename T13, typename T14, typename T15, typename T16, typename ... Ts>
            struct SortInsert<list<Os...>, list<In, Ins...>, true, true, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16, Ts...> :
                SortInsert<list<Os..., T1, T2, T3, T4, T5, T6, T7, T8>,
                list<In, Ins...>,
                Pred<T9, In>::value,
                Pred<T16, In>::value,
                T9, T10, T11, T12, T13, T14, T15, T16,
                Ts...> {};

            template<typename Out, typename In>
            struct CallSortInsert;

            //out cant fast track
            template<typename O, typename ... Os, typename In, typename... Ins>
            struct CallSortInsert<list<O, Os...>, list<In,Ins...>> :
                SortInsert<list<>,
                list<In,Ins...>,
                Pred<O, In>::value,
                false,
                O,
                Os...> {};
            //out can fast track 
            template<typename O1, typename O2, typename O3, typename O4, typename O5, typename O6, typename O7, typename O8, typename ... Os, typename In, typename... Ins>
            struct CallSortInsert<list<O1, O2, O3, O4, O5, O6, O7, O8, Os...>, list<In,Ins...>> :
                SortInsert<list<>,
                list<In,Ins...>,
                Pred<O1, In>::value,
                Pred<O8, In>::value,
                O1, O2, O3, O4, O5, O6, O7, O8,
                Os...> {};

            template<typename Out, typename ... Ts>
            struct Sort {
                using type = Out;  //in is empty
            };
            //out cant fast track
            template<typename Out, typename In, typename ... Ts>
            struct Sort<Out, In, Ts...> :
                Sort< typename CallSortInsert<Out, list<In>>::type,
                Ts...> {};
            //out can fast track 
            template<typename Out, typename T1, typename T2, typename T3, typename T4, typename T5, typename T6, typename T7, typename T8, typename T9, typename T10, typename T11, typename T12, typename T13, typename T14, typename T15, typename T16, typename ... Ts>
            struct Sort<Out, T1, T2, T3, T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16, Ts...> :
                Sort< typename CallSortInsert<Out, typename Sort<list<T1>,T2,T3,T4, T5, T6, T7, T8, T9, T10, T11, T12, T13, T14, T15, T16>::type>::type,
                Ts...> {};
        };
    }

    //Sort
    template<typename TList, typename TPred>
    struct sort_impl;
    //empty input case
    template<template<typename, typename > class TPred>
    struct sort_impl<list<>, brigand::quote<TPred>> {
        using type = list<>;
    };
    //one element case
    template<typename T, template<typename, typename > class TPred>
    struct sort_impl<list<T>, brigand::quote<TPred>> {
        using type = list<T>;
    };
    //two or more elements case
    template<typename T, typename U, typename ... Ts, template<typename, typename > class TPred>
    struct sort_impl<list<T, U, Ts...>, brigand::quote<TPred>> : Detail::S<TPred>::template Sort<list<T>, U, Ts...> {};
    //alias
    template<typename TList, typename TPred>
    using sort = typename sort_impl<TList, TPred>::type;

Merging two sorted lists with fast tracking seems to be the way to go. I can get over 1024 elements on clang and GCC and about 10x faster than the current brigand implementation on both. I don't see any reason why MSVC13 should not work with this, on MSVC15 its about 10x faster until about 300 elements and then runs out of memory on my machine (only 4 GB of RAM available for MSVC) where as the current brigand implementation gets to about 400 elements, albeit considerably slower, before it runs out of RAM.

I will test on MSVC2013 and make a pull request unless anyone else thinks of anything more to test. My goal was to get to 1000 elements on GCC.

ldionne commented 8 years ago

Thanks for the analysis. I'll try to adapt this sort for use with Hana, and benchmark both the current implementation and this one. I'll keep the fastest one and report the results here (I'll credit you, don't worry).

odinthenerd commented 8 years ago

I just realized your the hana guy, awesome work you have been doing! If I ever see you at a conference you are in for some serious free beer ;)

ldionne commented 8 years ago

Well thanks, but in this case I should definitely be the one buying you a beer since you did all the hard work!

edouarda commented 8 years ago

Omg Louis is the Hana guy???? :-)

@porkybrain If you do a pull request it will trigger a build and we will know about MSVC 2013. Worst case we will do a ifdef.

odinthenerd commented 8 years ago

I can confirm that it works on MSVC2013, same weird ceiling though, I get to 450 in like 5 seconds but it crashes on this machine at 500. The current brigand implementation uses more RAM to get to 450 and takes like 15 minutes to get to 500 but does not crash.

I think more realistic benchmarks like sorting 100 lists ranging in size from 5 to 200 elements would be interesting. I have just been testing for max length so far.

edouarda commented 8 years ago

500 elements is indeed a lot. If it works faster and better, a pull request is warmly welcomed.

odinthenerd commented 8 years ago

Found a GCC bug while making this container agnostic, it seems GCC considers this

template<typename...>
    struct List{};

template<template<typename...> class Seq>
    struct S
{
    template<typename T>
        struct A;
    template<typename T, typename...Ts>
        struct A<Seq<T, Ts...>>
        {
            using type = int;
        };
    template<typename T, typename U, typename...Ts>
        struct A<Seq<T, U, Ts...>>
        {
            using type = bool;
        };
    };

using g = typename S<List>::template A<List<int, int>>::type;

to be ambiguous.

mkurdej commented 8 years ago

Isn't GCC right? Which specialization should it choose when you instantiate A<Seq<T1, T2, T3>>? Both are valid to my eyes (the first: T=T1, Ts={T2, T3}, and the second: T=T1, U=T2, Ts={T3}.

If I were you, I would rather have 3 specializations:

odinthenerd commented 8 years ago

The second specialization matches a smaller set of possible inputs (e.g. second specialization does not match List<int> but first does) so it should be considered more specialized and therefore be resolved to in this example. MSVC and Clang both work with this exact example and GCC works if I deduce Seq in A's parameter list rather than taking it from S's parameter list.

It actually looks a tiny bit faster if I remember what container the user called sort with, use brigand::list under the hood and then put the result back in the original container but it could just be noise ;) It does work as a work around though.

odinthenerd commented 8 years ago

124 should solve all the issues discussed here.

ldionne commented 8 years ago

It turns out that this fast-tracked version of insertion sort performs worse than a naive insertion sort currently used in Hana. The difference is not huge, but it's there. The benchmark was done with Apple LLVM version 6.1.0 (based on Clang 3.6.0). screen shot 2016-02-25 at 21 58 54

ldionne commented 8 years ago

@porkybrain Could you please post a gist to the benchmark you used, if you still have it around?

odinthenerd commented 8 years ago

My bench marking is probably way less sophisticated than yours. I call the compiler from a batch like

set startTime=%time% invoke compiler here echo Start Time: %startTime% echo Finish Time: %time%

so its wall clock time and not cpu time but I had nothing else running so the result was pretty stable running the same test a few times.

I generated a list of std::integral_constants with python breaking every 50 so I can comment them out in steps of 50 easily. (I only tested in steps of 50)

I was actually most interested in the speed of big lists (500+) because I need that for kvasir, I see you only tested up to 200 elements, the way its set up the fast tracks (skipping 8) will probably not fire all that often even when merging 16 input list elements into 176 output list (the second to last merge) if the values are distributed semi-evenly over the 176.

I will not have time again until Monday but I will post more reproducible benchmarks and tweak it a bit, I think it still has more potential.

ldionne commented 8 years ago

Alright, I'll wait for your benchmarks to be up.

odinthenerd commented 8 years ago

I think I'm bench marking wrong but I have no idea where my error is. I'm on my laptop (couldn't sleep) and installed clang 3.7 with microsoft codegen and cranked up the "MSProject build output verbosity" until it gives me timing data. Then I compile this code:

#include <boost\hana.hpp>
#include <brigand\algorithms\sort.hpp>

template<int I>
using i = std::integral_constant<int, I>;

//200 elements
using l = boost::hana::tuple<i<26>, i<66>, i<88>, i<12>, i<12>, i<92>, i<8>, i<58>, i<59>, i<6>, i<79>, i<32>, i<93>, i<62>, i<89>, i<29>, i<32>, i<38>, i<96>, i<56>, i<75>, i<16>, i<32>, i<22>, i<21>, i<2>, i<41>, i<76>, i<68>, i<54>, i<29>, i<74>, i<77>, i<8>, i<31>, i<69>, i<89>, i<56>, i<5>, i<94>, i<42>, i<86>, i<100>, i<97>, i<1>, i<93>, i<10>, i<69>, i<97>, i<27>, i<34>, i<56>, i<51>, i<13>, i<83>, i<68>, i<7>, i<85>, i<93>, i<26>, i<45>, i<81>, i<6>, i<5>, i<53>, i<27>, i<84>, i<3>, i<21>, i<18>, i<28>, i<94>, i<3>, i<78>, i<44>, i<77>, i<70>, i<94>, i<31>, i<63>, i<60>, i<6>, i<16>, i<11>, i<8>, i<54>, i<81>, i<80>, i<12>, i<13>, i<71>, i<56>, i<100>, i<9>, i<33>, i<23>, i<2>, i<28>, i<67>, i<92>, i<73>, i<28>, i<35>, i<39>, i<45>, i<99>, i<85>, i<53>, i<30>, i<31>, i<96>, i<34>, i<95>, i<87>, i<32>, i<11>, i<1>, i<39>, i<44>, i<49>, i<30>, i<14>, i<100>, i<11>, i<0>, i<75>, i<55>, i<20>, i<81>, i<94>, i<30>, i<88>, i<74>, i<4>, i<97>, i<96>, i<54>, i<84>, i<56>, i<14>, i<19>, i<1>, i<88>, i<25>, i<83>, i<90>, i<22>, i<37>, i<88>, i<87>, i<21>, i<31>, i<39>, i<19>, i<24>, i<35>, i<41>, i<64>, i<24>, i<53>, i<94>, i<17>, i<68>, i<58>, i<37>, i<39>, i<42>, i<87>, i<5>, i<98>, i<26>, i<74>, i<82>, i<6>, i<60>, i<22>, i<75>, i<63>, i<64>, i<94>, i<69>, i<1>, i<58>, i<40>, i<30>, i<27>, i<63>, i<99>, i<19>, i<89>, i<38>, i<34>, i<77>, i<35>, i<83>, i<26>, i<21>, i<45>, i<40>, i<9>>;

auto sl = boost::hana::sort(l{});  //comment out line 1
using sl = brigand::sort<l>;        //comment out line 2

int main()
{
    return 0;
}

commenting out one of the two calls to the two respective sort algorithms I get: Brigand: 1>Task Performance Summary: 1> 0 ms AssignTargetPath 6 calls 1> 0 ms SetEnvironmentVariable 1 calls 1> 0 ms AssignProjectConfiguration 1 calls 1> 0 ms AssignCulture 1 calls 1> 0 ms Message 3 calls 1> 0 ms ReadLinesFromFile 1 calls 1> 2 ms SetEnv 4 calls 1> 2 ms Delete 1 calls 1> 4 ms MakeDir 9 calls 1> 5 ms Touch 2 calls 1> 5 ms WriteLinesToFile 1 calls 1> 406 ms Link 1 calls 1> 3827 ms ClangCompile 3 calls Hana: 1>Task Performance Summary: 1> 0 ms AssignTargetPath 6 calls 1> 0 ms SetEnvironmentVariable 1 calls 1> 0 ms AssignCulture 1 calls 1> 0 ms Message 3 calls 1> 1 ms ReadLinesFromFile 1 calls 1> 1 ms AssignProjectConfiguration 1 calls 1> 2 ms MakeDir 9 calls 1> 2 ms SetEnv 4 calls 1> 2 ms WriteLinesToFile 1 calls 1> 8 ms Touch 2 calls 1> 8 ms Delete 1 calls 1> 299 ms Link 1 calls 1> 22805 ms ClangCompile 3 calls

What am I doing wrong?

odinthenerd commented 8 years ago

More tests same method:

(elements in list is first number)

brigand: 100> 1987 ms ClangCompile 3 calls 50> 1701 ms ClangCompile 3 calls 40> 1589 ms ClangCompile 3 calls 30> 1386 ms ClangCompile 3 calls 20> 1384 ms ClangCompile 3 calls

hana: 100> 5929 ms ClangCompile 3 calls 50> 3893 ms ClangCompile 3 calls 40> 2386 ms ClangCompile 3 calls 30> 1583 ms ClangCompile 3 calls 20> 1401 ms ClangCompile 3 calls

ldionne commented 8 years ago

It's because you are instantiating the tuple in one instance (with Hana by calling l{}), and not with Brigand. My timings reported sorting a tuple in both cases, one with your algorithm as a back end and the other one with Hana's algorithm.

ldionne commented 8 years ago

There might be something else; my benchmarks only measured the worst case (a list sorted in reverse order). Perhaps this distribution of the elements in the list interacts in some funky way with memoization and skews the benchmarks. I'll see. But in all cases, I think we can be pretty confident that Brigand's sort is efficient enough for practical use cases, so this is mostly academic (and a possible improvement for Hana).

edouarda commented 8 years ago

This is also a moving target as compilers evolve. If we get satisfactory compilation speed, I am happy.

odinthenerd commented 8 years ago

Seems like reverse order should be a strong suit of the brigand implementation since fast tracking should fire more often than normal. Can't wait to do more benchmarks :)

edouarda commented 8 years ago

It's funny when you think that the fast lane approach was actually born out of the necessity to support MSVC 2013.

odinthenerd commented 8 years ago

Looks like your 'worst case' is actually where hana::sort does best. Its actually logical, if its getting elements in the exact reverse order it never has to walk the out list, the front is always the right spot for the next element:

list in reverse order : 2241 ms already sorted list : 9777 ms random numbers : 5179 ms

here is the code:

template<int I>
using i = std::integral_constant<int, I>;

using l = boost::hana::tuple<i<79>, i<67>, i<56>, i<28>, i<24>, i<58>, i<22>, i<16>, i<91>, i<48>, i<54>, i<94>, i<73>, i<51>, i<11>, i<29>, i<77>, i<48>, i<63>, i<69>, i<30>, i<48>, i<58>, i<8>, i<30>, i<18>, i<78>, i<52>, i<64>, i<99>, i<48>, i<46>, i<67>, i<2>, i<37>, i<95>, i<66>, i<3>, i<43>, i<48>, i<30>, i<85>, i<89>, i<45>, i<31>, i<98>, i<35>, i<7>, i<29>, i<43>, i<55>, i<19>, i<68>, i<23>, i<31>, i<70>, i<92>, i<50>, i<20>, i<51>, i<88>, i<81>, i<55>, i<29>, i<84>, i<93>, i<38>, i<72>, i<92>, i<40>, i<88>, i<89>, i<45>, i<42>, i<93>, i<61>, i<19>, i<79>, i<61>, i<28>, i<88>, i<56>, i<93>, i<85>, i<1>, i<25>, i<23>, i<87>, i<92>, i<20>, i<98>, i<69>, i<59>, i<23>, i<26>, i<52>, i<46>, i<3>, i<22>, i<6>>;
using l2 = boost::hana::tuple<i<0>, i<1>, i<2>, i<3>, i<4>, i<5>, i<6>, i<7>, i<8>, i<9>, i<10>, i<11>, i<12>, i<13>, i<14>, i<15>, i<16>, i<17>, i<18>, i<19>, i<20>, i<21>, i<22>, i<23>, i<24>, i<25>, i<26>, i<27>, i<28>, i<29>, i<30>, i<31>, i<32>, i<33>, i<34>, i<35>, i<36>, i<37>, i<38>, i<39>, i<40>, i<41>, i<42>, i<43>, i<44>, i<45>, i<46>, i<47>, i<48>, i<49>, i<50>, i<51>, i<52>, i<53>, i<54>, i<55>, i<56>, i<57>, i<58>, i<59>, i<60>, i<61>, i<62>, i<63>, i<64>, i<65>, i<66>, i<67>, i<68>, i<69>, i<70>, i<71>, i<72>, i<73>, i<74>, i<75>, i<76>, i<77>, i<78>, i<79>, i<80>, i<81>, i<82>, i<83>, i<84>, i<85>, i<86>, i<87>, i<88>, i<89>, i<90>, i<91>, i<92>, i<93>, i<94>, i<95>, i<96>, i<97>, i<98>, i<100>>;
using l3 = boost::hana::tuple<i<100>, i<99>, i<98>, i<97>, i<96>, i<95>, i<94>, i<93>, i<92>, i<91>, i<90>, i<89>, i<88>, i<87>, i<86>, i<85>, i<84>, i<83>, i<82>, i<81>, i<80>, i<79>, i<78>, i<77>, i<76>, i<75>, i<74>, i<73>, i<72>, i<71>, i<70>, i<69>, i<68>, i<67>, i<66>, i<65>, i<64>, i<63>, i<62>, i<61>, i<60>, i<59>, i<58>, i<57>, i<56>, i<55>, i<54>, i<53>, i<52>, i<51>, i<50>, i<49>, i<48>, i<47>, i<46>, i<45>, i<44>, i<43>, i<42>, i<41>, i<40>, i<39>, i<38>, i<37>, i<36>, i<35>, i<34>, i<33>, i<32>, i<31>, i<30>, i<29>, i<28>, i<27>, i<26>, i<25>, i<24>, i<23>, i<22>, i<21>, i<20>, i<19>, i<18>, i<17>, i<16>, i<15>, i<14>, i<13>, i<12>, i<11>, i<10>, i<9>, i<8>, i<7>, i<6>, i<5>, i<4>, i<3>, i<2>, i<0>>; 
auto sl = boost::hana::sort(l{}); 
odinthenerd commented 8 years ago

the difference between average and reverse order seems to increase exponentially, which is also logical: 100 elements: list in reverse order : 2241 ms already sorted list : 9777 ms random numbers : 5179 ms

150 elements: reverse order 2261 ms already sorted 24699 ms random 10480 ms

200 elements: reverse order: 2457 ms already sorted: crashes random numbers: 13951 ms

ldionne commented 8 years ago

Shit. Let's transfer this to https://github.com/boostorg/hana/issues/256.