edouarda / brigand

Instant compile-time C++11 metaprogramming library
Boost Software License 1.0
573 stars 48 forks

remove_if has max instantiation depth issues #127

Closed odinthenerd closed 8 years ago

odinthenerd commented 8 years ago

Since it's recursive it will only go to about 250 elements on clang and 500 elements on MSVC. It's not so trivial to fast track, though, since both removing most elements and removing only a few elements are possible scenarios.

I think the strategy I have been using, dealing with the last test result and testing the next type (and sometimes fast tracking) all at once, is not so well suited this time. Since we are going to do at least N instantiations of the predicate anyway, we could just do them all once with pack expansion and pass the results together with the inputs to the recursive function. That would make the recursion easier to write, and then we could just pattern match on 8 falses or 8 trues as a fast track. The problem is that the instantiation depth would then be very dependent on the data and thus far less deterministic.

What do you guys think?

On a side note, I wonder if exploiting the fact that x{(Ts*)0...} can be either a call to a constructor with a variadic argument list when the types differ, or a call to a constructor with an initializer list when all types are the same, could add some speed to all or none.


#include <initializer_list>

struct AllSameType {
    bool value;
    // Fallback: chosen when the arguments have different types.
    constexpr AllSameType(...) : value{ false } {}
    // Preferred by list-initialization when all arguments share one type.
    template<typename T>
    constexpr AllSameType(std::initializer_list<T>) : value{ true } {}
};

template<typename... Ts>
constexpr bool all_same() {
    return AllSameType{ ((Ts*)0)... }.value;
}

static_assert(all_same<int, int, int>(), "");
static_assert(!all_same<bool, int, float>(), "");
odinthenerd commented 8 years ago

I just realized that remove_if could actually be expressed as join<split<list, pred>>. We could fast track join deterministically, but split has the same non-deterministic fast-tracking problems as remove_if does.

We could rip off 16 elements at a time, call remove_if on them, and then join the result onto our previous results. That should give us N/16 + 16 instantiation depth, but I'm not sure it would be any faster. Although if we fast tracked on all 16 true or all 16 false we could certainly speed up the "typical" case.

ldionne commented 8 years ago

You can do it with just N instantiations of the predicate and no recursion at all. See how filter (remove_if_not) is implemented in Hana here.

ldionne commented 8 years ago

Basically you instantiate the predicate on every type of the list and store the result in a constexpr array. Then you generate a constexpr array containing the indices that must be kept in the resulting list, and you create a new list with those elements. If you use an amortized O(1) at metafunction to get the n-th element (like the multiple inheritance technique presented in this article), it's pretty fast.
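
For context, here is a compressed sketch of that approach (not Hana's actual implementation): the predicate results land in a constexpr array, a single constexpr pass collects the indices to keep, and pack expansion over those indices builds the result. This needs C++14 relaxed constexpr and std::index_sequence; std::tuple_element stands in for the fast multiple-inheritance `at`, the names `filter` and `kept_indices` are made up for the sketch, and the empty-list case is ignored for brevity.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

template <typename... Ts> struct list {};

template <std::size_t N>
struct index_array { std::size_t data[N]; std::size_t size; };

// One constexpr pass over the predicate results collects the indices of
// the elements to keep (C++14 relaxed constexpr: loops are allowed).
template <std::size_t N>
constexpr index_array<N> kept_indices(const bool (&keep)[N]) {
    index_array<N> r{};
    for (std::size_t i = 0; i != N; ++i)
        if (keep[i]) r.data[r.size++] = i;
    return r;
}

// The predicate is instantiated exactly once per element; no recursion.
template <template <typename> class Pred, typename... Ts>
struct filter {
    static constexpr bool keep[sizeof...(Ts)] = {Pred<Ts>::value...};
    static constexpr index_array<sizeof...(Ts)> idx = kept_indices(keep);

    // std::tuple_element plays the role of the O(1)-ish `at` metafunction.
    template <std::size_t... Js>
    static list<typename std::tuple_element<idx.data[Js],
                                            std::tuple<Ts...>>::type...>
    make(std::index_sequence<Js...>);

    using type = decltype(make(std::make_index_sequence<idx.size>{}));
};

static_assert(std::is_same<
    filter<std::is_integral, int, float, char, double>::type,
    list<int, char>>::value, "kept int and char");
```

The only recursion left is whatever the compiler uses internally to build make_index_sequence.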

edouarda commented 8 years ago

I'm afraid this won't be MSVC friendly @ldionne

odinthenerd commented 8 years ago

Its a cool idea though!

edouarda commented 8 years ago

Worth a try maybe?

odinthenerd commented 8 years ago

If we wanted it to be C++11 compatible, we would have to build the list of indexes using a recursive constexpr function. I'm not 100% sure it would still be faster than a recursive template with a fast track, but it probably would be.
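
For illustration, a hypothetical sketch of what such a C++11 helper could look like (here just counting the kept elements; `count_kept` is a made-up name). Note that C++11 constexpr functions are limited to a single return expression, so the computation has to be written recursively, and that recursion is itself depth-limited (clang's default constexpr depth is 512), which is part of the trade-off being weighed here:

```cpp
#include <cstddef>

// Recursive single-expression constexpr function, C++11-compatible:
// counts the true entries in the predicate-result array.
constexpr std::size_t count_kept(const bool* keep, std::size_t n) {
    return n == 0 ? 0 : (keep[0] ? 1u : 0u) + count_kept(keep + 1, n - 1);
}

constexpr bool results[] = {true, false, true, true};
static_assert(count_kept(results, 4) == 3, "three elements kept");
```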

edouarda commented 8 years ago

We can improve other parts first.

ldionne commented 8 years ago

@porkybrain You're right, I forgot you guys have to make it work on MSVC. Actually, this might not even work on GCC, since its support for constexpr is quite broken. Nevermind.

odinthenerd commented 8 years ago

I would like to get it working faster. After getting sort so fast, remove_if, split, and to a lesser extent join are probably the biggest performance bottlenecks in kvasir.io. Especially since lists of 500 elements are quite possible and clang is gaining traction in the embedded field, I would like it to work with clang defaults.

jonathanpoelen commented 8 years ago

remove_if<L, Pred> = partition<L, Pred>::second_type, and partition works with more than 1800 elements with clang.

brunocodutra commented 8 years ago

Just found out about your work on Brigand, which looks very similar to my work on Metal, so perhaps we can share some ideas.

You may implement remove_if and other similar algorithms using [tree-like folds](https://en.wikipedia.org/wiki/Fold%28higher-order_function%29#Linear_vs._tree-like_folds) that only recurse Log2(N) times (I call it reduce). Basically you transform the list using variadic expansion so that every element that should stay in the list is mapped to a list containing that single element, while elements that should be removed are mapped to an empty list. Then you reduce the list of lists into a single list by concatenating the inner lists pairwise.

I've successfully tested this approach for lists containing several thousand elements without needing to increase the limits of recursive instantiation. Please refer to metal::remove_if.
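
A minimal sketch of the mapping step being described, under assumed names (`list`, `boxed`, `reduce`); for brevity the reduce shown here is a plain linear fold, whereas Metal's reduce concatenates the inner lists pairwise, tree-style, to get the Log2(N) recursion depth:

```cpp
#include <type_traits>

template <typename... Ts> struct list {};

// Map step: an element to be removed becomes list<>, one to be kept
// becomes list<T>, all in a single pack expansion (no recursion).
template <template <typename> class Pred, typename T>
using boxed = typename std::conditional<Pred<T>::value, list<>, list<T>>::type;

// Concatenate two lists.
template <typename L1, typename L2> struct join2;
template <typename... Ts, typename... Us>
struct join2<list<Ts...>, list<Us...>> { using type = list<Ts..., Us...>; };

// Reduce step, shown as a linear fold for brevity (Metal pairs the lists
// up instead, so its depth is logarithmic rather than linear).
template <typename... Ls> struct reduce { using type = list<>; };
template <typename L> struct reduce<L> { using type = L; };
template <typename L1, typename L2, typename... Ls>
struct reduce<L1, L2, Ls...> : reduce<typename join2<L1, L2>::type, Ls...> {};

template <typename L, template <typename> class Pred> struct remove_if;
template <template <typename> class Pred, typename... Ts>
struct remove_if<list<Ts...>, Pred> : reduce<boxed<Pred, Ts>...> {};

static_assert(std::is_same<
    remove_if<list<int, float, char>, std::is_integral>::type,
    list<float>>::value, "integral elements removed");
```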

odinthenerd commented 8 years ago

Not a bad idea. How does your reduce-based join compare to a plain old fast-tracked recursive join in compilation speed? Something like this (excuse my lazy pseudocode)

template<typename... T1s, /* ...14 more packs... */ typename... T16s, typename... Ls>
struct join<list<T1s...>, /* ... */ list<T16s...>, Ls...> : join<list<T1s..., /* ... */ T16s...>, Ls...> {};

should be faster on smaller (fewer than 200 elements) lists anyway. I do agree there is a point where N/16 + N%16 is not less than Log2(N), but I would also argue that the constant is much higher in your case. The question is whether lists big enough to merit your method are actually practical in context.

brunocodutra commented 8 years ago

Well, to be honest, I ruled out optimizations such as unrolling recursive instantiations during the development of Metal, as it leads to excessive repetition of boilerplate code, which in turn makes abuse of the preprocessor too tempting to resist; I made a point of keeping Metal free of macros and the LOC count as low as I possibly could. Later this evening I will give you exact numbers, but if I remember well it takes just a second or two on my personal laptop to join something like 1000 lists.


@porkybrain Just benchmarked metal::join on my personal laptop for a thousand distinct pairs; the distinctness is important to note, since otherwise the results would be biased by template memoization.

| compiler  | time elapsed |
|-----------|--------------|
| GCC 5.3   | 4.6 s        |
| Clang 3.9 | 2.7 s        |

Most probably a fast-tracked recursive join would be much faster than a purely reduced join for a sufficiently small number of elements, but perhaps you could combine both approaches: instead of reducing the problem down to joining lists pairwise, it could be reduced down to groups of 16 or fewer (assuming you unroll up to 16). That would probably overcome your recursion depth limits while keeping it fast for smaller numbers of elements.

Alternatively, if you guys are determined to make it as fast as possible, you could experimentally determine some N above which you use reduce and below which you fall back to fast-tracked linear recursion, in a way that never exceeds recursion depth limits.
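
A sketch of that dispatch idea, with hypothetical stub algorithms and an arbitrary placeholder threshold of 256; the point is that std::conditional only names the losing branch without completing it, so the unused algorithm costs nothing:

```cpp
#include <type_traits>

// Hypothetical stubs standing in for the two real algorithms; only the
// dispatch mechanism matters here.
template <typename... Ts> struct linear_remove_if { using type = int;  };
template <typename... Ts> struct tree_remove_if   { using type = long; };

// Pick the algorithm by pack size.  Only the selected branch is ever
// instantiated, because ::type is requested on the chosen class alone.
template <typename... Ts>
using remove_if_auto = typename std::conditional<
    (sizeof...(Ts) > 256),
    tree_remove_if<Ts...>,
    linear_remove_if<Ts...>>::type::type;

static_assert(std::is_same<remove_if_auto<int, char>, int>::value,
              "small pack dispatches to the linear algorithm");
```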

odinthenerd commented 8 years ago

I hate the preprocessor too ;) Fast tracking usually only adds one specialization to an already existing template, so it's not going to bloat the code too much.

I'm not sure I speak for the group of devs here, but I am really interested in getting compile time down and max size up on a few particular algorithms.

Here is an implementation of what I described above:

    namespace detail
    {
        // Fast-tracked join: the 16-list specialization below consumes 16
        // lists per instantiation, cutting recursion depth to roughly N/16.
        template< typename L1, typename...Ts>
        struct super_join
        {
            using type = L1;
        };

        template< typename... Ts, typename... Us, typename... Vs >
        struct super_join<list<Ts...>, list<Us...>, Vs...> : super_join<list<Ts..., Us...>, Vs...> {};

        template< typename...Ts, typename...T1s, typename...T2s, typename...T3s, typename...T4s, typename...T5s, typename...T6s, typename...T7s, typename...T8s, typename...T9s, typename...T10s, typename...T11s, typename...T12s, typename...T13s, typename...T14s, typename...T15s, typename...T16s, typename... Us >
        struct super_join<list<Ts...>, list<T1s...>, list<T2s...>, list<T3s...>, list<T4s...>, list<T5s...>, list<T6s...>, list<T7s...>, list<T8s...>, list<T9s...>, list<T10s...>, list<T11s...>, list<T12s...>, list<T13s...>, list<T14s...>, list<T15s...>, list<T16s...>, Us...> : 
            super_join<list<Ts..., T1s..., T2s..., T3s..., T4s..., T5s..., T6s..., T7s..., T8s..., T9s..., T10s..., T11s..., T12s..., T13s..., T14s..., T15s..., T16s...>, Us...> {};

        template<typename T, typename U>
        struct remove_if_impl;
        template<typename U>
        struct remove_if_impl<list<>, U> {
            using type = list<>;
        };
        // Map each element to list<> (remove) or list<Ts> (keep) in one pack
        // expansion, concatenate with super_join, then rewrap in L.
        template<template<typename...> class L, typename Pred, typename...Ts>
        struct remove_if_impl<L<Ts...>, Pred> {
            using type = brigand::wrap<typename super_join<typename std::conditional<brigand::apply<Pred, Ts>::value, list<>, list<Ts>>::type...>::type, L>;
        };
    }
    namespace lazy
    {
        template <typename L, typename T>
        using remove = detail::remove_if_impl<L, std::is_same<brigand::_1, T>>;
    }

    template <typename L, typename T>
    using remove = typename lazy::remove<L, T>::type;

super_join is only needed because brigand::join is not fast tracked.

This remove/remove_if is about 5x faster than what is in brigand now and does not have instantiation depth issues.

odinthenerd commented 8 years ago

@brunocodutra I like the purity of your library; however, I'm not sure what niche it fills. It seems to me either you have a modern compiler, in which case the constexpr metafunction interface of hana is pretty hard to pass up, or you don't, in which case the slightly less stringent (and maybe slightly faster when we're done) brigand seems like a better fit. I'm not sure there is a lot of room in the middle.

brunocodutra commented 8 years ago

@porkybrain I figured you would have to provide specializations for every number of elements from 0 up to some large N, but it does seem an interesting compromise to provide just one specialization for some large N and rely on linear instantiation for smaller numbers of elements. I hope it doesn't trigger any ambiguity, though indeed it doesn't look like it would.

brunocodutra commented 8 years ago

@porkybrain Perhaps I didn't get you right, but I don't think Metal is any more stringent than Brigand. It works on Clang 3.2+, GCC 4.8+ and MSVC 14 (that I know of), and I only dropped support for GCC 4.7 and MSVC 12 halfway through development, when I realised I was wasting too much time pleasing stupid nonconforming compilers. I guess I could work around their annoying bugs, but so far I haven't felt the need to.

My goal with Metal is to fill the exact niche Boost.MPL does, only on more modern compilers, solving its unmaintainability problem in the long run. In fact I just proposed merging Metal and Boost.MPL on the developers list, and that's when I was told about Brigand.

Also, even if it could be faster, I don't see the point if we consider pure type computations. I mean, a couple of seconds for a few thousand elements should be fine, considering one should rarely need to deal with more than a few dozen elements. Or do you guys have a use case to prove me wrong? I'd love to know about it.

odinthenerd commented 8 years ago

@brunocodutra The Kvasir library encodes chip-specific SFR rules into types and performs lazy evaluation and optimization on lists of these types at compile time. One thing with huge (10x) optimization potential is the initialization of these small chips during boot-up, but on a big design that can mean 2k register interactions all in one list. That's what I'm working towards. I could also imagine state machines with more than 1k possible transitions (especially if one makes the declaration of state machines sexier). I think the use cases for fast MPL libs are missing because fast MPL libs are missing. Essentially there are two kinds of users: those who will upgrade from MSVC12/GCC4.7 in like 10 years, and those who will in 1 year; those who will in 1 year are going to migrate to hana in 2 years ;) Don't get me wrong, I like your lib a lot, I'm just playing the contrarian.

brunocodutra commented 8 years ago

@porkybrain oh no need to worry :)

Interesting, I'd never have guessed such use cases existed. The most demanding task I ever attempted with TMP was implementing unification and SLD resolution to create a Prolog framework to aid in the formal definition of types; that's when I realized Boost.MPL isn't up to the task anymore these days and is beyond any possibility of modernization.

Anyway, as we are diverging from the topic, I'd just like to recommend you take a deeper look into Metal, as I have spent a lot of time tweaking specific tricks to boost the performance of specific algorithms. Sometimes these are not obvious AT ALL and can show enormous differences in performance between similar compilers. A couple in particular I'd recommend you take a look at are metal::distinct and metal::enumerate, which really took me a while to get right.

odinthenerd commented 8 years ago

@brunocodutra I thought of another good example: if you are doing matrix math, preferably lazily, each element of your matrix could also be a different type if you are also using dimensional analysis (like boost unit). Even for a relatively small 3-dimensional matrix you are dealing with a lot of elements. I recently heard of the need for a matrix library which tolerates heterogeneous types and goes up to something like 10 dimensions with lazy evaluation. I personally have no idea what the guy was calculating (something with MRI imagery), but there certainly is a need for super complex MPL calculations in many domains.

Giving your lib a closer look, it is quite evident that you have a high skill level and are a creative metaprogrammer (there are many who know all the MPL design patterns by heart but few who are creative, hence creativity is king). On the other hand, there are probably 20-30 guys out there who have written a Boost.MPL successor. I was one of them (with KMPL); however, the few lines of code I have contributed to brigand are no doubt of far higher value than the few thousand lines I wrote for KMPL (which is dead now).

Of the several Boost.MPL-like libs I know of, yours is the best I have seen so far with regard to purity; however, in our niche purity is not king. Louis is changing the metaprogramming paradigm with hana, and I am a firm believer that we are all going to be using his lib or something like it in a few years. Hence he can ask the compiler devs or the standard committee to make his pure lib fast. That pretty much defines the niche we (brigand, metal, MPL11, KMPL etc.) live in and its constraints. Here I only see two important factors: speed and old compiler support. We can't really go to the compiler devs and ask for our pure code to be made fast; any compiler version they will make changes to will also support hana, which (sorry brigand guys) trumps brigand. Our niche is to use every dirty trick in the book to squeeze the most performance out of broken compilers ;). At least that's brigand's mission as far as I see it (I have basically no say in what goes on here, so take that with a grain of salt).

I probably come off as having a pet peeve about speed; however, I know of two projects (Kvasir and a new state machine lib) whose fate rests upon a 5x improvement in the speed of brigand, and I am willing to bet there are literally hundreds more. With sort we have gotten there; with remove_if I'm sure we will get there. Maybe we will need to make a few versions for different compilers and #ifdef around like heathens, but I am willing to bet that the fastest lib will win in this niche because people need the speed.

In conclusion, sorry for spamming up the thread. I would strongly advocate merging Metal and brigand (which will occur either way, because I am looking at all your good ideas and thinking about how to get them to work with MSVC12). Or, if you are more interested in going down in history, then helping Louis with hana would be my best bet on how to do that; writing the thing that hana will have crushed probably isn't ;)

brunocodutra commented 8 years ago

@porkybrain I'm glad to realize you perceive Metal exactly as I intended it. We only disagree in that I do believe there's something between ultra fast Brigand and super expressive Boost.Hana. Sometimes all one needs is a simple way to express a few concepts (think C++17) or to do some mainstream metaprogramming. For such cases Metal will do.

Now do please put Metal to use. I spent a lot of time trying to think outside the box and find clever new ways to express simple algorithms, and I'd be glad if you find anything of use to you. Also, don't hesitate to reach out to me if you need any help.

edouarda commented 8 years ago

@brunocodutra You are welcome to contribute to Brigand if you feel it lacks features, and I'm sure @ldionne will welcome suggestions if you have ideas for wrappers around Boost.Hana.

odinthenerd commented 8 years ago

@brunocodutra I'm not sure the 'casual' metaprogrammer exists; in my mind there are only a few hundred of us, and we mostly build libraries/infrastructure for everyone else.

jfalcou commented 8 years ago

OK, back on track. What's the status on this?

odinthenerd commented 8 years ago

I found a good solution leveraging append. I've only benchmarked it on clang so far, but it was much faster; I will make a pull request tomorrow.

odinthenerd commented 8 years ago

@brunocodutra brigand's 16x fast-tracked join is more than 10x faster than metal's join with a list of 1000 pairs on my machine (on gcc and clang). Fast tracking is crucial if you are working with big lists. Add me on LinkedIn or send me an email for further discussion (no need to hijack issues here ;))