This issue seems too broad to be of much practical use.
In cases like this it's best just to give an exhaustive list of suggested changes. Everything is data, so I'm not sure what a data argument is.
The argument order for `convert` is very firmly established, and this function is extremely important, so we're not going to change it. But please feel free to list other examples.
To elaborate a bit, the argument order for `convert` matches `call`; `convert(T, x)` and `T(x)` are related.
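For concreteness, a minimal illustration of that correspondence (my example, using 0.4-era syntax):

```julia
# `convert` puts the target type first, mirroring call syntax:
convert(Float64, 1)  # 1.0
Float64(1)           # 1.0 — constructor-style call, related to convert(Float64, 1)
```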
I'm new to Julia, but I can list examples as I come across them. Anything in the match family, so `ismatch`, `match`, `eachmatch`, `matchall`. `map` and `broadcast` are also not very compatible with chaining, but I can see why the argument order makes sense. I guess by data I mean the argument that's most likely to be chained. This might be a keep-in-mind-for-the-future issue rather than a go-back-and-change-everything issue. It seems like in Julia, function arguments and type arguments tend to come first (perhaps in resonance with `call`), which has the side effect of filling my code with a bunch of underscores (via `@as _ begin`).
Making the function the first argument to `map` is universal. Is there even one language that doesn't do that?
The match functions are more debatable, but Python uses this argument order, as do other OO languages like Ruby, which uses `re.match(string)`.
You are right that Julia's libraries were not designed with this chaining thing in mind. Fully admitting my Lisp bias, I've never wanted many different syntaxes for function calls.
R's apply functions and plyr functions all have data arguments first (with the notable exception of `mapply`).
Edit: not to mention `as.numeric`, `as.character`, and family.
And the notable exception of R's `Map` function.
Another one is `write`.
Well then we appear to be at an impasse. It looks like vast numbers of key functions (`write`, `convert`, `map`, ...) are incompatible with this redesign. As you point out, changing all of these is not really practical. However, these functions are so fundamental that it wouldn't be practical to do things differently in the future either; certainly we can't have half of our I/O functions use `f(obj, io)` and half go the other way.
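For reference, a quick illustration of the existing stream-first convention (using the 0.4-era API):

```julia
io = IOBuffer()
write(io, "hello")   # stream first
print(io, 42)        # same convention
show(io, [1, 2, 3])  # and again
takebuf_string(io)   # collect the accumulated output (0.4-era API)
```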
I think there's an argument to be made for switching the argument order for `match`. At present it's inconsistent with `search`, `replace`, and the rest of the string functions.
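To make the inconsistency concrete (0.4-era signatures):

```julia
match(r"an", "banana")         # pattern first
search("banana", "an")         # string first
replace("banana", "an", "on")  # string first
```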
...but `findin` uses the same order as `match`. The good thing with `match` and co. is that their arguments can easily be swapped and a deprecation added; there's no ambiguity thanks to the `Regex` argument type. That's harder for `findin`, though the function could also be renamed to work around this (cf. https://github.com/JuliaLang/julia/issues/10593).
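A minimal sketch of that swap-plus-deprecation idea (hypothetical; nothing like this was actually merged):

```julia
# hypothetical string-first method that forwards to the existing one:
Base.match(s::AbstractString, re::Regex) = match(re, s)

# in Base itself the implementation would move to the new order, and the
# old signature would then get `@deprecate match(re::Regex, s) match(s, re)`
match("banana", r"an")  # same result as match(r"an", "banana")
```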
I'm all for consistency, though honestly I would rather have changed all functions to search for their first argument in their second one. Maybe that's just me. Anyway, for chaining, it's not clear to me whether you'd more often pass one or the other (both are "data").
Ok, here's a makeshift solution. It allows users to either vectorize or switch around the arguments of functions, or both.
```julia
using Lazy

# convert singletons to a 1-entry vector
function to_array(x)
    if (@> x typeof) <: AbstractArray
        x
    else
        [x]
    end
end

# switch the first and second items in a tuple
function switch_tuple(tuple)
    if (@> tuple length) == 2
        index = [2, 1]
    else
        index = [2; 1; 3:(@> tuple length)]
    end
    tuple[index]
end

# return an expression which suffixes a function and reorders its arguments
function switch(function_symbol::Symbol)
    suffixed_function_string = string(function_symbol) * "_s"
    suffixed_function_symbol = @> suffixed_function_string parse
    quote
        function $suffixed_function_symbol(arguments...)
            $function_symbol((@> arguments switch_tuple)...)
        end
        @> $suffixed_function_string parse
    end
end

# return an expression which suffixes a function
# and maps/broadcasts it over an argument/arguments respectively
function vectorize(function_symbol::Symbol)
    suffixed_function_string = string(function_symbol) * "_v"
    suffixed_function_symbol = @> suffixed_function_string parse
    quote
        function $suffixed_function_symbol(arguments...)
            arguments = map(to_array, arguments)
            if length(arguments) == 1
                map($function_symbol, arguments...)
            else
                broadcast($function_symbol, arguments...)
            end
        end
        @> $suffixed_function_string parse
    end
end

# vectorize some functions
@> :vectorize vectorize eval
@> :switch vectorize eval
@> :eval vectorize eval

# lists of functions to reverse and vectorize, or just vectorize
reverse_and_vectorize = [:ismatch, :write, :convert]
just_vectorize = [:replace]

@> begin
    # first reverse functions in reverse_and_vectorize
    reverse_and_vectorize
    switch_v
    eval_v
    # add in just_vectorize functions
    vcat(just_vectorize)
    # vectorize
    vectorize_v
    eval_v
end

# test
@> ["a", "b"] ismatch_s_v(r"a")
```
See also JuliaLang/julia#8450
The argument switch suffers (I think) from unnecessarily copying the arguments. Is there a way to unpack a tuple in a particular order?
Edit: fixed anonymous function issue
@bramtayl why an anonymous function (bound to a non-const global) for `to_array`?
Oops, I'm still getting used to Julia-style function definitions. See edit. It's there to be able to broadcast singletons.
That's kind of clever. But I don't see how this is a huge improvement over `write(io, _)`, or just using normal syntax. If you know about `write`, it's easy to see what `write(io, _)` does, while `write_s(io)` and `@>` seem pretty obscure to me.
I also think (#8450) that ideally iteration is something you do with an existing function, not something that requires a new definition for each function. Nobody should decide which functions get `_v` versions; you can write `map(f, x)` when needed. Or maybe this could be part of the operator; for example `@.> A write(io, _)` could mean `for x in A; write(io, x); end`. But again I would argue the `for` loop version is intelligible even to people who don't know the language.
The advantage of `write_s` would be that you can write out a chain without naming anything. Of course, names make debugging easier, but they also make code harder to read.
```julia
@> begin
    text
    # a whole bunch of string processing
    write_s(conn)
end
```
Without that, your code would look like this:
```julia
@as _ begin
    text
    # string processing with a bunch of unnecessary _'s
    write(conn, _)
end
```
For me, the greater regularity of reusing the same `write` function, and not needing to set up a definition to make `write_s` exist, make the second version the winner. Maybe others will weigh in.
Given that chaining makes many function arguments implicit (and therefore makes line-local reasoning more difficult), I generally think it makes code harder to read. I also agree that having a single canonical `write` function is more important than accommodating macro-based DSLs.
If I had to `map`, `broadcast`, or `for` loop (!) every time I use a function iteratively (pretty much always) AND had to write code that was riddled with underscores, I'd probably give up and go back to R. Consider the chain above without any help:
```julia
reverse_calls = map(switch, reverse_and_vectorize)
reverse_symbols = map(eval, reverse_calls)
both_symbols = vcat(reverse_symbols, just_vectorize)
vectorize_calls = map(vectorize, both_symbols)
vectorize_symbols = map(eval, vectorize_calls)
```
Useful for debugging, but it doesn't seem likely that any of these items cluttering up the environment will be used again.
or, with underscores:
```julia
@as _ begin
    reverse_and_vectorize
    map(switch, _)
    map(eval, _)
    vcat(_, just_vectorize)
    map(vectorize, _)
    map(eval, _)
end
```
Not even going to bother with for loops.
Yes, obviously there is such a thing as too much chaining. You might argue that the argument switching and vectorization should be done in two separate chains. But chaining also organizes code and clarifies structure.
> If I had to `map`, `broadcast`, or `for` loop (!) every time I use a function iteratively (pretty much always) AND had to write code that was riddled with underscores, I'd probably give up and go back to R.
To me, a short vectorization syntax is a major need and that's what #8450 is supposed to deal with. In your example, the code would be much shorter already, and you might accept suffering a few underscores for chaining if vectorization allowed merging a few lines of the chain.
Also note that R does not provide any native support for chaining, so it's not like this kind of thing couldn't be done in Julia as well.
As @JeffBezanson said before, we seem to be at an impasse. This issue seems primarily focused on code aesthetics and it seems that several other Julia developers don't share your aesthetic sensibilities.
It sounds like there are several specific functions, like `match`, that people would consider changing for consistency. But consistency for the sake of simplifying chaining doesn't seem sufficient to justify making so many breaking changes.
For me the decisive issue when debating these kinds of DSL-specific concerns is this: given that you want alternative surface syntax for writing identical semantics, why not just write an actual DSL that gets translated to Julia code? Why does Julia syntax need to match the syntax of your ideal DSL?
I think people tend to overuse shared-parser DSLs for this kind of use case. If you want truly independent syntax, a separate-parser DSL is the way to go. It has a higher start-up cost for the DSL developer, but completely frees you from having to reach consensus with others about your preferred syntax.
If you play your cards right, you might get armies of @hadley followers switching to Julia in the next few years, all of whom are pretty used to chaining (and the kind of things in DataFramesMeta). I'm certain I couldn't tackle writing a new language. But maybe a package?
I, for one, don't see that as a goal worth pursuing given that I work on Julia in my free time. I'd vastly prefer having a language that can be used for the things that R will never be good at than a language that tries to emulate what R can already do well enough.
The problem with the idioms you're advocating for is that they don't come equipped with any fleshed out solutions to the issues of semantics that have held back work on #8450. The surface syntax of a replacement for vectorization is the least difficult part of what needs to be done to remove vectorization from Julia. The important issue is designing a set of semantics that's amenable to compilation to efficient code. That depends on progress on integrating functions into Julia's type system in such a way that multiple dispatch can operate effectively when using higher-order functions. See Jeff's thesis for some ideas about how this might be done and packages like FastAnonymous.jl for interim improvements.
For most applications, the bottleneck is how long it takes to write the code, not how long it takes to run the code.
That's completely false when you work at scale.
Conceded. I thought the point of Julia was to be the best of both worlds. Otherwise, why not just write in Fortran?
@bramtayl I think julia does give a lot of flexibility (more than I've ever seen elsewhere) to have the best of both worlds (although there are still a lot of rough edges, but those are being worked out), and maybe you can accomplish what you want in a package, with all of the power of multiple dispatch and julia macros behind you...
The problem is that we don't have any means for reaching an agreement about what "best" means. My take on this issue is that many of the people involved in this thread have very substantial disagreements about what good code looks like. I'm skeptical that we can resolve such large disagreements about aesthetics by talking them through.
Ok, I'll just keep the code for personal use only.
I agree that productivity is incredibly important, but I don't see how something like chaining syntax is drastically more productive than our normal syntax. As for vectorization, if I thought writing `map` every time was a good solution, then #8450 would not be an open issue.
I just realized that it's a bit odd for chaining to work on the first argument. In languages with function currying, delayed arguments are added at the end. For example you could write

```julia
x |> map(switch) |> map(eval) |> vcat(_, just_vectorize) |> map(vectorize) |> map(eval)
```

because `map(f)` means `x -> map(f, x)`. Maybe our functions are designed more for this style.
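A minimal sketch of what that currying could look like in Julia (a hypothetical single-argument method, not in Base):

```julia
# hypothetical curried method: map(f) returns a closure awaiting the data
Base.map(f::Function) = x -> map(f, x)

[1, 2, 3] |> map(x -> 2x)  # [2, 4, 6]
```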
Maybe it's worth working for consistency in the other direction then? Is that piping to the last argument or to the second argument?
Yes, that's quite possible. I think it should pipe to the last argument.
Here's an extension to wholesale-vectorize all the functions in a module.
```julia
using Lazy
using DataFrames
using DataFramesMeta

@> :typeof vectorize eval
@> :eval vectorize eval
@> :string vectorize eval
@> :convert switch eval
@> :ismatch switch eval vectorize eval

function get_functions(m::Module)
    df = @> begin
        DataFrame(symbol = @> m names)
        @transform(
            is_function = (@> begin
                :symbol
                eval_v
                typeof_v
                .==(Function)
            end),
            compatible = (@> begin
                :symbol
                string_v
                ismatch_s_v(r"^[A-Za-z]")
                convert_s(Vector{Bool})
            end))
        @where(:is_function & :compatible)
    end
    df[:symbol]
end

@> Base get_functions vectorize_v eval_v
```
Could somebody please explain the use of `_` in the above example? (Again, sorry for the newbie question; it's just that the only thing I can find with Google is about IJulia history variables, and the JuliaLang docs can't seem to find anything that isn't an alphanumeric string...)
`@as` is a macro from Lazy.jl. The example given in the Readme (worth checking out for context) is:
```julia
# @as lets you name the threaded argument
@as _ x f(_, y) g(z, _) == g(z, f(x, y))
```
The benefit of `@as` is that you can specify exactly where you want the previous result to be piped into the next expression. It is needed in particular if there is no consistent method of figuring out where to pipe the previous result (i.e. the first argument, the last argument, etc.). `_` is only a symbol, and `@as ~` would work equally well were it not for interfering with formulas.
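For instance (a small example of my own, not from the Lazy Readme):

```julia
using Lazy

# thread 1:10 through expressions, naming the threaded value _
@as _ 1:10 filter(iseven, _) sum(_)  # 2 + 4 + 6 + 8 + 10 = 30
```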
Jeff's currying example is from some other language, but you can imagine a roughly equivalent function.
> wholesale-vectorize all the functions in a module
I'm really not a fan of this. It's clearly the wrong abstraction: instead of the function and the iteration being treated as orthogonal (which they really are), it doubles the number of definitions in a module without regard for which of the new definitions actually make sense. Concepts should be composed using general mechanisms, not by concatenating names with underscores.
I continue to fail to understand the advantage of `@> m names` over `names(m)`. Isn't this just deliberately obscure?
Yeah, I was trying to use chaining as often as possible for illustrative purposes. Might have gone a bit overboard. Agreed that doubling the number of functions in a module is a little ridiculous, but until #8450 gets sorted out it might be useful, especially if no one else starts using `_v` for something else.
It's also worth noting that the code above can be rewritten with Lazy's `@>>`, which pipes to the last argument. This wouldn't have worked for other string-processing functions like `replace`, though.
```julia
using Lazy
using DataFrames
using DataFramesMeta

@> :vectorize vectorize eval
@> :eval vectorize eval
@> [:typeof, :string, :ismatch] vectorize_v eval_v

function get_functions(m::Module)
    df = @> begin
        DataFrame(symbol = @> m names)
        @transform(
            is_function = (@>> begin
                :symbol
                eval_v
                typeof_v
                .==(Function)
            end),
            compatible = @>> begin
                :symbol
                string_v
                ismatch_v(r"^[A-Za-z]")
                convert(Vector{Bool})
            end)
        @where(:is_function & :compatible)
    end
    df[:symbol]
end

@> Base get_functions vectorize_v eval_v
```
Edit: an extension for multiple packages:
```julia
function make_functions(m::Module)
    quote
        @> $m get_functions switch_v eval_v
        @> $m get_functions vectorize_v eval_v switch_v eval_v
    end
end

@> :make_functions vectorize eval
@> [Base, Lazy] make_functions_v eval_v
```
I have to say that I find this style of coding pretty inscrutable – it doesn't seem like an improvement in terms of readability or writability. But I'm glad that the macro system lets you experiment like this.
I heard from some people that they like threading/piping because it lets them always read code "left to right, top to bottom", and with nesting/composition they have to find where the expression starts and where it continues to.
Some like to reason about code by describing it with phrases, and it's harder to come up with words to describe `print(sum(map(x -> x - 10, map(x -> 2x, A))))` than it is to describe `@>> A map(x->2x) map(x->x-10) sum print`; the latter is pretty straightforward: "I have A, I multiply every element by 2, then I subtract 10 from every element, then I sum it, then I print it."
Some other people have said they get lost in nesting easily, always reading expressions all at once. And some other people just said threading is cooler looking :P
I have no idea how to extend this to work with macro functions, seeing as you can't use the splat operator with them.
> the latter is pretty straightforward: "I have A, I multiply every element by 2, then I subtract 10 from every element, then I sum it, then I print it."
Which also exemplifies one additional usage pattern where this style of appending operations at the end helps: building expressions step by step at the REPL while looking at the output, shell-style (if performance is not your primary concern, obviously).
Seems like this is a dup of #5571?
Yes, I think this discussion can be continued in #5571.
> Making the function the first argument to `map` is universal. Is there even one language that doesn't do that?
- Ruby: `list.map! {|x| x + 1 }`
- Elixir: `Enum.map list, fn(x) -> x + 1 end`
- JS: `list.map(function(x) { x + 1 })`
- Ugly Java: `list.stream().map(x -> x + 1).toArray()`
While all four of those are OO, the argument order kind of mimics the functional style in that the list goes first and the function goes second. Python and Clojure use the other convention, where the function goes first.
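For comparison, Julia follows the function-first convention (a small illustration of my own):

```julia
list = [1, 2, 3]
map(x -> x + 1, list)  # function first, as in Python and Clojure: [2, 3, 4]
```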
It would be convenient to have data arguments consistently as the first argument. This is particularly useful for chaining. A few examples where the argument order is puzzling: `convert`, `ismatch`.