Futrell / cliqs

Crosslinguistic investigations in quantitative syntax

Testing the DLM hypothesis on context-free language corpora? #5

Open rht opened 7 years ago

rht commented 7 years ago

There could be universals that are easier to uncover in context-free languages. Programming-language treebanks are almost nonexistent, so if I were to construct one, I would draw the source from formalized mathematical proofs and established software protocols, both of which have been implemented in various languages.

Futrell commented 7 years ago

I've thought about this a bit. Have we met? If you're in the Cambridge area and you're interested in this stuff then it would be fun to meet.

I do think programmers prefer short dependencies in their code, and that the programming languages people find "easy to understand" probably have shorter dependencies than the ones they find hard. For example, it seems that method-chaining styles for higher-order functions like map and filter have been growing in popularity. Choosing method-chaining style over function-application style reduces dependency length. Compare:

map(h, map(g, map(f, x)))   (long dependency length, relatively hard to understand)
x.map(f).map(g).map(h)      (short dependency length, relatively easy to understand; trendy new languages usually do it this way)

In general, x.f(y) has shorter dependency length from functions to arguments (treating the object instance as an argument) than f(x, y). I think this is part of why LISP is often thought of as hard to understand, and why object-oriented languages with this syntax grew in popularity.
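To make the f(x, y) versus x.f(y) comparison concrete, here is a minimal sketch that sums dependency lengths for a nested call written both ways. The annotation scheme is an assumption made for illustration, not something fixed by this thread: punctuation is ignored, each function name is the head of its arguments (including the receiver), and a nested call is headed by its function name. The names f, g, x, a, b are hypothetical.

```python
def total_dependency_length(arcs):
    """Sum of |head - dependent| over (head_index, dependent_index) pairs."""
    return sum(abs(head - dep) for head, dep in arcs)


# Function-application style: f(g(x, a), b)
# Token indices (punctuation ignored): f=0, g=1, x=2, a=3, b=4
application_arcs = [
    (0, 1),  # f -> g   (the nested call is f's first argument)
    (0, 4),  # f -> b
    (1, 2),  # g -> x
    (1, 3),  # g -> a
]

# Method-chaining style: x.g(a).f(b)
# Token indices (punctuation ignored): x=0, g=1, a=2, f=3, b=4
chaining_arcs = [
    (1, 0),  # g -> x   (the receiver is treated as an argument)
    (1, 2),  # g -> a
    (3, 1),  # f -> g   (the result of g is f's receiver)
    (3, 4),  # f -> b
]

application_dl = total_dependency_length(application_arcs)
chaining_dl = total_dependency_length(chaining_arcs)
print(application_dl, chaining_dl)  # 8 5
```

Under these assumptions the saving comes mostly from the f -> b arc, which no longer has to span the whole nested call. Other annotation choices, such as counting punctuation tokens, may change the totals.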

There aren't programming language treebanks, but it should be possible to measure dependency length in programming languages by parsing the language to an AST and comparing the true linear order to random reorderings with the same AST/semantics.
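A rough sketch of that comparison, on a toy tree rather than a real parser's output: the attested token order is scored against random reorderings that keep the same head-dependent tree. The tree encoding, the node names, and the choice of random projective linearizations as the baseline are all assumptions made for illustration. The reordered strings are generally not valid programs.

```python
import random


def dependency_length(order, children):
    """Total |head - dependent| distance for a given linear order of nodes."""
    position = {node: i for i, node in enumerate(order)}
    return sum(
        abs(position[head] - position[dep])
        for head, deps in children.items()
        for dep in deps
    )


def random_linearization(node, children, rng):
    """Shuffle each head among its child subtrees, keeping subtrees contiguous."""
    units = [[node]] + [
        random_linearization(child, children, rng) for child in children[node]
    ]
    rng.shuffle(units)
    return [n for unit in units for n in unit]


# Hypothetical toy tree for f(g(x, a), b); node names must be unique.
children = {"f": ["g", "b"], "g": ["x", "a"], "x": [], "a": [], "b": []}
attested_order = ["f", "g", "x", "a", "b"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
baseline = [
    dependency_length(random_linearization("f", children, rng), children)
    for _ in range(1000)
]

attested_dl = dependency_length(attested_order, children)
baseline_mean = sum(baseline) / len(baseline)
print("attested:", attested_dl, "random baseline mean:", baseline_mean)
```

For real code, the `children` mapping would have to be derived from a parser's AST, and deciding which AST node counts as the head of which token is itself a modelling choice.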

The issue is that programming languages rarely allow much freedom in "word order" with respect to a fixed AST: one of the few cases I can think of is ordering of keyword arguments. If DLM happens it is in more subtle ways, like method-chaining and strategic variable assignments. Really the right thing to do would be to compile to some representation of program semantics and compare DL in the attested program to DL in random programs that have the same semantics, but generating those random programs seems like a hard (possibly uncomputable) problem.
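Keyword-argument order is one case where the alternatives can be enumerated mechanically. The sketch below uses Python's own `ast` module (with `ast.unparse`, available from Python 3.9) to list every ordering of a call's keyword arguments. The call `connect(host, port=80, timeout=5)` is a made-up example, and treating the variants as equivalent assumes the argument expressions have no side effects.

```python
import ast
import itertools


def keyword_orderings(call_source):
    """Return the source text of every keyword-argument ordering of one call."""
    call = ast.parse(call_source, mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError("expected a single call expression")
    original_keywords = list(call.keywords)
    variants = []
    for permutation in itertools.permutations(original_keywords):
        call.keywords = list(permutation)  # same AST nodes, new order
        variants.append(ast.unparse(call))
    return variants


variants = keyword_orderings("connect(host, port=80, timeout=5)")
for variant in variants:
    print(variant)
```

Each variant could then be scored with a dependency-length measure and compared with the order the programmer actually wrote.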

rht commented 7 years ago

To your question, given my hazy memory, I can't be sure, but possibly years ago, in a 'class'. And it is actually the else clause, though I had even considered sending an autonomous rover to circumvent the constraint if this is within the confine of laws. Or through fossilized thoughts just like this if this is allowed. As long as the hurdles don't get in the way in the 'stuff'.

Indeed, the expression (and in effect the DL) could be shortened further with function composition:

x.map(h.g.f)
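As a small check that the composed form computes the same thing as the nested maps, here is a sketch in Python. It reads `h.g.f` as Haskell-style composition, where f is applied first. The `compose` helper and the three arithmetic functions are hypothetical stand-ins.

```python
from functools import reduce


def compose(*functions):
    """compose(h, g, f)(v) == h(g(f(v))): the rightmost function runs first."""
    return reduce(lambda outer, inner: lambda v: outer(inner(v)), functions)


def f(v):
    return v + 1


def g(v):
    return v * 2


def h(v):
    return v - 3


x = [1, 2, 3]
nested = list(map(h, map(g, map(f, x))))
composed = list(map(compose(h, g, f), x))
print(nested, composed)  # [1, 3, 5] [1, 3, 5]
```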

The nat lang version would be "electric toads, each, is trimmed, then bathed, then repaired."

> In general, x.f(y) has shorter dependency length from functions to arguments (treating the object instance as an argument) than f(x, y). I think this is part of why LISP is often thought of as hard to understand, and why object-oriented languages with this syntax grew in popularity.

Whoa, a psycholinguistic explanation rather than "it might be a gag order imposed by the corporates, ¯_(ツ)_/¯" or some metaphor. Predating OO, infix notation at least has been used to minimize DL (there are t-expressions too, though they trade parentheses for indentation rather than reordering the words). With AST-DL as a quantitative measure of readability, it might succinctly explain why GOTO is considered harmful.

For the PL treebanks, some specimens would be http://www.cs.ru.nl/~freek/100/, and the standard libraries of various languages (e.g. linalg, http, fs io ...).

For both the PL treebanks and the generated ones, I see the complication posed by the limited word-order freedom in most PLs (why is that the case?). The remaining testable hypothesis is whether recent code has shorter DL than older code, which could be why you brought up the usage frequency of function-as-a-prefix versus method invocation. Furthermore, I think the decreasing-DL trend could be checked quantitatively on natural language as well.