AyushmanTripathy / pipe-script

A programming language that revolves around piping
MIT License
8 stars 0 forks source link

Future? Roadmap? #6

Open dumblob opened 2 years ago

dumblob commented 2 years ago

I wonder what the future and the roadmap looks like?

I have also some (very) crazy ideas but I'd first like to learn about the goals and roadmap if any :wink:.

AyushmanTripathy commented 2 years ago

Sorry for the delay, i had 11th exams.

AyushmanTripathy commented 2 years ago

not any real roadmap, i am still fixing bugs and issues. if you have any ideas, lets do it

dumblob commented 2 years ago

Sorry for the delay, i had 11th exams.

Take your time, no need to hurry.

if you have any ideas, lets do it

I'm afraid my ideas are both sharply incompatible with certain goals of pipescript (judging based on readme and examples) but 100% compatible with certain other goals from the same readme and examples :sweat:.

So I'm not sure how you'll perceive the following largely unexplored ideas:

  1. create a comprehensive language for data manipulation (I'm deliberately not saying a programming language because that has some prejudicial scent to it - i.e. no stop-the-world before function entrance and resume after return, but rather focus on data manipulation in time without stopping the world)
  2. feedback loops (basically arbitrary cycles - feed a (part of) output back as (part of) input)
  3. splitting & muxing (both with and without breaking the full ordering)
  4. merging & demuxing (both with and without breaking the full ordering)
  5. changing frequency
  6. reshaping data samples
  7. (recursive) windowing/buffering & stepped/window-delimited processing (to break full ordering into partial ordering and back)
  8. synchronous (blocking) joining/merging/demuxing
  9. asynchronous (non-blocking) joining/merging/demuxing
  10. recoding data samples (useful for integrations, compatibility with outer world, performance optimization, etc.)
  11. runtime arbitrary creation & deletion of pipe_instances/pipelines based on runtime inputs
  12. full AOT type safety (incl. null/none/nil safety) - this is vague but basically ensuring all inputs of all commands in pipelines have known format at the latest at the very moment of invoking the application
  13. pipe-based-feedback-loop error handling (i.e. no "side channels" - no exceptions, no effects, nothing like that but just pipes - any error shall be treated as just another regular pipe)
  14. push/pull semantics choice anywhere in the graph
  15. pipes in pipes (i.e. quasi-recursive embedding of infinite streams)
  16. pipes can be heterogeneous (or "trunks" of homogeneous pipes?)
  17. ...

This has many use cases as you know - from easy data exploration through easy writing of true apps up to having no-effort UIs (implicit insertion of | prettyprint in nushell, https://github.com/calebwin/stdg etc.). And the best of all - it inherently supports manycore parallelism (see https://binpa.sh/docs/tutorial/ )! Yeah - a (first ever?) low-effort low-code fully parallel data manipulation (programming?) paradigm!

Crazy, isn't it?

I also have some basics of the corresponding syntax, but that's not important compared to having a clear mutual agreement for goals :wink:.

AyushmanTripathy commented 2 years ago

i don't know much about data manipulation ( splitting & muxing these are going over my head ) but i am happy to learn. it will be very good for pipe-script project to have any real goal.

dumblob commented 2 years ago

Ok, if you didn't set after any firm goals yet, then I'd be more than happy to provide some input for an interesting technology and also a thesis (I have experience in mentoring university students as an external/non-university person).

What I'm describing is a bit like creating a streaming data computation graph using a simple syntax.

ad 2) I want it to have general computational power. Any signal processing needs feedback, any simulation of digital HW (e.g. circuits from logic gates) needs feedback, etc.

I have some very shallow thoughts of how to handle the complexity this might lead to. So I'm aware of certain edge cases (most of which should be "solved" by supporting both sync & async computation).

ad 3 & 4) Here I'd like to achieve "partial ordering". Imagine you have an expensive computation you want to run on many cores but you have only one input. Then I'd like to have an option to use the input as a common queue for work-stealing - each core would steal the current sample/item (or batch of those for speed) from the queue. Assume that each core does a chain of operations (not only one) and that these operations might again need to "split" across several other cores (i.e. nested "splitting"). Then I'd like to assemble the output from all these cores again to one queue. Here I'd like to have 2 options - either assemble them as they come (in "random" order) or assemble them in the original order (e.g. based on nested tagging done when splitting).

ad 5) I'm imagining this in a discrete domain (i.e. no continuous/smooth function but rather a step function). Thus defining it as a relative measure to other streams (i.e. closer to how network traffic shaping is being done - i.e. by delaying some samples based on given measure to achieve the desired "average" throughput).

ad 6) This is one of many potential (conceptual) solutions to the problem of outputting more than one stream from a command - merge of two streams of 64bit ints could e.g. result in a stream of structures containing two 64bit ints (and padding if needed).

Reshaping would then either add or remove or swap or rename "fields" (actually representing streams at one point in time) in such a structure. Reshaping shall ideally encompass both physical representation (imagine a C struct with fields and padding and alignment) and logical representation (names of fields, their order if any, value constraints over all fields together - please do not confuse with types as types do specify constraints only over each field separately and actually are usually very limited).

For synchronous computation, it's clear how to populate such struct. For async I'd simply take the currently newly computed value for a given field and then took old most recent values of all other fields to populate the struct. But again, this is just a crazy not very thought-through idea (it actually emerged from the need to allow to run all commands in parallel - therefore so much copying).

ad 8 & 9) This probably deserves more explanation due to narrow semantics of this topic in most programming languages. What I mean is "true" synchronicity and asynchronicity as seen in physical world (though in real world there seems to be everything async if not talking about things like quantum entanglement etc., but usually we define things to happen synchronously if the measured difference of absolute time is less than x - it's easier to understand if the time is discrete (a counter) like in CPUs).

Also this seems to be one of the rather major differences compared to many (all I know so far) "pipe languages".

I imagine that I could specify a set of subgraphs where each subgraph would either await each other to provide one new sample as a result of their computation (this is synchronous) or to not await each other to provide one new sample and just proceed further (this is asynchronous).

I have some very preliminary syntax ideas for this (unrelated to pipe-script and thus incompatible with it). But the main point is the semantics as described.

ad 10) Assuming our language will deliberately not specify how the streams will be implemented on inside, there has to be a way to read (as a source) & write (as a sink) data in a specific format. If we had reshaping as suggested in (6), this wouldn't be a problem as reshaping covers the physical representation (endianess, sizes, paddings, signedness, etc.). But as I wrote, it's just an idea and something else (better) might be used. In which case we'll need an explicit generic way how to e.g. read & write serialized data (e.g. in JSON or BSON or FlatBuffers or any other of those gazillions of existing & future formats).

ad 11) Here I'd actually be fine even with a more constrained variation - like requiring to know in compile time how the to-be-deleted or to-be-added graph will look like and only care about it's (de)instantiation in run time.

ad 13) My idea is to forget completely about "errors being errors" but simply treat any error from a given instance of a command as yet another full-featured output queue (i.e. the command would be seen as multiple-output "subgraph").

ad 14) This is merely a note, that e.g. in case the graph(s) should be deployed as a distributed computing system (cores connected with some type of unreliable network), then there might be the need to specify which of the two commands with pipe in between shall (re)initiate the communication.

ad 15) I didn't think of all the edge cases but my initial idea was rather syntax-oriented than semantics. Because I see "streams of streams" as simply a "trunk" or "set" of streams but emphasizing that a command (disregarding whether source or sink or both (aka filter)) has the liberty to "open" and/or "close" any new or existing stream any time. In other words, this point merely holistically reiterates on many of the other points above :wink:.

ad 16) This "heterogeneous or not" is partially a philosophical question, partially an implementation detail and partially an important decision affecting how the pipe language shall be structured :wink:.

It's just another of the possible solutions to multiple outputs from one command (in addition to the "struct" concept explained in the context of reshaping in (6) ). We could actually put samples representing different outputs into one stream but in a serial manner instead of "next to each other as described in (6)".

This way any command would first need to read the sample and dynamically in run time determine its "type" (incl. length/size, padding, etc.) because generally it wouldn't be easily possible to determine this with some clever analysis up front in compile time. This might have some performance influence - probably less copying than the "struct" approach but also more processing due to a "switch/branch" inside of nearly every single command. Again, this is not thought-through and all rather preliminary...

Any comments, ideas, thoughts very much welcome and appreciated!


My highly preliminary syntax for such a language should be closer to this than to others:

$fizzbuzz { gen (-999 999 1) filter (x >= 1 && x <= 100)
  @ branch (x % 3 == 0 && x % 5 == 0) write (stdout FizzBuzz)  # FizzBuzz alias got stringified in `write` due to "disguising" as described below
  @ else
    @ branch (x % 5) write (stdout Buzz)
    @ else
      @ branch (x % 3) write (stdout Fizz)
      @ else trash
}  # defines & immediately executes the block with no arguments
fizzbuzz  # executes the same block again

{ args filter ... ;; log (howk) } (just three arguments)  # define a block which accepts
                                                          # some streams as arguments, then execute
                                                          # this block with arguments `just` `three` `arguments`
                                                          # (note ;; is syntax sugar to delimit independent
                                                          # subgraphs executed in parallel)
{ args filter ... } ()  # this will compile fine
{ args filter ... }  # same as previous line (but with syntax sugar allowing to omit the empty list)

Some underlying ideas:

  1. All "commands" are aliases by default (there is no way to express a "real unaliased command"). Most commands should work transparently with streams from real unaliased commands as well as from aliases though. There are special commands working only with aliases and their results (if any) are "injected" into the given lexical scope. Any subgraph containing a command working on aliases is executed in compile time. After that, what's left will be lowered to "real commands" eventually (actually it'll get lowered to a special object with defined "disguising" behavior allowing for write (stdout FizzBuzz) instead of rigidly requiring e.g. string quoting write (stdout "FizzBuzz") whereas the lookup for command of the given name has the highest precedence and this lookup can be disabled with escaping - I didn't decide how to escape it but probably with ` or : or ; or \ as a prefix) and compiled to a binary.

  2. It is a "free form syntax" (see Arturo, Basil, Red, Rebol, Ren-C, Spry, ...) delimited by ASCII space character thus there are no rules regarding other white space characters, newlines, etc. and there are no true lexical separators/delimiters (though there is one lexical separator ;; as syntax sugar). So no ;, no Python-like indentation, nothing.

  3. Pipes between commands are implicit (producer filter consumer is a pipeline akin to producer | filter | consumer in POSIX shell). Except for the ( ... ) block which does not chain commands this way but rather plumbs the outputs from each listed command as input to the command right before the opening parenthesis. Note if the listed command is not a command, it'll get converted to an anonymous one based on the lowering ("disguising") rules - thus this language does not have any constants nor variables nor anything similar because everything will end up being a command, i.e. stream of samples).

  4. There are several types of blocks with different meanings:

    1. standalone asynchronous block { ... } where all "lose" ends of subgraphs/chains will be "merged" together asynchronously (i.e. an outgoing sample from a subgraph will not wait for a sample from another subgraph from the same block both are lexically situated in but rather immediately outputed as a new sample of the whole async block)
    2. standalone synchronous block < ... > where all "lose" ends of subgraphs/chains will be "merged" together synchronously (i.e. an outgoing sample from a subgraph will wait for a sample from all other subgraphs from the same block they are situated in and first then the synchronized "multisample" will be outputted as a new sample of the whole sync block)
    3. additional async/sync block ( ... ) as a suffix of any command
    4. standalone explicit pipe [ ... ] (note, this one might change or even get removed...) - useful for specifying properties of the pipe itself (e.g. to get an unordered pipe allowing the parallel scheduler runtime to decide to run many instances of the consumer of the pipe instead of just one instance in case the pipe has to stay ordered)
  5. Prefixes in command names designate special treatment:

    1. $ in $fizzbuzz means "also give label fizzbuzz to the following async/sync block" to allow its reuse by refering to it by label
    2. #{ ... } or #[ ... ] or #( ... ) or #< ... > (or escape character from (1) followed by any of these block delimiters) block as "anonymous block" (with recursion support) serving as multiline comment and thanks to $mylabel syntax referencable as a stream of one sample (thus allowing to evade escaping of complex content). Note this one might change or be removed in the future (if a better solution will be found). I'm considering also supporting multiple such "data blocks" for data - e.g. string - interpolation instead of baking this into strings (delimited by double and single quotes with the same meaning: "mystr" or 'mystr').
  6. @myid is an optional suffix of explicit pipes denoting the instance of the given pipe (note a pipe here is MPMC /multiple producer multiple consumer/ queue under the hood). If not used as suffix but standalone, it refers to the instance of the given pipe (useful for plumbing additional outputs to it or reading from it from other places - note this one might change or even get removed).

  7. @@ as standalone token is kind of a special case of (6) denoting an "anonymous instance of the pipe" serving as an anchor denoting a place of N-ary duplication of the sample (aka "stream splitting with copy semantics"). This pipe instance can be referenced by a single standalone @ character only from the same lexical scope (not from any higher nor deeper).

    E.g. producer @@ @ branch (x ;< 5) trash @ else trash (here ; is the escape character as discussed in (1) because I find it slightly less visually distracting than e.g. \).

    Note @@ can be left out as syntax sugar if there is no ambiguity: producer @ branch (x ;< 5) trash @ else trash.

    Muxing is thus achieved through the combination of @ and branch command (which has a mandatory counterpart command else to make each branching exhaustive).

  8. Instead of specifying or defining the structure of the samples (which is expected to be very hairy) everywhere, there will only be a concept of "refering to equality or similarity" - i.e. "my input is like output of anotherBlockLabel" instead of "my input has this extremely hairy long definition structure with all details like padding alignment etc". Refering to other commands like this shall support also composition and removal of arbitrary parts. This all will probably require having some dummy built-in commands offering inputs/outputs just for the sake of being able to refer to them, but I'm convinced it's worth it.

  9. Lexical scoping works from bottom to top and if not found from top to bottom. The lexically closest has precedence. Enforcing is possible with fully qualified path denoting all lexical scopes from top to bottom (the "path" delimiter could be / or : or such).

  10. Importing a library should be just another command working on aliases. Note, importing shall be well thought through because everything gets executed by default, so importing a "full library" would mean executing every single subgraph in the library which is presumably not what we want.

  11. There are many (10+ as of now) other unresolved and problematic topics & cases not covered in this crash course. I didn't want to pollute this discussion, so I left them out for now.

I know there are numerous inconsistencies and open questions in this syntax, so feel free to speak up (and ideally offer potential solutions :wink:). Key takeaways: everything is a stream (even if many have only one indefinitely repeating sample), everything is inherently parallel, there is a separate compile-time evaluation of subgraphs and separate run-time evaluation of subgraphs, cycles in subgraphs are allowed (necessary).

Let me know what you think, I'm all ears!

AyushmanTripathy commented 2 years ago

future goals for now then

  1. async code execution
  2. muti threading
  3. import / export support

did i get it right? sorry that you have to explain topics many times

dumblob commented 2 years ago

What I described is a standalone new language unrelated to your current pipe-script. I'm not familiar enough with your pipe-script to assess whether what I'm describing is applicable to your design (actually I'd even dare to guess it's not applicable :cry:).

So the question rather is whether you want to just take some of the ideas and implement them in pipe-script as a new feature. Or if you say "oh, my pipe-script is not well designed for these ideas, so I'll rather start from scratch and maybe even give a shot to the syntax you @dumblob proposed" :wink:.

Which of these 2 option would you choose?

AyushmanTripathy commented 2 years ago

yeah you are right pipe-script is not designed for this, lets start form scratch. 👍

dumblob commented 2 years ago

Ok, will you create a new repository so that I can subscribe and closely follow your development?

AyushmanTripathy commented 2 years ago

here you go

dumblob commented 2 years ago

I now wanted to showcase the discussion we had in the new repo to someone but found out it doesn't exist any more. Did you make the repo non-public or delete it completely? Do you have any archive of the discussion of ours?

AyushmanTripathy commented 2 years ago

sorry, but i have deleted the repo.

dumblob commented 2 years ago

That's a pity - some of your and mine work is gone. So let's take this as a life lesson - never remove data and instead only archive them or hide them.