NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

[enhancement] New category for sourceProvenance: machineGenerated #335487

Open AndersonTorres opened 3 weeks ago

AndersonTorres commented 3 weeks ago

While building Dorion from source (#265771), minified JS code is downloaded as an input source.

It is not code written by a human being. It is not meant to be read by human beings (I got headaches trying, believe me).

However, it is not binary machine code or bytecode either.

I suggest this new source type: machineGenerated.

Well, the name can be a bit misleading, given that bytecode is machine-generated too. I am open to suggestions!

pbsds commented 3 weeks ago

To introduce such a source type, we need to clearly specify when it applies and when it does not, i.e. when binaryBytecode should be used instead.

Things I believe it should apply to:

Things I believe it should not apply to:

AndersonTorres commented 3 weeks ago

Well, I believe binaryBytecode applies precisely to cases like the JVM, in which the code is a soup of bytes to be read by a virtual machine that does not correspond to a real-world computer. In this sense, strangely, binary files to be executed by Knuth's MMIX are binaryBytecode (since no one is crazy enough to implement it in hardware).

On the other hand, files from IOCCC are fromSource regardless of their (lack of) readability.

Would splitting this into two categories, {readable,unreadable}MachineGenerated, be a good idea?

Further, there is at least one good reason to have the machineGenerated class: the Bootstrappable Project does not like machine-generated code like Haskell-to-C.

pbsds commented 3 weeks ago

I don't think readable/unreadable is a good distinction. What I hoped to illustrate is that what we do and do not consider bytecode is kinda arbitrary.

Consider CPython bytecode:

>>> def hello(a, b): return a + b
>>> hello.__code__.co_code
b'\x97\x00|\x00|\x01z\x00\x00\x00S\x00'
>>> import dis
>>> dis.dis(hello)
  1           0 RESUME                   0
              2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP                0 (+)
             10 RETURN_VALUE

In its binary form we consider it bytecode, but in its disassembled form one might very well consider it machine-generated assembly. This bytecode is designed to run in the Python runtime. Minified JS, or maybe even JSFuck, is designed to run in the JavaScript runtime. Whether the runtime uses a recursive parser or just a simple switch-case lookup is really an implementation detail. binaryBytecode is machine-generated code that requires a runtime or VM to run, as opposed to binaryNativeCode, which is inherently platform-dependent.
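To make the "same program, different encodings" point concrete, here is a minimal sketch that looks at one function three ways: as raw bytecode, as decoded instruction names, and as compact source text that the runtime compiles on the fly. The one-line source string and the "<minified>" filename are illustrative stand-ins for minified code, not anything taken from Dorion, and the exact opcode names vary between CPython versions.

```python
import dis


def hello(a, b):
    return a + b


# View 1: the raw bytes that the CPython interpreter loop consumes.
raw = hello.__code__.co_code

# View 2: the same instructions decoded into mnemonic names.
opnames = [ins.opname for ins in dis.get_instructions(hello)]

# View 3: compact source text compiled at runtime, roughly the path that
# minified JS takes inside a JS engine (illustrative stand-in only).
namespace = {}
exec(compile("def hello(a,b):return a+b", "<minified>", "exec"), namespace)
from_text = namespace["hello"]

print(type(raw).__name__, len(raw))
print(opnames)
print(hello(1, 2), from_text(1, 2))
```

All three forms need the Python runtime to do anything, which is the sense in which the text/binary boundary looks like an implementation detail here.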

emilazy commented 2 weeks ago

I don’t think there’s any real difference between binary native code and a machine-generated pile of C except that you can open one of them in a text editor. The difference with binary byte code is, I guess, that it is expected to run on a VM that may or may not have some kind of sandboxing? But I don’t really know why sourceProvenance is so elaborate, or what use the distinction would be to people; to me it’s just from source, or not from source.

(But I agree that we need some way to represent this.)
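For reference, one possible reading of the positions so far can be sketched as a small decision function. This is only an illustration: the Artifact flags and the order of the checks are assumptions made for the sketch, machineGenerated is the type proposed in this issue rather than an existing one, and pbsds's argument would instead fold the is_text branch into binaryBytecode.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    FROM_SOURCE = "fromSource"
    BINARY_NATIVE_CODE = "binaryNativeCode"
    BINARY_BYTECODE = "binaryBytecode"
    MACHINE_GENERATED = "machineGenerated"  # proposed here, not an existing type


@dataclass(frozen=True)
class Artifact:
    """Hypothetical description of one input file (names are assumptions)."""
    human_written: bool
    is_text: bool
    needs_runtime: bool


def classify(a: Artifact) -> Provenance:
    if a.human_written:
        # IOCCC entries land here regardless of readability.
        return Provenance.FROM_SOURCE
    if a.is_text:
        # Minified JS or generated C: text, but not written by a person.
        return Provenance.MACHINE_GENERATED
    if a.needs_runtime:
        return Provenance.BINARY_BYTECODE
    return Provenance.BINARY_NATIVE_CODE


def is_source(p: Provenance) -> bool:
    # The two-way view: from source, or not from source.
    return p is Provenance.FROM_SOURCE


minified_js = Artifact(human_written=False, is_text=True, needs_runtime=True)
print(classify(minified_js).value, is_source(classify(minified_js)))
```

Under the two-way view, only is_source matters, and the finer categories are extra detail on the "not from source" side.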