dgkf / R

An experimental reimagining of R
https://dgkf.github.io/R
GNU General Public License v3.0
135 stars 6 forks source link

Stricter recycling rules #98

Open sebffischer opened 6 months ago

sebffischer commented 6 months ago

Currently, the code below works:

> c(1, 2) * c(1, 2, 3)
[1] 1 4 3

I think this should be stricter and at least throw a warning (but I would probably even tend to an error) when the shorter length does not divide the longer length.

sebffischer commented 6 months ago

Maybe one wants to be even stricter, i.e. in a operation x + y, either length(x) == length(y) or length(x) == 1 | length(y) == 1. This would be in line with the broadcasting rules implemented in numpy or pytorch.

A strong argument for being stricter is that the recycling rules for vectors are then more consistent with the recycling rules for n-dimensional arrays. For example, it should also not be possible to multiply two matrices with shapes (2, 3) and (2, 4).

dgkf commented 6 months ago

Great catch! I remember thinking about this when implementing the current recycling... and promptly forgetting about it.

This style works great for objects that have a known length... which for now should account for all objects. But my goal is that objects eventually are calculated lazily, and possibly even from generators where the length is unknown.

All operators are currently set up with this laziness in mind. They are implemented on any representation that implements IntoIterator, with the goal that eventually a "vector" will be able to be represented as a generator.

That said, we can add something that is like... try_check_recycle_size_compatibility that checks for the size when both Iterator representations have a size hint, throwing an error if both sizes are known but incompatible, but otherwise silently continues with iterators of unknown size until both iterators are exhausted - which point we will know their size and can raise an appropriate error.

sebffischer commented 6 months ago

One thing that one also has to pay attention to is to make the lazy evaluation work correctly with error handling.

In the code below (assuming stricter recycling rules), the multiplication x * y should throw an error. However, because y is created from masking 1:3 it should have unknown size (unless we want to calculate the sum of the logical vector, which I guess we don't). This means, that when calling tryCatch(), the length of y is unknown and one cannot perform compliance checks with the recycling rules when constructing z. Only later, when mean() will be called on z, would we notice that something went wrong. However, the stop() call comes before that, so the error handling code would never be executed.

y <- (1:3)[c(TRUE, TRUE, TRUE)]
x <- 1:10
z <- tryCatch(x * y,  # this should throw
  error = function(e) {
     messagef("Catch all them errors")
     NA
  }
)

stop("wupsie")

mean(z)
sebffischer commented 3 months ago

I have started to think about this again and there is another relevant case that is also relevant to error-handling. This is mostly related to your suggestion from above:

That said, we can add something that is like... try_check_recycle_size_compatibility that checks for the size when both Iterator representations have a size hint, throwing an error if both sizes are known but incompatible, but otherwise silently continues with iterators of unknown size until both iterators are exhausted - which point we will know their size and can raise an appropriate error.

The one possible problem from above regarding the tryCatch() can in principle be solved for forcing the evaluation of the expression x * y, the same problem will also arise in other contexts I believe.

Let's say we do something like

print(1:3[ [true, true, true] ] + 1:2[ [true, true]])

With the recycling rules that I suggested, this should throw an error, as the lengths of the left and right subsets are 3 and 2. What I believe should happen here eventually is that

  1. 1:3 and 1:2 are represented using RepType::Iterator
  2. Subsetting the Iterator using the Mask, should adapt those iterators akin to a rust .filter(), creating two other RepType::Iterator representations.
  3. Adding the two filtered iterators produces the final iterator
  4. The print() call then iterates over the final iterator and starts printing [1] 1 4 but only then do we notice that an error should be thrown.

The whole problem becomes worse if we - instead of printing the final vector - subset the first element as shown below:

(1:3[ [true, true, true] ] + 1:2[ [true, true]])[1]

instead of printing the first two elements and then throwing an error, this operation will never notice that the addition of the two masked vectors is an invalid operation according to the stricter recycling rules, because it will only yield the first element of the final iterator.

So if we want both the stricter recycling rules and usage of RepType::Iterator, I think we always need to verify the validity of a vector representation (such as RepType::Iterator) when creating the representation, not later when yielding its elements.

sebffischer commented 2 months ago

@dgkf Do you agree that these examples seem to make it necessary to always have a known length? I would maybe try to work on that next