eduardoleao052 / js-pytorch

A JavaScript library like PyTorch, built from scratch.
MIT License
907 stars 33 forks source link

Define Multidimensional Array Type #18

Open pparke opened 3 months ago

pparke commented 3 months ago

It seems that most of the anys are due to passing around arrays of various shapes. I think we can narrow down the type to either 1 or 2 dimensional arrays of numbers. Something like MNArray = number[] | number[][] might do. Are there any situations where arrays would be more deeply nested?

I'm curious if the arrays also need to hold class instances, there are several places where there is a check like typeof a === "object" but it mostly looks like this is being used to check if a is an array? If it is a check for an array it would be better to use Array.isArray(a).

I could make a PR for those changes if my assumptions turn out to be correct. Just let me know.

eduardoleao052 commented 3 months ago

Good points. So, in order:

Thanks for pointing it out!

medic-code commented 3 months ago
  1. Yep - sounds like we should have this as a type not just used in utils (maybe worth redefining the file name), I had a sense this type may be more broadly used.
  2. Agree on `Array.isArray(a), the narrower type guard seems sensible.
mgcrea commented 3 months ago

Would be best to support an arbitrary number of dimensions using generics, a bit like what @tensorflow/tfjs-core does with a generic tensor + aliases.

/** @doclink Tensor */
export type Scalar = Tensor<Rank.R0>;
/** @doclink Tensor */
export type Tensor1D = Tensor<Rank.R1>;
/** @doclink Tensor */
export type Tensor2D = Tensor<Rank.R2>;

Another useful Typescript helper type I use often when doing ml in js is the Tuple type:

That lets you properly cast tuples with an arbitrary dimension, eg. Tuple<number, 4> => [number, number, number, number] vs. a less useful number[].

export type Tuple<T, N extends number> = N extends N
  ? number extends N
    ? T[]
    : _TupleOf<T, N, []>
  : never;

type _TupleOf<T, N extends number, R extends unknown[]> = R["length"] extends N
  ? R
  : _TupleOf<T, N, [T, ...R]>;

Hope it's not off-topic! Pretty excited to see where this goes!

Will be happy to test/bench this against @u4/opencv4nodejs or @tensorflow/tfjs-node that I both currently use on a project.

eduardoleao052 commented 3 months ago

@mgcrea That's true, Tuples are a really useful data type in ML. Thanks for pointing it out! About the arbitrary dimension number, it could also work, but for normal ML and NLP applications, I believe we would need to go to at least Rank 4.

pparke commented 3 months ago

@mgcrea @eduardoleao052 I made a first attempt at adding those types in this PR https://github.com/eduardoleao052/js-torch/pull/21 there are still a few type errors left that require some careful thought. The added if statements could be replaced with assureArray if it is possible to update that so it doesn't return null but I don't know if some of the library logic is dependant on it returning null when null is passed in.

eduardoleao052 commented 3 months ago

I'll take a look and let you know, but pretty sure it's not necessary. Thanks again!

kgryte commented 3 months ago

Rather than defining your own multi-dimensional array type, one option you could consider is directly using, and building on, existing primitives already available in other libraries. E.g., @stdlib/ndarray. In principle, you could subclass and append various operators/methods as you do now, but you'd also allow for interoperation with other libraries in the ecosystem (e.g., all of @stdlib/stdlib). At least for CPU-based computation, that would allow you to leverage (when evaluated in eager mode) stdlib's ndarray kernels both in JavaScript and, in theory, C, through stdlib's native bindings.

eduardoleao052 commented 3 months ago

@kgryte Thanks for the tip! Really appreciate it. I'm just initially against adding bindings and necessary dependencies, especially as the library is currently completely independent and "from scratch". However, it is still possible to draw inspiration from the implementations on ndarray. Even though the applications are different and a lot of changes must be made, There is still much value in learning from their implementation. I'll surely take a look. Thanks again for the tip!

kgryte commented 3 months ago

@eduardoleao052 Fair enough. If there isn't the goal of ecosystem interoperation, then creating a custom type for use in a walled garden is certainly sensible. This stated, if interoperation is a goal, then conforming to existing data structure interfaces is easier to do at the beginning, rather than much farther down the line when significant refactorings would then be required. I should note that standardization and interoperation is the principle goal of the Consortium for Python Data API Standards, of which PyTorch is part. And there, we've intentionally standardized a minimal array type, similar to what is present in @stdlib/stdlib, in order to reduce ecosystem fragmentation.

eduardoleao052 commented 3 months ago

@kgryte that makes sense. I think interoperation could bring immense benefits to the library and its usability. I'll study in the next few days how I could do that in a way that works with the other libraries in the ecosystem! Thanks for the tip!