aplbrain / npyjs

Read numpy .npy files in JavaScript
https://aplbrain.github.io/npyjs/
Apache License 2.0

Does not work with structured arrays #42

Open jeffpeck10x opened 8 months ago

jeffpeck10x commented 8 months ago

.npy files can contain structured arrays, as described here, but such files fail to open with the error:

Uncaught SyntaxError: Unexpected token ( in JSON at position 28

For example, create the structured array in Python:

np.save('test/out.npy', np.array([('Rex', 9, 81.0), ('Fido', 3, 27.0)],
             dtype=[('name', 'U10'), ('age', 'i4'), ('weight', 'f4')]))

Serve the out.npy file:

cd test
npx serve

Try to load that out.npy file:

> np.load("http://localhost:3000/out.npy")
Promise {
  <pending>,
  [Symbol(async_id_symbol)]: 4454,
  [Symbol(trigger_async_id_symbol)]: 5
}
> Uncaught SyntaxError: Unexpected token ( in JSON at position 28
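The root cause is that the .npy header is a Python dict literal, not JSON: a structured dtype serializes descr as a list of tuples, and the tuple parentheses remain invalid JSON even after quote substitution. A minimal sketch of the failure, assuming the library normalizes single quotes to double quotes before calling JSON.parse:

```javascript
// The .npy header for the structured array above looks roughly like this:
// a Python dict literal, with descr as a list of tuples.
const header =
  "{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], " +
  "'fortran_order': False, 'shape': (2,), }";

// Even after swapping single quotes for double quotes, the "(" of the
// first tuple is still invalid JSON, so JSON.parse throws a SyntaxError.
let error = null;
try {
  JSON.parse(header.replace(/'/g, '"'));
} catch (e) {
  error = e;
}
console.log(error instanceof SyntaxError); // true
```

So a robust fix needs a small hand-rolled header parser rather than JSON.parse.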
jeffpeck10x commented 8 months ago

Here is a rough idea of how to parse a structured array. It diverges a bit from how the library works, but this stand-alone example worked with a structured array I was using. Maybe this will be helpful for others.


// Note: numeric dtypes only; string fields such as 'U10' are not handled here.
const dtypes = {
  u1: {
    bytesPerElement: Uint8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint8'
  },
  u2: {
    bytesPerElement: Uint16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint16'
  },
  i1: {
    bytesPerElement: Int8Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt8'
  },
  i2: {
    bytesPerElement: Int16Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt16'
  },
  u4: {
    bytesPerElement: Uint32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getUint32'
  },
  i4: {
    bytesPerElement: Int32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getInt32'
  },
  u8: {
    bytesPerElement: BigUint64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigUint64'
  },
  i8: {
    bytesPerElement: BigInt64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getBigInt64'
  },
  f4: {
    bytesPerElement: Float32Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat32'
  },
  f8: {
    bytesPerElement: Float64Array.BYTES_PER_ELEMENT,
    dvFnName: 'getFloat64'
  },
};

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  // Assumes .npy format version 1.0 (2-byte little-endian header length).
  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      bytesPerElement: dtypes[dtype].bytesPerElement,
      dvFn: (...args) => dv[dtypes[dtype].dvFnName](...args),
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  const stride = columns
    .map((c) => c.bytesPerElement)
    .reduce((sum, numBytes) => sum + numBytes, 0);

  const data = [];
  let i, c, offset, dataIdx, row;
  const numColumns = columns.length;
  for (i = 0; i < numRows; i++) {
    offset = 0;
    row = {};
    dataIdx = offsetBytes + i * stride;
    for (c = 0; c < numColumns; c++) {
      row[columns[c].columnName] = columns[c].dvFn(dataIdx + offset, columns[c].littleEndian);
      offset += columns[c].bytesPerElement;
    }
    data.push(row);
  }

  return data;
}
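To make the stride arithmetic in parse() concrete, here is a self-contained sketch of the same technique on a hand-built buffer (a hypothetical two-column record layout, no .npy header): one DataView, a fixed row stride, and per-column byte offsets within each row.

```javascript
// Two packed records of (int32 age, float32 weight): stride = 4 + 4 bytes.
const numRows = 2;
const stride = 8;
const buf = new ArrayBuffer(numRows * stride);
const dv = new DataView(buf);

// Write rows (9, 81.0) and (3, 27.0), little-endian.
dv.setInt32(0, 9, true);
dv.setFloat32(4, 81.0, true);
dv.setInt32(8, 3, true);
dv.setFloat32(12, 27.0, true);

// Read back row by row: base offset is i * stride, plus the column's
// offset within the row.
const rows = [];
for (let i = 0; i < numRows; i++) {
  rows.push({
    age: dv.getInt32(i * stride, true),
    weight: dv.getFloat32(i * stride + 4, true),
  });
}
console.log(rows); // [{ age: 9, weight: 81 }, { age: 3, weight: 27 }]
```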
j6k4m8 commented 7 months ago

Wow thank you for the thoughtful comments @jeffpeck10x!! Would you be interested in turning this into a PR? Otherwise I'm happy to incorporate these suggestions in my next dev push :)

jeffpeck10x commented 7 months ago

Thanks @j6k4m8. I don't think I have the time to generalize this into a more complete approach. I did take this a little farther, though I ultimately ended up using Apache Arrow for my use case. That said, here is some helpful code to get this started.

The following function should reliably extract columns and numRows from the header:

function parseHeaderContents(hcontents) {
  const [, descr, fortranOrder, shape] = hcontents.match(
    /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
  );

  // Assumes a dtypes map like the one above, extended with a TypedArray
  // constructor for each dtype (e.g. i4: { ..., TypedArray: Int32Array }).
  const columns = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
    ([, columnName, endianess, dtype]) => ({
      columnName,
      littleEndian: endianess === "<",
      dvFnName: dtypes[dtype].dvFnName,
      TypedArray: dtypes[dtype].TypedArray,
    })
  );

  const numRows = Number(shape.match(/\((\d+),\)/)[1]);

  return { columns, numRows };
}
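For reference, the two regexes above can be exercised in isolation against a header like the one NumPy writes for the example dtype (the header string here is illustrative):

```javascript
// An illustrative 1-D structured-array header, as NumPy writes it.
const hcontents =
  "{'descr': [('name', '<U10'), ('age', '<i4'), ('weight', '<f4')], " +
  "'fortran_order': False, 'shape': (2,), }";

// Split the dict into its three values.
const [, descr, fortranOrder, shape] = hcontents.match(
  /{'descr': (.*), 'fortran_order': (.*), 'shape': (.*), }/
);

// Pull (name, endianness, dtype) out of each tuple in descr.
const fields = [...descr.matchAll(/\('([^']+)', '([\|<>])([^']+)'\)/g)].map(
  ([, name, endian, dtype]) => ({ name, littleEndian: endian === "<", dtype })
);

// A 1-D shape like (2,) yields the row count.
const numRows = Number(shape.match(/\((\d+),\)/)[1]);

console.log(fields.map((f) => f.name)); // ['name', 'age', 'weight']
console.log(numRows); // 2
```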

I ended up writing the following parse function to get at the data when I was experimenting:

function parse(arrayBufferContents) {
  const dv = new DataView(arrayBufferContents);

  const headerLength = dv.getUint16(8, true);
  const offsetBytes = 10 + headerLength;

  const hcontents = new TextDecoder("utf-8").decode(
    new Uint8Array(arrayBufferContents.slice(10, 10 + headerLength))
  );

  const { columns, numRows } = parseHeaderContents(hcontents);
  const columnNames = columns.map((c) => c.columnName);
  const offsets = columns
    .map((c) => c.TypedArray.BYTES_PER_ELEMENT)
    .reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
  const stride = offsets.pop();
  const dvFnNames = columns.map((c) => c.dvFnName);
  const dataViews = offsets.map(
    (offset) => new DataView(arrayBufferContents, offsetBytes + offset)
  );
  const dataViewGetters = dataViews.map((dv, i) => dv[dvFnNames[i]].bind(dv));
  const data = columnNames.reduce(
    (obj, columnName, i) => ({
      ...obj,
      [columnName]: new columns[i].TypedArray(numRows),
    }),
    {}
  );

  for (let j = 0; j < columnNames.length; j++) {
    const columnName = columnNames[j];
    const getter = dataViewGetters[j];
    const column = data[columnName];

    for (let i = 0; i < numRows; i++) {
      // Assumes little-endian data; honor the header's byte order if needed.
      column[i] = getter(i * stride, true);
    }
  }

  return data;
}
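The offsets/stride computation above is just a prefix sum over per-column byte widths. A standalone sketch with hypothetical column widths:

```javascript
// Byte widths for three hypothetical columns, e.g. i4, f4, f8.
const widths = [4, 4, 8];

// Prefix-sum trick from parse() above: seed with 0, append each running
// total, then pop the final total off as the row stride.
const offsets = widths.reduce((arr, v, i) => [...arr, v + arr[i]], [0]);
const stride = offsets.pop();

console.log(offsets); // [0, 4, 8]
console.log(stride);  // 16
```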

I think there are ideas from that which can be used to generalize the approach.

jeffpeck10x commented 7 months ago

I know this all goes in a slightly different direction than your library, but using your library as the base gave me a really good head start, even just to realize that I could use DataViews and offsets, and where to find the metadata in the header. So, if any of this helps improve the versatility of your library, great! Full circle. Don't feel any pressure to do this, though. For my particular use case, as mentioned, I am all set with Apache Arrow IPC (I ultimately needed to send this data between two processes, so it was more suitable anyway).