aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

[HuggingFace] `torch_neuronx.trace` exports `torch.int64` tensors as `NRT_DTYPE_INT32` with `dim * 2` #785

Open OlivierDehaene opened 11 months ago

OlivierDehaene commented 11 months ago

Hello

There is a bug in `torch_neuronx.trace`: `torch.int64` tensors are exported as `NRT_DTYPE_INT32` with the last dimension doubled (`dim * 2`) in the NEFF file (for example, a `[2x512]` `torch.int64` tensor is exported as a `[2x1024]` `NRT_DTYPE_INT32` tensor).

Here is an example:

test.py:

import torch
import torch_neuronx

def fn(a):
    return a * 2

examples_inputs = torch.randint(0, 32, (1, 32), dtype=torch.int64)

trace = torch_neuronx.trace(fn, examples_inputs, compiler_workdir="./")

test.c

#include <stdbool.h>
#include <nrt/nrt.h>
#include <nrt/nrt_experimental.h>

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

#define P_ERR(...) fprintf(stderr, __VA_ARGS__)

#define CHECK_RESULT(res, expected, ...)    \
    if (res != expected) {                  \
        fprintf(stderr, __VA_ARGS__);       \
        exit(-1);                           \
    }

// Function to mmap a file in the application's memory space,
// it will return a pointer to the mmapped memory and the size
// of the mmapped data will be written to *size
void *mmap_file(const char *filepath, size_t *size) {
    struct stat sb;
    int fd = open(filepath, O_RDONLY);
    if (fd < 0 || fstat(fd, &sb) != 0) {
        fprintf(stderr, "Unable to open %s: %s\n", filepath, strerror(errno));
        return MAP_FAILED;
    }
    *size = sb.st_size;
    return mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
}

void iterate_tensors(nrt_tensor_info_array_t *info_array) {
    int tensor_idx;
    nrt_tensor_info_t *tensor_info = NULL;

    for (tensor_idx = 0; tensor_idx < info_array->tensor_count; tensor_idx++) {
        tensor_info = &info_array->tensor_array[tensor_idx];
        printf("NRT DType %u\n", tensor_info->dtype);

        int shape_idx;
        uint32_t s;
        printf("Shape: ");
        for (shape_idx = 0; shape_idx < tensor_info->ndim; shape_idx++) {
            s = tensor_info->shape[shape_idx];
            printf("%u\t", s);
        }
        printf("\n");
    }
}

int main(int argc, char *argv[]) {
    NRT_STATUS result;
    void *neff_data = NULL;
    size_t neff_size = 0;

    nrt_model_t *model = NULL;
    nrt_tensor_info_array_t *tensor_info_array = NULL;

    // Guard against a missing NEFF path argument before dereferencing argv[1]
    if (argc < 2) {
        fprintf(stderr, "Usage: %s <path to NEFF file>\n", argv[0]);
        exit(-1);
    }

    // Try mmapping the NEFF file first, so we can fail fast if not found or
    // mmap fails
    neff_data = mmap_file(argv[1], &neff_size);
    if (neff_data == MAP_FAILED) {
        fprintf(stderr, "Unable to map file %s\n", argv[1]);
        exit(-1);
    }

    // Before calling any nrt API, nrt_init must be called
    // Since this is not running as part of a framework, the correct parameter for 'framework' is
    // NRT_FRAMEWORK_TYPE_NO_FW and the others can be empty strings
    result = nrt_init(NRT_FRAMEWORK_TYPE_NO_FW, "", "");
    CHECK_RESULT(result, NRT_SUCCESS, "NRTLIB could not be initialized, error: %d\n", (int)result);

    // Loading the NEFF
    printf("Loading NEFF\n");
    result = nrt_load(neff_data, neff_size, -1, -1, &model);
    CHECK_RESULT(result, NRT_SUCCESS, "Unable to load NEFF\n");

    // In order to allocate tensors, first we need to call nrt_get_model_tensor_info which
    // will give us the model tensors' names and sizes in tensor_info_array
    printf("Getting IO tensor information\n");
    result = nrt_get_model_tensor_info(model, &tensor_info_array);
    CHECK_RESULT(result, NRT_SUCCESS, "Unable to get model tensor information\n");

    iterate_tensors(tensor_info_array);

    // Unloading the model
    result = nrt_unload(model);
    if (result != NRT_SUCCESS) {
        P_ERR("Unable to unload NEFF\n");
    }

    printf("Deallocating model tensor info\n");
    // We are done with the tensor_info_array, we can dispose of it
    nrt_free_model_tensor_info(tensor_info_array);

    // Clean-up the runtime
    printf("Cleaning up the runtime\n");
    nrt_close();

    printf("DONE\n");
}
python test.py

gcc test.c -o test_neff -lnrt  -I/opt/aws/neuron/include -L/opt/aws/neuron/lib
./test_neff graph.neff

# Loading NEFF
# Getting IO tensor information
# NRT DType 8
# Shape: 1  64
# NRT DType 8
# Shape: 1  64
# Deallocating model tensor info
# Cleaning up the runtime
# DONE

The DType should be 10 (`NRT_DTYPE_INT64`) and the Shape should be 1x32.
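
As a side note, here is a minimal sketch (plain PyTorch, nothing Neuron-specific, so this only illustrates the arithmetic) showing that a [1x32] int64 tensor and a [1x64] int32 tensor occupy the same 256 bytes, which is presumably why reporting int32 with the last dimension doubled keeps the buffer size unchanged:

import torch

# Illustration only: the same 256-byte buffer can be described either as a
# [1x32] int64 tensor or as a [1x64] int32 tensor.
a = torch.randint(0, 32, (1, 32), dtype=torch.int64)
b = a.view(torch.int32)                           # reinterpret the same bytes
print(a.shape, a.nelement() * a.element_size())   # torch.Size([1, 32]) 256
print(b.shape, b.nelement() * b.element_size())   # torch.Size([1, 64]) 256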

aws-rhsoln commented 11 months ago

We do typecast int64 tensors to int32. This happens in the compiler, which treats int64 datatypes as int32 by truncating the high-order bits (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/compiler/neuronx-cc/index.html?highlight=int64#known-issues). This is because the hardware does not support int64 datatypes natively.
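
A plain-PyTorch sketch of what that truncation means for values (this is just an illustration of the downcast, not Neuron-specific behavior): only the low 32 bits are kept, so anything outside the int32 range is silently corrupted.

import torch

# Truncating the high-order bits of an int64: values that fit in 32 bits are
# preserved, larger ones wrap around.
x = torch.tensor([1, 2**31 - 1, 2**31, 2**40 + 5], dtype=torch.int64)
print(x.to(torch.int32))  # 1 and 2147483647 survive; the last two values do not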

OlivierDehaene commented 11 months ago

I see. Maybe a warning during compilation could be added?

As an SDK user, this still feels like a bug.

Having dtype int32 + [1x64] is error prone and does not give the NEFF user any indication that the tensor is not a "true" int32 but an int64 camouflaged as an int32.

The compiler saying that it cast to int32 while not really doing so (because the overall size is the same) is very weird.
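
In case it helps anyone consuming the NEFF directly: a hypothetical host-side workaround, assuming the buffer really does still hold the original little-endian int64 payload (which is what the unchanged size suggests, but I have not confirmed), is to view the reported int32 buffer as int64, which also folds the doubled last dimension back to its original size.

import numpy as np

# Hypothetical workaround for a raw NRT consumer: reinterpret the int32 bytes
# as int64 if you know the tensor was int64 in the original model.
raw = np.zeros((1, 64), dtype=np.int32)   # shape/dtype as reported by the NEFF
restored = raw.view(np.int64)             # shape (1, 32)
print(restored.shape, restored.dtype)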