Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.
Other
307 stars 90 forks source link

ISSUE: ERROR when using shared buffer and lock function #1743

Open ngdymx opened 2 months ago

ngdymx commented 2 months ago

Hi team,

I am working on using aie.use_lock to control the write and read sequence for sharing a buffer between two ComputeTiles. However, I’ve encountered an issue that I’m unsure how to resolve.

Could you please assist me with this? I would greatly appreciate your help.

I have configured the ComputeTiles (1, 4) and (0, 4) as follows: the Add one function is implemented on tile (1, 4), while the passthrough function is placed on tile (0, 4). The attached figure illustrates the dataflow diagram for this setup.

image

I mimic the aie.mlir under the folder /mlir-aie/mlir_tutorials/tutorial-3 to use the aie.use_lock.

%core_1_4 = aie.core(%tile_1_4) {      
      // while (1)                                                                                                                                      
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1  {         
      // for i in range(4):                                                                                                                                                                 
             scf.for %arg1 = %c0_0 to %c4 step %c1_1{                                                                                                                                 
                  %0 = aie.objectfifo.acquire @in(Consume, 1) : !aie.objectfifosubview<memref<128xi32>>                                                                                  
                  %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>     
                  //                                                             
                  aie.use_lock(%lock04, "Acquire", 0)                                                                                                                                                                                                                                                           
                  func.call @addOne(%1, %shareBuffer, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()                                                                         
                  aie.use_lock(%lock04, "Release", 1)     
                  //                                                                                                                               
                  aie.objectfifo.release @in(Consume, 1)                                                                                                                                 
           }                                                                                                                                                                         
    } 

%core_0_4 = aie.core(%tile_0_4) {   
     // while (1)                                                                                                                                     
     scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {    
     // for i in range(4):                                                                
           scf.for %arg1 = %c0_0 to %c4 step %c1_1{                                                                                                                                 
                 %0 = aie.objectfifo.acquire @out(Produce, 1) : !aie.objectfifosubview<memref<128xi32>>                                                                                 
                 %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>   
                 //
                 aie.use_lock(%lock04, "Acquire", 1)                                                                                                                                    
                 memref.copy %shareBuffer, %1 : memref<128xi32> to memref<128xi32>                                                                                                                  
                 aie.use_lock(%lock04, "Release", 0)  
                 //                                                                                                                                  
                 aie.objectfifo.release @out(Produce, 1)                                                                                                                                                                                                                                                                                                                
           }                                                                                                                                                                        
   }

The input data is:

image

The output I am expecting to see is as follows:

image

The output I got:

image

After obtaining this result, I modified thepassthrough function on tile (0, 4) to an add_i function, where i is the index parameter of the for loop, the range of i is 1 to 4 step 1. This change was made to test the running time of tile (0, 4).

image

The result I got:

image

I have confirmed that the running time for tile (0, 4) is 4, which matches our expectations.

Then, I modified theadd one function on tile (1, 4) to an add_i function to test the running time of tile (1, 4).

image

The result I got:

image

In this case, the running time for tile (1, 4) appears to occur only once. However, since tile (0, 4) operates successfully, the aie.use_lock acquire and release operations on tile (1, 4) should also function correctly. This suggests that the lock acquire/release on tile (1, 4) are actually ran for four times, but the function (kernel) is only being called once.

Would you be able to help me look into this matter? I would greatly appreciate it.

The following the main code:

Function code:

#define NOCPP

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <type_traits>

#include <aie_api/aie.hpp>

template <typename T, int N>
__attribute__((noinline)) void add_one_aie(T *restrict in, T *restrict out, const int32_t size) {
    event0();
  aie::vector<T, N> In;
  aie::vector<T, N> Out;
  aie::vector<T, N> ones = aie::broadcast<T, N>(1);

  for (int j = 0; j < size; j += N) 
    chess_prepare_for_pipelining chess_loop_range(6, ) { 
        In = aie::load_v<N>(in);
        in += N;
        Out = aie::add(In, ones);
        aie::store_v(out, Out);
        out += N;        
    }
  event1();
}

template <typename T, int N>
__attribute__((noinline)) void addi_aie(T *restrict in, T *restrict out, const int32_t size) {
    static int i = 1;
event0();
  aie::vector<T, N> In0;
  aie::vector<T, N> Out;
  aie::vector<T, N> iones = aie::broadcast<T, N>(i);
  aie::vector<T, N> ones = aie::broadcast<T, N>(1);
  aie::vector<T, N> zeros = aie::broadcast<T, N>(0);

  for (int j = 0; j < size; j += N) 
    chess_prepare_for_pipelining chess_loop_range(6, ) { 
        In0 = aie::load_v<N>(in);
        in += N;
        Out = aie::add(In0, iones);
        aie::store_v(out, Out);
        out += N;        
    }
    i = i + 1;
event1();
}

extern "C" {

void addi(int32_t *in, int32_t *out, int32_t tilesize) {
  addi_aie<int32_t, 16>(in, out, tilesize);
}

void addOne(int32_t *in, int32_t *out, int32_t tilesize) {
  add_one_aie<int32_t, 16>(in, out, tilesize);
}

} 

aie.mlir

module {
  aie.device(npu1_2col) {
    func.func private @addi(memref<128xi32>, memref<128xi32>, i32)
    func.func private @addOne(memref<128xi32>, memref<128xi32>, i32)
    %tile_0_0 = aie.tile(0, 0)
    %tile_1_0 = aie.tile(1, 0)
    %tile_0_4 = aie.tile(0, 4)
    %tile_1_4 = aie.tile(1, 4)
    aie.objectfifo @in(%tile_1_0, {%tile_1_4}, 2 : i32) : !aie.objectfifo<memref<128xi32>>
    aie.objectfifo @out(%tile_0_4, {%tile_0_0}, 2 : i32) : !aie.objectfifo<memref<128xi32>>
    %shareBuffer = aie.buffer(%tile_0_4) {sym_name = "shareBuffer"} : memref<128xi32> 
    %lock04 = aie.lock(%tile_0_4, 1) { sym_name = "lock_a04" }

    %core_1_4 = aie.core(%tile_1_4) {
      %c0 = arith.constant 0 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      %c1 = arith.constant 1 : index
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {
        %c0_0 = arith.constant 0 : index
        %c4 = arith.constant 4 : index
        %c1_1 = arith.constant 1 : index
        scf.for %arg1 = %c0_0 to %c4 step %c1_1{
          %0 = aie.objectfifo.acquire @in(Consume, 1) : !aie.objectfifosubview<memref<128xi32>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>
          aie.use_lock(%lock04, "Acquire", 0)
          %c128_i32 = arith.constant 128 : i32
          func.call @addi(%1, %shareBuffer, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()
          aie.use_lock(%lock04, "Release", 1)
          aie.objectfifo.release @in(Consume, 1)
       }
      }
      aie.end
    } {link_with = "passThrough.o"}
    %core_0_4 = aie.core(%tile_0_4) {
      %c0 = arith.constant 0 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      %c1 = arith.constant 1 : index
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {
        %c0_0 = arith.constant 0 : index
        %c4 = arith.constant 4 : index
        %c1_1 = arith.constant 1 : index
        scf.for %arg1 = %c0_0 to %c4 step %c1_1{
          %0 = aie.objectfifo.acquire @out(Produce, 1) : !aie.objectfifosubview<memref<128xi32>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>
          %c128_i32 = arith.constant 128 : i32
          aie.use_lock(%lock04, "Acquire", 1)
          memref.copy %shareBuffer, %1 : memref<128xi32> to memref<128xi32>
          //func.call @addi(%shareBuffer, %1, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()
          aie.use_lock(%lock04, "Release", 0)
          aie.objectfifo.release @out(Produce, 1)

        }
      }
      aie.end
    } {link_with = "passThrough.o"}
    aiex.runtime_sequence(%arg0: memref<512xi32>, %arg1: memref<512xi32>) {
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 1, 512][0, 0, 0, 1]) {id = 0 : i64, metadata = @out} : memref<512xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 1, 512][0, 0, 0, 1]) {id = 1 : i64, metadata = @in} : memref<512xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
    }
  }
}

Host:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>

#include "test_utils.h"
#include "xrt/xrt_bo.h"

#ifndef DATATYPES_USING_DEFINED
#define DATATYPES_USING_DEFINED
// ------------------------------------------------------
// Configure this to match your buffer data type
// ------------------------------------------------------
// using DATATYPE = std::uint8_t;
// using DATATYPE = std::uint32_t;
using DATATYPE = std::int32_t;
#endif

namespace po = boost::program_options;

int main(int argc, const char *argv[]) {

  // Program arguments parsing
  po::options_description desc("Allowed options");
  po::variables_map vm;
  test_utils::add_default_options(desc);

  test_utils::parse_options(argc, argv, desc, vm);
  int verbosity = vm["verbosity"].as<int>();
  int trace_size = vm["trace_sz"].as<int>();

  constexpr bool VERIFY = true;
  constexpr int IN_VOLUME = 512;
  constexpr int OUT_VOLUME = IN_VOLUME;

  int IN_SIZE = IN_VOLUME * sizeof(DATATYPE);
  int OUT_SIZE = OUT_VOLUME * sizeof(DATATYPE);

  // Load instruction sequence
  std::vector<uint32_t> instr_v =
      test_utils::load_instr_sequence(vm["instr"].as<std::string>());

  if (verbosity >= 1)
    std::cout << "Sequence instr count: " << instr_v.size() << "\n";

  // Start the XRT context and load the kernel
  xrt::device device;
  xrt::kernel kernel;

  test_utils::init_xrt_load_kernel(device, kernel, verbosity,
                                   vm["xclbin"].as<std::string>(),
                                   vm["kernel"].as<std::string>());

  // set up the buffer objects
  auto bo_instr = xrt::bo(device, instr_v.size() * sizeof(int),
                          XCL_BO_FLAGS_CACHEABLE, kernel.group_id(1));
  auto bo_inA =
      xrt::bo(device, IN_SIZE, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
  auto bo_outC =
      xrt::bo(device, OUT_SIZE, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4));

  if (verbosity >= 1)
    std::cout << "Writing data into buffer objects.\n";

  // Copy instruction stream to xrt buffer object
  void *bufInstr = bo_instr.map<void *>();
  memcpy(bufInstr, instr_v.data(), instr_v.size() * sizeof(int));

  // Initialize buffer bo_inA
  DATATYPE *bufInA = bo_inA.map<DATATYPE *>();
  for (int i = 0; i < IN_VOLUME; i++)
    bufInA[i] = i + 1;

// print input
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t test = bufInA[i];
    printf("%d ", test);
    if (i % 32 == 31){
        printf("\n");
    }

  }
  // Zero out buffer bo_outC
  DATATYPE *bufOut = bo_outC.map<DATATYPE *>();
  memset(bufOut, 0, OUT_SIZE);

  // sync host to device memories
  bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_inA.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_outC.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  float npu_time_total = 0;
  // Execute the kernel and wait to finish
  if (verbosity >= 1)
    std::cout << "Running Kernel.\n";
  auto start = std::chrono::high_resolution_clock::now();
  unsigned int opcode = 3;
  auto run =
      kernel(opcode, bo_instr, instr_v.size(), bo_inA, bo_outC);
  run.wait();
   auto stop = std::chrono::high_resolution_clock::now();

  // Sync device to host memories
  bo_outC.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
  float npu_time =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                    .count();

  npu_time_total += npu_time;
  std::cout << std::endl
           << "Avg NPU time: " << npu_time_total << "us."
           << std::endl;

  // Compare out to golden
  int errors = 0;
  if (verbosity >= 1) {
    std::cout << "Verifying results ..." << std::endl;
  }
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t ref = bufInA[i] + 1;
    int32_t test = bufOut[i];
    if (test != ref) {
      if (verbosity >= 1)
        std::cout << "Error in output " << test << " != " << ref << std::endl;
      errors++;
    } else {
      if (verbosity >= 1)
        std::cout << "Correct output " << test << " == " << ref << std::endl;
    }
  }
//Print out
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t test = bufOut[i];
    printf("%d ", test);
    if (i % 32 == 31){
        printf("\n");
    }

  }

  // Print Pass/Fail result of our test
  if (!errors) {
    std::cout << std::endl << "PASS!" << std::endl << std::endl;
    return 0;
  } else {
    std::cout << std::endl
              << errors << " mismatches." << std::endl
              << std::endl;
    std::cout << std::endl << "fail." << std::endl << std::endl;
    return 1;
  }
}
#
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# Copyright (C) 2024, Advanced Micro Devices, Inc.

SRCDIR := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

VITIS_ROOT ?= $(shell realpath $(dir $(shell which vitis))/../)
VITIS_AIETOOLS_DIR ?= ${VITIS_ROOT}/aietools
VITIS_AIE_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/versal_prod/lib
VITIS_AIE2_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/aie_ml/lib

CHESSCC1_FLAGS = -f -p me -P ${VITIS_AIE_INCLUDE_DIR} -I ${VITIS_AIETOOLS_DIR}/include
CHESSCC2_FLAGS = -f -p me -P ${VITIS_AIE2_INCLUDE_DIR} -I ${VITIS_AIETOOLS_DIR}/include -D__AIENGINE__=2 -D__AIEARCH__=20
CHESS_FLAGS = -P ${VITIS_AIE_INCLUDE_DIR}

CHESSCCWRAP1_FLAGS = aie -I ${VITIS_AIETOOLS_DIR}/include 
CHESSCCWRAP2_FLAGS = aie2 -I ${VITIS_AIETOOLS_DIR}/include 

trace_size = 8192

HOST_O_DIR := build/host
HOST_C_TARGET := host.exe

KERNEL_O_DIR := build/bitstream
KERNEL_SRCS := $(wildcard $(SRCDIR)/kernel/*.cc)
KERNEL_OBJS := $(patsubst $(SRCDIR)/kernel/%.cc, ${KERNEL_O_DIR}/%.o, $(KERNEL_SRCS))
KERNEL_HEADERS := $(wildcard $(SRCDIR)/kernel/*.h)

MLIR_O_DIR := kernel
MLIR_TARGET := ${MLIR_O_DIR}/aie.mlir

BITSTREAM_O_DIR := build/bitstream
XCLBIN_TARGET := ${BITSTREAM_O_DIR}/final.xclbin
INSTS_TARGET := ${BITSTREAM_O_DIR}/insts.txt

.PHONY: all kernel link bitstream host clean
all: ${XCLBIN_TARGET} ${INSTS_TARGET} ${HOST_C_TARGET}

clean:
    -@rm -rf build 
    -@rm -rf log

kernel: ${KERNEL_OBJS}

link: ${MLIR_TARGET}

bitstream: ${XCLBIN_TARGET}

host: ${HOST_C_TARGET}

# Build host
${HOST_C_TARGET}: ${SRCDIR}/host/host.cpp 
    rm -rf ${HOST_O_DIR}
    mkdir -p ${HOST_O_DIR}
    cd ${HOST_O_DIR} && cmake -E env CXXFLAGS="-std=c++23 -ggdb" cmake ../.. -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${HOST_C_TARGET} -Dsubdir=${subdir}
    cd ${HOST_O_DIR} && cmake --build . --config Release
    cp ${HOST_O_DIR}/${HOST_C_TARGET} ./

# Build kernels
${KERNEL_O_DIR}/%.o: ${SRCDIR}/kernel/%.cc ${KERNEL_HEADERS}
    mkdir -p ${@D}
    cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -DINT8_ACT -c $< -o ${@F}

# Build bitstream
${XCLBIN_TARGET}: ${MLIR_TARGET} 
    mkdir -p ${@D}
    cd ${BITSTREAM_O_DIR} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
        --aie-generate-npu --npu-insts-name=${INSTS_TARGET:${BITSTREAM_O_DIR}/%=%} $(<:${MLIR_O_DIR}/%=../../kernel/%) 
    #cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
    #           --aie-generate-npu --npu-insts-name=insts.txt $(<:%=../%)

.PHONY: run
run: ${HOST_C_TARGET} ${XCLBIN_TARGET} ${INSTS_TARGET} #sign
    export XRT_HACK_UNSECURE_LOADING_XCLBIN=1 && \
    ./$< -x ${XCLBIN_TARGET} -i ${INSTS_TARGET} -k MLIR_AIE

trace: ${HOST_C_TARGET} ${XCLBIN_TARGET} ${INSTS_TARGET} # sign
    export XRT_HACK_UNSECURE_LOADING_XCLBIN=1 && \
    ./$< -x ${XCLBIN_TARGET} -i ${INSTS_TARGET} -k MLIR_AIE -t ${trace_size}
    ./parse_trace.py --filename trace.txt --mlir ${MLIR_TARGET} --colshift 1 > trace_mm.json

run_py: ${XCLBIN_TARGET} ${INSTS_TARGET} ${SRCDIR}/host/test.py
    python3 ${SRCDIR}/host/test.py -x ${<} -i ${INSTS_TARGET} -k MLIR_AIE
AndraBisca commented 6 days ago

Hey @ngdymx, thank you for filing this issue! Your testing so far is great to help us isolate the problem.

I am suspicious that the lock pattern you tried only works for AIE1 architecture devices and not AIE2 architecture ones. Could you try instead of using a single lock for your shared buffer to use two locks: a producer lock which is initialized with 1, and a consumer lock initialized with 0. The use_lock calls will need to be updated as well.

Here is how that would look:

% lock_a04_prod = aie.lock(%tile_0_4, 0) {init = 1 : i32, sym_name = "lock_a04_prod"}
% lock_a04_cons = aie.lock(%tile_0_4, 1) {init = 0 : i32, sym_name = "lock_a04_cons"}

// on %tile_1_4:
aie.use_lock(%lock_a04_prod, AcquireGreaterEqual, 1)
// function call
aie.use_lock(%lock_a04_cons, Release, 1)

// on %tile_1_4:
aie.use_lock(%lock_a04_cons, AcquireGreaterEqual, 1)
// function call
aie.use_lock(%lock_a04_prod, Release, 1)
AndraBisca commented 6 days ago

After going over the mlir-tutorials, I realize that only AIE1 locks are documented. I will take an action item to update them.

In the meantime, you can double check your lock logic by studying the output of the objectfifo lowering pass, which you can call with: aie-opt --aie-objectFifo-stateful-transform <name of your mlir file> You can also comb through some of the already existing mlir examples in test/objectFifo-stateful-transform/

It can be a little bit tedious but on small examples it can be insightful.