enfiskutensykkel / ssd-gpu-dma

Build userspace NVMe drivers and storage applications with CUDA support
BSD 2-Clause "Simplified" License

Separating SQ, CQ, and PRP List Memories #24

Closed ZaidQureshi closed 5 years ago

ZaidQureshi commented 5 years ago

So in ./benchmarks/cuda/queue.cu, I am trying to use separate allocations and DMA regions for the SQ, CQ, and PRP list by doing something like the following:

__host__ DmaPtr prepareQueuePair(QueuePair& qp, const Controller& ctrl, const Settings& settings, uint16_t id)
{
    printf("Creating QP %u\n", (unsigned int) id);
    //size_t queueMemSize = 1024 * sizeof(nvm_cmd_t) + 1024 * sizeof(nvm_cpl_t);
    size_t sq_size = 1024 * sizeof(nvm_cmd_t);
    size_t cq_size = 1024 * sizeof(nvm_cpl_t);

    size_t prpListSize = ctrl.info.page_size * settings.numThreads * (settings.doubleBuffered + 1);

    auto sq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(sq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto cq_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(cq_size, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    auto prp_list_mem = createDma(ctrl.ctrl, NVM_PAGE_ALIGN(prpListSize, 1UL << 16), settings.cudaDevice, settings.adapter, settings.segmentId);
    // Set members
    qp.pageSize = ctrl.info.page_size;
    qp.blockSize = ctrl.ns.lba_data_size;
    qp.nvmNamespace = ctrl.ns.ns_id;
    qp.pagesPerChunk = settings.numPages;
    qp.doubleBuffered = settings.doubleBuffered;

    qp.prpList = NVM_DMA_OFFSET(prp_list_mem, 0);
    qp.prpListIoAddr = prp_list_mem->ioaddrs[0];

    // Create completion queue
    int status = nvm_admin_cq_create(ctrl.aq_ref, &qp.cq, id, cq_mem->vaddr, cq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create completion queue: ") + nvm_strerror(status));
    }
    printf("CQ MAX_ENTRIES: %u\n", (unsigned int) qp.cq.max_entries);
    // Get a valid device pointer for CQ doorbell
    void* devicePtr = nullptr;
    cudaError_t err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.cq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.cq.db = (volatile uint32_t*) devicePtr;

    // Create submission queue
    status = nvm_admin_sq_create(ctrl.aq_ref, &qp.sq, &qp.cq, id, NVM_DMA_OFFSET(sq_mem, 0), sq_mem->ioaddrs[0]);
    if (!nvm_ok(status))
    {
        throw error(string("Failed to create submission queue: ") + nvm_strerror(status));
    }
    printf("SQ MAX_ENTRIES: %u\n", (unsigned int) qp.sq.max_entries);
    // Get a valid device pointer for SQ doorbell
    err = cudaHostGetDevicePointer(&devicePtr, (void*) qp.sq.db, 0);
    if (err != cudaSuccess)
    {
        throw error(string("Failed to get device pointer") + cudaGetErrorString(err));
    }
    qp.sq.db = (volatile uint32_t*) devicePtr;

    return NULL;
}

All of these allocations seem to be fine. However, when the GPU threads try to write to the submission queue entry in prepareChunk with *cmd = local;, threads hit illegal memory addresses when they try to write the last 4 bytes of the 64-byte command entry. Am I doing something stupid? I have already tested 1024 entries in the command and completion queues using the original code, so I know that part is fine. I just want to separate the memories for the two queues to avoid any errors.

enfiskutensykkel commented 5 years ago

How are you returning the different memory allocations from the function? It seems that sq_mem, cq_mem, and prp_list_mem will be freed (they go out of scope), and prepareQueuePair() returns NULL, which leaves the pointers set on qp stale.

I apologise, the code isn't very straightforward and it's an unholy mix of C and C++ ways of doing things. IIRC, in the original code I make sure that the qmem that is returned from prepareQueuePair() doesn't go out of scope.
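
One way to keep them alive, sketched below with a hypothetical QueuePairMemory struct (not part of the repo), is to bundle the three DmaPtr handles and return them, so the caller owns the memory for as long as the queue pair is in use:

// Hypothetical sketch: bundle the three shared DMA handles so the
// caller keeps them alive together with the QueuePair.
struct QueuePairMemory
{
    DmaPtr sq_mem;       // submission queue memory
    DmaPtr cq_mem;       // completion queue memory
    DmaPtr prp_list_mem; // PRP list memory
};

__host__ QueuePairMemory prepareQueuePair(QueuePair& qp, const Controller& ctrl, const Settings& settings, uint16_t id)
{
    // ... same allocations and queue creation as in the snippet above ...

    // Returning the shared handles (instead of NULL) keeps the DMA
    // mappings alive; the pointers stored on qp stay valid for as long
    // as the caller holds the returned struct.
    return QueuePairMemory{sq_mem, cq_mem, prp_list_mem};
}

The caller then holds the returned struct alongside qp, the same way the original code keeps the single qmem handle that prepareQueuePair() returns.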

ZaidQureshi commented 5 years ago

I see, ok I can fix that. Thanks.