CrayLabs / SmartSim

SmartSim Infrastructure Library.
BSD 2-Clause "Simplified" License

'SmartRedis::RuntimeException' error encountered due to memory layout being contiguous #421

Closed syedalihasany closed 9 months ago

syedalihasany commented 10 months ago

Description

I am running a C++ program that sends input tensors of size 1000 by 6 to a PyTorch model using SmartSim and retrieves output tensors of size 1000 by 1. I initialize the SmartSim/SmartRedis Orchestrator from the Python interpreter and then execute the C++ binary, at which point I get the following error message:

terminate called after throwing an instance of 'SmartRedis::RuntimeException'
  what():  The destination memory space dimension vector should only be of size one if the memory layout is contiguous.
Aborted (core dumped)

How to reproduce

My C++ code snippet initializing the client (and putting tensors) is shown below. I think the issue could be resolved by changing the SRMemLayoutContiguous parameter in the client.put_tensor and client.unpack_tensor calls, but I don't know how: should I exclude this parameter entirely or change it to something else?

    // Initialize a SmartRedis client
    bool cluster_mode = false; // Set to false if not using a clustered database
    SmartRedis::Client client(cluster_mode, __FILE__);
    std::cout<<"set client"<<std::endl;

    // Use the client to set a model in the database from a file
    std::string model_key = "ali_model";
    std::string model_file = "../ali_model_scripted.pt";
    std::cout<<"USING CPU"<<std::endl;
    client.set_model_from_file(model_key, model_file, "TORCH", "CPU", 1000); // the last parameter is the batch size; should we pass this as 100k?
    std::cout<<"set model"<<std::endl;

    // Declare keys that we will use in forthcoming client commands
    std::string in_key = "input_key";
    std::string out_key = "output_key";

    // Put the tensor into the database that was loaded from file
    client.put_tensor(in_key, input_tensor.data(), dims, SRTensorTypeFloat, SRMemLayoutContiguous);

    // running the model 
    client.run_model(model_key, {in_key}, {out_key});

    // assigning the dimensions of the output tensor to the output_dims vector
    std::vector<size_t> output_dims = {1000, 1};

    std::vector<float> result(1000, 0);
    client.unpack_tensor(out_key, result.data(), output_dims, SRTensorTypeFloat, SRMemLayoutContiguous);

Expected behavior

The code should execute without any errors.

System

billschereriii commented 10 months ago

Hi, thanks for reaching out to us.

I think you need to use the nested layout, not the contiguous layout, for this data. You can update your code as follows:

    // Put the tensor into the database that was loaded from file
    client.put_tensor(in_key, input_tensor.data(), dims, SRTensorTypeFloat, SRMemLayoutNested);

    // running the model 
    client.run_model(model_key, {in_key}, {out_key});

    // assigning the dimensions of the output tensor to the output_dims vector
    std::vector<size_t> output_dims = {1000, 1};

    std::vector<float> result(1000, 0);
    client.unpack_tensor(out_key, result.data(), output_dims, SRTensorTypeFloat, SRMemLayoutNested);

As an aside, if you can tweak your model to output a one-dimensional tensor with dimensions {1000} -- rather than a two-dimensional tensor with dimensions {1000, 1} -- you can use the contiguous layout with the unpack call, and it will be a bit more efficient.
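For illustration, here is a minimal sketch of what that unpack call could look like, assuming the model has been re-exported to emit a one-dimensional, 1000-element output (client and out_key as in the snippet above):

    // Sketch: with a one-dimensional output, the destination dims vector has
    // exactly one entry (the total element count), so SRMemLayoutContiguous
    // is valid for unpack_tensor().
    std::vector<size_t> output_dims = {1000};
    std::vector<float> result(1000, 0);
    client.unpack_tensor(out_key, result.data(), output_dims, SRTensorTypeFloat, SRMemLayoutContiguous);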

Please let us know if this works out for you! -- Bill

syedalihasany commented 10 months ago

Hi Bill,

I tried this, but now I have run into another issue: my Orchestrator fails to start.

I am using the following Python commands to start the Orchestrator:

import smartsim
import smartredis
from smartredis import Client
from smartsim import Experiment

REDIS_PORT = 6379
exp = Experiment("moving_tensors", launcher="local")
db = exp.create_database(db_nodes=1, port=REDIS_PORT, interface="lo")
exp.generate(db)
exp.start(db)

I get the following error message:

22:38:58 lipc02 SmartSim[276840] INFO Working in previously created experiment
>>> exp.start(db)
22:39:16 lipc02 SmartSim[276840] ERROR Orchestrator failed during startup See /home/bohan/blackscholes/blackscholes/using_smart_reddis/moving_tensors/database for details
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bohan/.local/lib/python3.8/site-packages/smartsim/experiment.py", line 192, in start
    self._control.start(
  File "/home/bohan/.local/lib/python3.8/site-packages/smartsim/_core/control/controller.py", line 90, in start
    self._launch(manifest)
  File "/home/bohan/.local/lib/python3.8/site-packages/smartsim/_core/control/controller.py", line 303, in _launch
    self._launch_orchestrator(orchestrator)
  File "/home/bohan/.local/lib/python3.8/site-packages/smartsim/_core/control/controller.py", line 362, in _launch_orchestrator
    self._orchestrator_launch_wait(orchestrator)
  File "/home/bohan/.local/lib/python3.8/site-packages/smartsim/_core/control/controller.py", line 555, in _orchestrator_launch_wait
    raise SmartSimError(msg)
smartsim.error.errors.SmartSimError: Orchestrator failed during startup See /home/bohan/blackscholes/blackscholes/using_smart_reddis/moving_tensors/database for details

When I run the C++ code (which moves the input tensors to the PyTorch model and retrieves the output tensors), I get the following error message: Segmentation fault (core dumped)

What I am doing is: I start the Orchestrator from Python to handle the in-memory database, then I execute a C++ program which puts the input tensors in memory and runs a JIT-traced Torch model to get the output tensors, which I write to a file. Am I doing something wrong? The C++ code for that is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <fstream>
#include <iostream>
#include <vector>
#include <iomanip>
#include <chrono>
#include <sstream>
// including the redis client header
#include "client.h"

int main(int argc, char* argv[]) {
    std::cout<<"start"<<std::endl;
    // Set environment variables
    setenv("SR_LOG_FILE", "smartredis.log", 1);
    setenv("SR_LOG_LEVEL", "INFO", 1);
    setenv("SSDB", "localhost:6379", 1); // Adjust the address and port as needed

    // Initialize a vector that will hold the input tensor
    size_t n_rows = 1000;
    size_t n_cols = 6;
    size_t n_values = n_rows * n_cols;
    std::vector<float> input_tensor(n_values, 0);
    std::vector<size_t> dims = {1000, 6};

    // Read values from the tab separated input feature file
    std::string input_file = "../input_features.txt";
    std::ifstream file(input_file);
    std::cout<<"after inputs"<<std::endl;
    if (!file.is_open()) {
        std::cerr << "Error opening file: " << input_file << std::endl;
        return 1;
    }

    for (size_t row = 0; row < n_rows; row++) {
        for (size_t col = 0; col < n_cols; col++) {
            float value;
            if (col < n_cols - 1) {
                file >> value;
                file.ignore(1); // Skip the tab character
            } else {
                file >> value;
            }
            input_tensor[row * n_cols + col] = value; // flatten the n_rows x n_cols matrix into a linear buffer
        }
    }

    file.close();

    // Initialize a SmartRedis client
    bool cluster_mode = false; // Set to false if not using a clustered database
    SmartRedis::Client client(cluster_mode, __FILE__);
    std::cout<<"set client"<<std::endl;

    // Use the client to set a model in the database from a file
    std::string model_key = "ali_model";
    std::string model_file = "../ali_model_scripted.pt";
    std::cout<<"USING CPU"<<std::endl;
    client.set_model_from_file(model_key, model_file, "TORCH", "CPU", 1000); // the last parameter is the batch size; should we pass this as 100k?
    std::cout<<"set model"<<std::endl;

    // Declare keys that we will use in forthcoming client commands
    std::string in_key = "input_key";
    std::string out_key = "output_key";

    // Put the tensor into the database that was loaded from file
    client.put_tensor(in_key, input_tensor.data(), dims, SRTensorTypeFloat, SRMemLayoutNested);

    // running the model 
    client.run_model(model_key, {in_key}, {out_key});

    // assigning the dimensions of the output tensor to the output_dims vector
    std::vector<size_t> output_dims = {1000, 1};

    std::vector<float> result(1000, 0);
    client.unpack_tensor(out_key, result.data(), output_dims, SRTensorTypeFloat, SRMemLayoutNested);

    // Create an output file stream
    std::ofstream outputFile("./ali_model_results_using_Cpp_and_Redis.txt");

    if (outputFile.is_open()) {
        for (size_t i = 0; i < result.size(); i++) {
            outputFile << result[i] << std::endl;
        }
        outputFile.close();
    } else {
        std::cerr << "Error: Unable to open the output file." << std::endl;
    }

    return 0;
}
billschereriii commented 10 months ago

Hi, with respect to the Orchestrator failed during startup message, this is likely because an existing Orchestrator (Redis database) is already running and using the port you've requested. Please make sure you have an experiment.stop() line in your Python script so that the existing database gets shut down. You can also kill it manually via the Unix kill command if you can find its PIDs (grep for "redis"), or by issuing the following command from the SmartRedis root (after logging into the node that hosts the Redis database):

$ third-party/redis/src/redis-cli -p 6379 shutdown

As for the segfault, first off, I steered you wrong when I said to mark the call to put_tensor() as a nested layout. Now that I see how you've set up the memory in a single array, contiguous is the way you need to go. In a nested layout, you would have an array of pointers to arrays containing rows of data. The client, when it attempted to dereference those pointers, found that they weren't really pointers, and that's what led to the segmentation fault.
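Concretely, for the flat std::vector<float> buffer built in your program, that correction would look roughly like this (a sketch only, reusing client, in_key, input_tensor, and dims from your listing; only the put_tensor() call is shown):

    // Sketch: the data lives in one flat std::vector<float>, so the source
    // memory layout for put_tensor() is contiguous; dims still describes the
    // logical 1000 x 6 shape of the tensor stored in the database.
    client.put_tensor(in_key, input_tensor.data(), dims, SRTensorTypeFloat, SRMemLayoutContiguous);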

Rather than using the C++ STL vector class for your data, you might be better off using a plain multi-dimensional array. You can see a good example of how to initialize one in the SmartRedis tests; please refer to tests/cpp/client_test_put_get_2D.cpp for the code. If you do switch to a multi-dimensional array of this form, you will need to mark your memory layout as nested for both the put_tensor() and unpack_tensor() calls.
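A rough sketch of that row-pointer approach follows (not taken from the test file, which uses its own allocation helpers; client, model_key, in_key, out_key, n_rows, and n_cols are assumed to be defined as in your program above):

    // Sketch: build a float** whose entries point at individually allocated
    // rows, then use the nested layout for both put_tensor() and unpack_tensor().
    std::vector<size_t> input_dims = {n_rows, n_cols};
    float** input_2d = new float*[n_rows];
    for (size_t i = 0; i < n_rows; i++)
        input_2d[i] = new float[n_cols]; // fill each row with your input data here

    client.put_tensor(in_key, input_2d, input_dims, SRTensorTypeFloat, SRMemLayoutNested);
    client.run_model(model_key, {in_key}, {out_key});

    // The destination for unpack_tensor() is laid out the same way: an array
    // of n_rows pointers, each pointing at a one-element row.
    std::vector<size_t> result_dims = {n_rows, 1};
    float** result_2d = new float*[n_rows];
    for (size_t i = 0; i < n_rows; i++)
        result_2d[i] = new float[1];
    client.unpack_tensor(out_key, result_2d, result_dims, SRTensorTypeFloat, SRMemLayoutNested);

    // Remember to delete[] each row and the outer arrays when finished.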

Please let me know how it goes! -- Bill

billschereriii commented 9 months ago

Hi, I wanted to follow up: are you up and running now?

billschereriii commented 9 months ago

Since we haven't heard back from you, I'm going to assume that you are up and running and that all is going well now. If this isn't the case or if you have further difficulties, please don't hesitate to reach out to us again!