SciSharp / TensorFlow.NET

.NET Standard bindings for Google's TensorFlow for developing, training and deploying Machine Learning models in C# and F#.
https://scisharp.github.io/tensorflow-net-docs
Apache License 2.0
3.23k stars · 518 forks

Why is the .NET version slower than the Python version? #493

Open Mghobadid opened 4 years ago

Mghobadid commented 4 years ago

Hi guys. I have an ssd-light model .pb file that was trained with Python.

In Python (Anaconda) with the CPU version of TensorFlow, CPU usage always stays under 50%, but with TensorFlow.NET on CPU it rises up to 80%.

Where is the problem?

Oceania2018 commented 4 years ago

Can you provide an example? It should not be slower than the Python version.

Mghobadid commented 4 years ago

python version: https://paste.ubuntu.com/p/GrnTgF4T3N/ C# version : https://paste.ubuntu.com/p/68q3vbPGbz/

tcwicks commented 4 years ago

I also noticed this. However, the slowdown is not coming from TensorFlow.NET itself; it is coming from NumSharp and from data preparation costs. If data prep and NumSharp operations run in series with training on a single thread, you are not going to utilize the GPU at 100% of its capability.

Also, for obvious reasons, the .NET implementation in NumSharp cannot do some of the mind_boggle.not_type_safe.cast_pig_into_bird tricks that Python allows, after which debugging any non-trivial Python code becomes brain_fried.developer.kill_me_now.

If you add a System.Diagnostics.Stopwatch around the call to sess.run:

var results = sess.run(outTensorArr, new FeedItem(image_tensor, image_np));

and another one around:

image_np = image_np.reshape(1, frame.shape[0], frame.shape[1], 3);

you will see that a significant amount of time is spent on data preparation — almost equal to the time spent in the actual TensorFlow call. For example: to fully feed (100% CUDA utilization) a variational autoencoder running on a GTX 1080 Ti, the overhead of NumSharp slice, resize, and shuffle operations with a batch size of 64 and a data size of 12288 floats (a 64 x 64 pixel RGB image) easily consumes 100% (all 8 cores) of an i7 4790K overclocked to 4.4 GHz.
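A minimal, self-contained sketch of that kind of instrumentation (DoDataPrep and RunModel are hypothetical stand-ins for the reshape and sess.run calls above, simulated here with Thread.Sleep so the example runs on its own):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

class TimingSketch
{
    // Stand-in for image_np.reshape(...) and related prep work.
    static void DoDataPrep() => Thread.Sleep(10);

    // Stand-in for sess.run(outTensorArr, new FeedItem(...)).
    static void RunModel() => Thread.Sleep(10);

    static void Main()
    {
        var prepWatch = Stopwatch.StartNew();
        DoDataPrep();
        prepWatch.Stop();

        var modelWatch = Stopwatch.StartNew();
        RunModel();
        modelWatch.Stop();

        // Comparing the two timings shows how much of each iteration
        // is prep overhead versus actual TensorFlow work.
        Console.WriteLine($"prep: {prepWatch.ElapsedMilliseconds} ms, model: {modelWatch.ElapsedMilliseconds} ms");
    }
}
```

If prep time is comparable to model time, moving prep onto its own thread (as below) roughly halves the iteration time.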

However, the power of having all of this TensorFlow in .NET lies in utilizing the full power of a programming language like C#. For example, multi-threading anything non-trivial in Python is just a nightmare, and I have previously burned months trying to get it to work decently.

Try moving the data preparation into a separate thread which queues the prepared data into a thread-safe queue. Then, in your training thread, just dequeue the prepared data and feed it into TensorFlow.

This is what I use for a thread-safe queue where one or more threads write to it and one or more separate threads read from it and feed TensorFlow. Note: I've used ReaderWriterLockSlim rather than Monitor or lock because this is called from tight loops and performance is important. Do not use ReaderWriterLock, because that is slower than Monitor or lock. Note: the reason for Queue is that, for what we are doing, it is much faster than List. The only thing faster (and not by much) would be a custom circular buffer array (NDArray[] Buffer).

```csharp
private System.Threading.ReaderWriterLockSlim SyncRootTrainBatch { get; } = new System.Threading.ReaderWriterLockSlim();
private Queue<NDArray> TrainBatch { get; } = new Queue<NDArray>();
protected NDArray TrainBuffer_Get(out bool GotData)
{
    NDArray Result;
    Result = null;
    GotData = false;
    try
    {
        SyncRootTrainBatch.EnterUpgradeableReadLock();
        if (TrainBatch.Count == 0)
        {
            //Silly but NDArray cannot be used with != null operator
            Result = null;
        }
        else
        {
            try
            {
                SyncRootTrainBatch.EnterWriteLock();
                Result = TrainBatch.Dequeue();
                GotData = true; 
            }
            finally
            {
                SyncRootTrainBatch.ExitWriteLock();
            }
        }
    }
    finally
    {
        SyncRootTrainBatch.ExitUpgradeableReadLock();
    }
    return Result;
}
protected void TrainBuffer_Set(NDArray Data)
{
    int NumSamples = Data.shape[0];
    if (NumSamples < 1)
    {
        return;
    }
    try
    {
        SyncRootTrainBatch.EnterWriteLock();
        TrainBatch.Enqueue(Data);
    }
    finally
    {
        SyncRootTrainBatch.ExitWriteLock();
    }
}
protected int TrainBuffer_HasData()
{
    try
    {
        SyncRootTrainBatch.EnterReadLock();
        return TrainBatch.Count;
    }
    finally
    {
        SyncRootTrainBatch.ExitReadLock();
    }
}
```
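(Not from the original post: a more modern alternative is to let the BCL do the locking. BlockingCollection<T> from System.Collections.Concurrent gives the same producer/consumer pattern with a bounded capacity, so the prep threads block automatically instead of Sleep-polling. float[] stands in for NDArray to keep the sketch self-contained.)

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class BufferSketch
{
    static void Main()
    {
        // boundedCapacity plays the role of MaxBufferSize: Add blocks
        // when the buffer is full, so no Sleep-based polling is needed.
        using var trainBatches = new BlockingCollection<float[]>(boundedCapacity: 50);

        var producer = Task.Run(() =>
        {
            for (int i = 0; i < 10; i++)
                trainBatches.Add(new float[] { i });  // data prep would happen here
            trainBatches.CompleteAdding();
        });

        float total = 0;
        // GetConsumingEnumerable blocks until data arrives and exits once
        // CompleteAdding has been called and the buffer is drained.
        foreach (var batch in trainBatches.GetConsumingEnumerable())
            total += batch[0];                        // sess.run(...) would go here

        producer.Wait();
        Console.WriteLine(total);                     // prints 45
    }
}
```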

Your data preparation thread code could be some appropriate variant of the following. Note: this example is for infinite-duration training, and Epochs is used to feed the same set of batches multiple times. Adjust it to fit your scenario.

```csharp
// Data preparation thread example

    int BatchSize;
    List<NDArray> DataBuffer = new List<NDArray>();

    int NumBatches;
    int Epochs; // Set this to whatever is appropriate
    bool HasData;
    NumBatches = 0;
    while (IsTraining)
    {
        try
        {
            int MaxBufferSize;
            MaxBufferSize = Config.ServerTensorFlowThreads * 25;
            if (MaxBufferSize < 50)
            {
                MaxBufferSize = 50;
            }
            while (IsTraining && (TrainBuffer_HasData() > MaxBufferSize))
            {
                System.Threading.Thread.Sleep(1);
            }
            if (IsTraining)
            {
                // Do all your data prep here and add the final NDArray batches to DataBuffer.
                frame = cv2.resize(frame, (800, 600));
                DataBuffer.Add(image_np.reshape(1, frame.shape[0], frame.shape[1], 3));
                // etc...

                for (int I = 0; I < Epochs; I++)
                {
                    foreach (NDArray DataBatch in DataBuffer)
                    {
                        TrainBuffer_Set(DataBatch);
                    }
                }
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex.ToString());
            System.Threading.Thread.Sleep(1000);
        }
    }

```

Your TensorFlow training thread could be some variant of this:

```csharp
// Example TensorFlow training thread

public Operation Optimizer { get; set; }
public Tensor Loss { get; set; }
public Tensor Input { get; set; } // = tf.placeholder(tf.float32, shape: new int[2] { -1, datasize }, name: "Input");

private void TensorflowTrainingThread()
{
    try
    {
        NDArray DataBatch;
        bool GotData;
        DataBatch = TrainBuffer_Get(out GotData);
        while (!GotData)
        {
            System.Threading.Thread.Sleep(1); // A thread sleep of less than 1 millisecond starts behaving like a spinwait.
            DataBatch = TrainBuffer_Get(out GotData);
        }

        Sess.run((Optimizer, Loss), (Input, DataBatch));
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.ToString());
    }
}
```

Spinning up the threads for the data preparation could be something like this:

```csharp
// Spinning up the threads for preparing data

System.Threading.ThreadStart TS;
System.Threading.Thread TrainPrepareThread;
m_IsTraining = true;
for (int I = 0; I < 4; I++) // 4 threads - tune this to the number of cores, the data preparation CPU cost, etc.
{
    TS = new System.Threading.ThreadStart(TrainPrepareData);
    TrainPrepareThread = new System.Threading.Thread(TS);
    TrainPrepareThread.Start();
}
```

If your GPU is fast enough, or if you are running multiple GPUs, you could do something like this to feed TensorFlow using multiple threads:

```csharp
// Feed TensorFlow using multiple threads, or not...
// Depending on your model, feeding it from multiple threads may produce
// slightly different training results. However, unless you are running
// multiple GPUs, a single feeding thread is usually more than enough.

List<Task> tasks = new List<Task>();
Task Runner;
int MaxRunners = 2; // concurrent task count - tune to your setup

// Inside your training loop:
if (MaxRunners > 1)
{
    Runner = Task.Run(() => TrainProcessBatch());
    tasks.Add(Runner);
    if (tasks.Count > MaxRunners)
    {
        Task.WaitAll(tasks.ToArray());
        tasks.Clear();
    }
}
else
{
    TrainProcessBatch();
}
```

Hope this helps.

Oceania2018 commented 4 years ago

@tcwicks It definitely helps us and other people who are using tf.net. NumSharp should be optimized in terms of performance. Thank you for the complete code sample. It would be great if you could push this code to the example project.

tcwicks commented 4 years ago

@Oceania2018 I am new to GitHub. Do I have access to push this to the example project?

Also, what I would really like to do is create a modular building-blocks project with various fully functional building blocks like this. I get stuck trying to help with core TensorFlow, but I am quite good at writing this kind of thing instead.

Actually, what I'm currently working on is writing a replacement for Unity ML-Agents using SciSharp TensorFlow.NET — a replacement as in fully multi-threaded, distributed, and allowing for modular custom brain designs, etc. I ended up here after 5 months of pure frustration with Python.

Also, a request: it would be really nice if we could have an overload of Sess.Run which takes an array of FeedItems but does NOT cast the return result to an NDArray, and instead returns just a float[] or a plain array.

The reason is that this way we can completely skip the overhead of NumSharp where NumSharp is not needed or performance is critical.

Sorry, I never said thanks for freeing us non-Python people from Python.

Oceania2018 commented 4 years ago

I've invited you to join the SciSharp STACK members. You can fork or create a new branch on tf.net.

Mghobadid commented 4 years ago

I forgot to mention that I use an Anaconda environment, and Anaconda's TensorFlow build uses the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN): https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide What about this? Could this be the cause of the speed difference?

solarflarefx commented 4 years ago

@Mghobadid were you able to solve your problem? I seem to be experiencing a difference in performance as well.

svenrog commented 3 years ago

The performance difference is actually quite big: running a rather deep model (~200 layers) can make the compute time go from seconds (Python) to minutes (.NET). Also: using SciSharp.TensorFlow.Redist-Windows-GPU with a GeForce 3080 is a couple of seconds slower than just running SciSharp.TensorFlow.Redist on an overclocked i7.

Oceania2018 commented 3 years ago

@svenrog It will help us if you can narrow down the root cause with sample code provided.