AIRLegend / aitrack

6DoF Head tracking software
MIT License
1.03k stars 102 forks

Very high CPU usage #119

Open SeTSmith opened 2 years ago

SeTSmith commented 2 years ago

Very high CPU usage with a Logitech C270HD on a 5900X: CPU usage goes up to 5% sometimes, and it maintains approx. 3.6% CPU usage.

I have tried various configurations: changing the model type (the three options), reducing FPS from 60 to 30, with and without preview. But I think it is taking too many resources; I'm not sure if this behaviour is normal.

AIRLegend commented 2 years ago

Hey!

That doesn't seem like very high CPU usage to me (note that all processing is done on the CPU). It should be similar to a YouTube video playing.

The only recommendation I could give you is to set the lowest possible resolution your camera supports. That could reduce it a little bit. However, most of the computation is related to the neural network(s).

I'll unmark this as a bug.

SeTSmith commented 2 years ago

Thanks for the quick response!

I tried lowering the webcam resolution (I was using 640x480) to half; it lowers the CPU usage, but the tracking is just not as good as at 640x480. I didn't read this tip in the tips section until this morning.

If this is the normal CPU usage, OK, good to know. I think it is very high, but if it is normal, accepted. (A YouTube video playing on my system takes between 0.6 and 0.8% at 1080p... from that to 5% there is a lot of difference.)

Anyway, I really appreciate the software, it is superb!

searching46dof commented 2 years ago

First, I would like to thank you all for a really good implementation. It works very well on my system. I was looking over the sources and noticed that CPU usage could be reduced by combining several for loops: the one in transpose and the ones inside cv::divide and cv::subtract.

```cpp
void ImageProcessor::normalize_and_transpose(cv::Mat& image, float* dest, int dim_x, int dim_y)
{
    const int stride = dim_x * dim_y;

    float* from = (float*)image.data;
    for (int channel = 0; channel < 3; channel++)
    {
        float& std_scaling_for_channel = std_scaling[channel];
        float& mean_scaling_for_channel = mean_scaling[channel];

        for (int i = 0; i < stride; i++)
        {
            float& from_element = from[channel + i * 3];
            from_element /= std_scaling_for_channel;  /* replaces the internal loop of cv::divide(image, std_scaling, image); */
            from_element -= mean_scaling_for_channel; /* replaces the internal loop of cv::subtract(image, mean_scaling, image); */

            dest[i + stride * channel] = from_element; /* transpose */
        }
    }
}
```

and in `Tracker::detect_face`:

```cpp
void Tracker::detect_face(...) {
    ...
#if 0
    improc.normalize(resized);
    improc.transpose((float*)resized.data, buffer_data);
#else
    improc.normalize_and_transpose(resized, buffer_data);
#endif
    ...
}
```

and in `Tracker::detect_landmarks`:

```cpp
void Tracker::detect_landmarks(...) {
    ...
#if 0
    improc.normalize(resized);
    improc.transpose((float*)resized.data, buffer_data);
#else
    improc.normalize_and_transpose(resized, buffer_data);
#endif
    ...
}
```

I tried to set up a build environment but have problems referencing external dependencies

searching46dof commented 2 years ago

In MAFilter::MAFilter, the dynamically allocated buffer this->circular_buffer is not initialized.

The function MAFilter::filter can also be optimized by caching the sum, which can simply be updated by subtracting the old value and adding the new one. The sample below initializes the buffer and optimizes the function. This would allow larger step counts for improved filtering to suppress spikes and reduce jitter.

```cpp
MAFilter::MAFilter(int steps, int array_size)
{
    ...
    this->circular_buffer = new float[steps * array_size];
    this->sum = new float[array_size];
    for (int i = 0; i < array_size; i++)
        this->sum[i] = nanf("");   // NaN marks an unfilled buffer
}

MAFilter::~MAFilter()
{
    delete[] this->circular_buffer;
    delete[] this->sum;
}

void MAFilter::filter(float* in_array, float* out_array)
{
    int offset = this->idx * this->array_size;

    for (int i = 0; i < this->array_size; i++)
    {
#if OPTIMIZE
        if (isnan(this->sum[i]))
        {
            // Initialize empty circular_buffer with the new value
            for (int j = 0; j < this->n_steps; j++)
            {
                this->circular_buffer[j * this->array_size + i] = in_array[i];
            }
            // Initialize sum
            this->sum[i] = in_array[i] * this->n_steps;
            // Calculate average
            out_array[i] = this->sum[i] / this->n_steps;
        }
        else
        {
            // Recalculate sum: drop the oldest value, add the new one
            this->sum[i] = this->sum[i] - this->circular_buffer[offset + i] + in_array[i];
            // Calculate average
            out_array[i] = this->sum[i] / this->n_steps;
            // Insert current position
            this->circular_buffer[offset + i] = in_array[i];
        }
#else
        // Insert current position
        this->circular_buffer[offset + i] = in_array[i];
        out_array[i] = 0;

        // Get mean of all steps for this position
        for (int j = 0; j < this->n_steps; j++)
        {
            out_array[i] += this->circular_buffer[j * this->array_size + i];
        }

        out_array[i] /= this->n_steps;
#endif
    }

    this->idx = (this->idx + 1) % this->n_steps;
}
```

searching46dof commented 2 years ago

For benchmark comparison, I'm running a Ryzen 5700G.

- Resolution (640x480 vs 320x240) doesn't seem to affect the CPU much.
- Setting the model (fast vs heavy) also doesn't seem to affect the CPU much.
- The FPS does affect the CPU: FPS = 60 -> CPU = ~12%; FPS = 30 -> CPU = ~6%.

AIRLegend commented 2 years ago

Thanks @searching46dof! It would be nice to check those changes. Sadly, I don't have access to a Windows machine (with a camera) right now...

It would be cool if you, somehow, managed to configure the project and test them.

However, when I profiled the app I found (obviously) that the NN prediction part (especially the landmark model) takes most of the computation...

searching46dof commented 2 years ago

I managed to set up a build environment. I need to find a way to instrument those functions with the same data to measure any speed improvements. However, the CPU utilization at both 60 FPS and 30 FPS was about the same, so it's likely more of the CPU was spent in other parts of the processing.

The cv::resize is suspect since it involves copying parts of a large buffer to a smaller buffer.

session_lm->Run seems to be single-threaded, as indicated in the constructor Tracker::Tracker. If it can be converted to multi-threaded, then each thread would only need to operate on a single row instead of the entire buffer. This would limit the contiguous run time, force preemption, and lower CPU utilization. Another way is to cooperatively preempt by inserting a sleep in the library functions between processing each row. You can also use sleep(0), which will not reduce CPU but does allow preemption so background threads run more smoothly instead of relying on active preemption.

Tracker::proc_heatmaps also performs linear searches for a maximum in the heatmaps for a specific landmark. Using a reference to heatmaps[offset] within the inner i loop, to avoid the two-term indexing heatmaps[offset + i], will improve performance.

The functions log, tan, atan are very CPU expensive. Maybe pre-populate a hash table to perform lookups.

In FaceData::to_string, concatenating std::string objects is expensive: each concatenation is a malloc followed by memcpys. Maybe use a snprintf followed by a single std::string construction. I only see it being used in PositionSolver::solve_rotation output to stdio; maybe disable it by default and enable it via a debugging option.

CPU util reduction will likely require multiple small optimizations instead of a single area.

searching46dof commented 2 years ago

When building at highest warning level W4 I detected 2 areas which may need clarification.

In the PositionSolver::PositionSolver constructor, some parameters (float prior_pitch, float prior_yaw, and float prior_distance) are not really used; the member variables are hardcoded:

```cpp
this->prior_pitch = -1.57;
this->prior_yaw = -1.57;
this->prior_distance = prior_distance * -2.;
```

These differ from the defaults in the method prototype (float prior_pitch = -2.f, float prior_yaw = -2.f, float prior_distance = -1.f). What was the reason for hardcoding them, and should the prototype defaults be set to the same values?

In PositionSolver::solve_rotation, the left-hand expression is a float, while the right-hand expression is a float with an int typecast:

```cpp
landmark_points_buffer.at<float>(i, j) = (int)face_data->landmark_coords[2 * contour_idx + j];
```

Was it intentional to truncate the fractional part of the value?

The AItracker vcproj was also configured with optimizations disabled. Was this intentional because of some issue?

AIRLegend commented 2 years ago

Those are the prior radian values for someone's head looking directly at the camera. During prototyping I found the PnP solving algorithm found better solutions by fixing that as a prior (as it's the natural "resting" position, most of the time the head rotation will be near that point). I think they'd be better off as default args for the constructor. If you don't mind, please, add it to your PR.

The second, I don't remember TBH. I think it's because of what you're saying. Those are supposed to be "pixels", so I don't want the fractional part. If removing the casting doesn't mess up the solvePnP solutions I'm okay with removing it.

Regarding the optimizations, I don't remember (again 😅). I believed the releases were being built with the highest optimization level. If setting it up doesn't break anything it should be active.

searching46dof commented 2 years ago

I was using #ifdef DEBUG_OUTPUT_FACE_DATA instead of #ifdef _DEBUG because of some strange condition where VS only showed Debug under the Configuration pulldown menu even though I see other configurations in the project. I needed to re-install VS.

Added an additional change to the boundary values for logit; clamping with the same resolution (10^-7) helps reduce jitter.

I'm trying to profile the changes, but those functions in the debug build are not showing up in various profilers (e.g. AMD uProf and even VS's integrated profiler). Eyeballing the CPU, it may be 1%-2% lower, down from 7%. I'll continue trying to profile the timing for more accurate metrics.

searching46dof commented 1 year ago

In Tracker::Tracker, configure the session for parallel processing instead of single-threaded mode to use less CPU:

```cpp
session_options.SetInterOpNumThreads(0); // size of the CPU thread pool used for executing multiple requests concurrently; 0 = use the default optimal thread count
session_options.SetIntraOpNumThreads(0); // size of the CPU thread pool used for executing a single graph; 0 = use the default optimal thread count
```