Double peak memory cost in `cast_memory_op`

Describe the bug

In esolver_ks_pw.cpp:

    this->kspw_psi = GlobalV::device_flag == "gpu" 
                         || GlobalV::precision_flag == "single"
                         ? new psi::Psi<T, Device>(this->psi[0])
                         : reinterpret_cast<psi::Psi<T, Device>*>(this->psi);

The constructor of `Psi` calls `cast_memory_op`, whose device-to-host specialization is:

    template <typename T_out, typename T_in>
    struct cast_memory<T_out, T_in, container::DEVICE_CPU, container::DEVICE_GPU> {
        void operator()(
            T_out* arr_out,
            const T_in* arr_in,
            const size_t& size)
        {
            auto * arr = (T_in*) malloc(sizeof(T_in) * size);
            cudaErrcheck(cudaMemcpy(arr, arr_in, sizeof(T_in) * size, cudaMemcpyDeviceToHost));
            for (int ii = 0; ii < size; ii++) {
                arr_out[ii] = static_cast<T_out>(arr[ii]);
            }
            free(arr);
        }
    };

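The temporary host buffer `arr` is as large as the whole input array (`sizeof(T_in) * size`), i.e. the same size as the Psi data being cast, so host memory peaks at roughly twice what is needed while the cast runs. This should be optimized as soon as possible.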
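
One possible direction, as a minimal sketch and not the project's actual fix: stage the copy through a fixed-size buffer so the temporary allocation is bounded by a chunk size instead of the full array. The names `cast_memory_chunked`, `copy_device_to_host`, and the `chunk` parameter are hypothetical. `copy_device_to_host` is a `std::memcpy` stand-in so the sketch runs without a GPU; in the real specialization it would be the `cudaMemcpy(..., cudaMemcpyDeviceToHost)` call shown above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost),
// used here only so the sketch can run on a machine without CUDA.
template <typename T>
void copy_device_to_host(T* dst, const T* src, std::size_t count) {
    std::memcpy(dst, src, sizeof(T) * count);
}

// Cast `size` elements from arr_in to arr_out using a staging buffer of at
// most `chunk` elements, so the temporary memory no longer scales with `size`.
template <typename T_out, typename T_in>
void cast_memory_chunked(T_out* arr_out,
                         const T_in* arr_in,
                         std::size_t size,
                         std::size_t chunk = 1 << 20) {
    if (size == 0 || chunk == 0) {
        return;
    }
    // Staging buffer: min(size, chunk) elements instead of `size` elements.
    std::vector<T_in> staging(std::min(size, chunk));
    for (std::size_t offset = 0; offset < size; offset += staging.size()) {
        // The last chunk may be shorter than the staging buffer.
        const std::size_t n = std::min(staging.size(), size - offset);
        copy_device_to_host(staging.data(), arr_in + offset, n);
        for (std::size_t ii = 0; ii < n; ++ii) {
            arr_out[offset + ii] = static_cast<T_out>(staging[ii]);
        }
    }
}
```

The chunk size trades temporary memory against the number of copy calls; the default of `1 << 20` elements is an arbitrary illustrative value, not a tuned one. When `T_in` and `T_out` are the same type, a direct copy into `arr_out` would avoid the staging buffer entirely.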

Expected behavior

No response

To Reproduce

No response

Environment

No response

Additional Context

No response

Task list for Issue attackers (only for developers)

[ ] Verify the issue is not a duplicate.
[ ] Describe the bug.
[ ] Steps to reproduce.
[ ] Expected behavior.
[ ] Error message.
[ ] Environment details.
[ ] Additional context.
[ ] Assign a priority level (low, medium, high, urgent).
[ ] Assign the issue to a team member.
[ ] Label the issue with relevant tags.
[ ] Identify possible related issues.
[ ] Create a unit test or automated test to reproduce the bug (if applicable).
[ ] Fix the bug.
[ ] Test the fix.
[ ] Update documentation (if necessary).
[ ] Close the issue and inform the reporter (if applicable).

deepmodeling / abacus-develop

Double peak memory cost in `cast_memory_op` #4153
