Wayne171171 commented 2 years ago

Hello tom91136

I am working with the miniBUDE/sycl project and trying to implement it on oneAPI. However, the contents of the data files (.in files) in data/bm1 and data/bm2 look almost like garbled text. Could you upload the correct data?

zjin-lcf commented 2 years ago

Are these .in files in binary format? Thanks.

Wayne171171 commented 2 years ago

Right, that's what I was talking about! Sorry for the late reply.

I also have another question: if I want to run it on Intel DevCloud using their FPGA hardware, how should I write the makefile and makefile.FPGA? Could you share them? (I am referring to this project, miniBUDE/sycl, to modify my code.)

zjin-lcf commented 2 years ago

Do you mean that these binary .in files are incorrect because they are not representative of real data?

https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming may be helpful for your work.

Wayne171171 commented 2 years ago

No, I mean the problem shown in the picture below, so I cannot tell whether the files are correct.

[image: image.png]

zjin-lcf commented 2 years ago

The image is not visible...

Wayne171171 commented 2 years ago

The image shows garbled contents inside the .in file. Were you able to read the data successfully when you ran your code?

zjin-lcf commented 2 years ago

Do you consider a binary file to be a garbled text file?

A sample run:

./main --deck ../minibude-sycl/data/bm1 --wgsize 256 --iterations 100
Poses : 65536
Iterations: 100
Ligands : 26
Proteins : 938
Deck : ../minibude-sycl/data/bm1
WG : 256

Kernel time: 2311.810 ms
Average time: 23.118 ms
Interactions/s: 69.136 billion
GFLOP/s: 2836.202
Largest difference was 0.003%.
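
Because the .in files are binary, a text editor will show unreadable characters even when the data is fine. A minimal sketch of how such a file could be inspected instead, under the assumption (not confirmed anywhere in this thread) that it stores native-endian 32-bit floats back to back. `read_bytes` and `decode_floats` are hypothetical helper names, not part of miniBUDE:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Read a whole file as raw bytes. Binary mode avoids newline translation.
// Returns an empty vector if the file cannot be opened.
std::vector<char> read_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: reinterpret the bytes as consecutive floats.
// ASSUMPTION: the buffer holds native-endian 32-bit floats. Trailing bytes
// that do not fill a whole float are ignored.
std::vector<float> decode_floats(const std::vector<char>& bytes) {
    std::vector<float> values(bytes.size() / sizeof(float));
    assert(values.size() * sizeof(float) <= bytes.size());
    if (!values.empty()) {
        // memcpy is the portable way to reinterpret raw bytes as floats.
        std::memcpy(values.data(), bytes.data(), values.size() * sizeof(float));
    }
    return values;
}
```

Printing the first few values returned by `decode_floats(read_bytes(path))` and checking that they look like plausible numbers is a quicker sanity check than opening the file in an editor.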

Wayne171171 commented 2 years ago

ok, I will try it.

Wayne171171 commented 2 years ago

Hi zjin

Sorry, that was my mistake (I took a binary file for a garbled text file). However, I ran into some problems when running your code on the DevCloud FPGA hardware, as described below.

Following your project in HeCBench/minibude-sycl, I combined bude.h, kernel.cpp and main.cpp into one .cpp file, and it runs on Intel DevCloud on the GPU, the CPU and the FPGA emulator. However, when I try to run it on the Intel DevCloud FPGA hardware, I get errors. Could you give me some suggestions on how to modify the code? The code seems to hit CL_INVALID_BINARY and CL_INVALID_WORK_GROUP issues, which produce a runtime_error and an nd_range_error respectively.

Error messages:

for arria10 error:
terminate called after throwing an instance of 'cl::sycl::runtime_error'
what(): Native API failed. Native API returns: -42 (CL_INVALID_BINARY) -42 (CL_INVALID_BINARY)
make: *** [run_fpga_hardware] Aborted

for stratix10 error:
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP

zjin-lcf commented 2 years ago

Oh, it is not your fault. Is there a SYCL device_selector for FPGAs? Would you mind posting the error message at https://community.intel.com/t5/Toolkits-SDKs/ct-p/toolkits-sdks ? The developers there will answer your question.

Wayne171171 commented 2 years ago

Hi zjin

Yes, I added the following parts to the code.

At the top of the file I added:

#if FPGA || FPGA_EMULATOR
#include <sycl/ext/intel/fpga_extensions.hpp>
#endif

In the runkernel function I added:

#if FPGA_EMULATOR
// DPC++ extension: FPGA emulator selector on systems without FPGA card.
ext::intel::fpga_emulator_selector d_selector;
#elif FPGA
// DPC++ extension: FPGA selector on systems with FPGA card.
ext::intel::fpga_selector d_selector;
#else
// The default device selector will select the most performant device.
default_selector d_selector;
// gpu_selector d_selector;
#endif
try {
// queue q(d_selector, dpc_common::exception_handler);
sycl::queue q(d_selector, dpc_common::exception_handler);
// queue q(d_selector);
. . . . .

OK, I will submit a case via the Intel community link. If you have any idea how to modify the code, please give me some suggestions.

Thanks

Wayne171171 commented 2 years ago

Hi zjin

According to the error message, I think these issues are related to local_work_size and global_work_size. Maybe the local_work_size exceeds the global_work_size when running on the Intel FPGA hardware. Do you have any idea how to set a smaller size, or otherwise change the size definitions, in your fasten_main BUDE kernel function?

void fasten_main(
    handler& h,
    // size_t posesPerWI,
    size_t wgSize,
    size_t ntypes, size_t nposes,
    size_t natlig, size_t natpro,
    accessor<Atom, 1, sycl_read, sycl_gmem> protein_molecule,
    accessor<Atom, 1, sycl_read, sycl_gmem> ligand_molecule,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_0,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_1,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_2,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_3,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_4,
    accessor<float, 1, sycl_read, sycl_gmem> transforms_5,
    accessor<FFParams, 1, sycl_read, sycl_gmem> forcefield,
    accessor<float, 1, sycl_discard_write, sycl_gmem> etotals) {

constexpr const auto FloatMax = std::numeric_limits<float>::max();

size_t global = ceil((nposes) / static_cast<double> (NUM_TD_PER_THREAD));
global = wgSize * ceil(static_cast<double> (global) / wgSize);

accessor<FFParams, 1, sycl_read_write, sycl_lmem> local_forcefield(range<1>(ntypes), h);

h.parallel_for<class bude_kernel>(nd_range<1>(global, wgSize), [=](nd_item<1> item) {

    const size_t lid = item.get_local_id(0);
    const size_t gid = item.get_group(0);
    const size_t lrange = item.get_local_range(0);

    float etot[NUM_TD_PER_THREAD];
    cl::sycl::float3 lpos[NUM_TD_PER_THREAD];
    cl::sycl::float4 transform[NUM_TD_PER_THREAD][3];

    size_t ix = gid * lrange * NUM_TD_PER_THREAD + lid;
    ix = ix < nposes ? ix : nposes - NUM_TD_PER_THREAD;

    for (int i = lid; i < ntypes; i += lrange) local_forcefield[i] = forcefield[i];
    //if (ix < ntypes) local_forcefield[ix] = forcefield[ix];

    // Compute transformation matrix to private memory
    for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
        size_t index = ix + i * lrange;

        const float sx = cl::sycl::sin(transforms_0[index]);
        const float cx = cl::sycl::cos(transforms_0[index]);
        const float sy = cl::sycl::sin(transforms_1[index]);
        const float cy = cl::sycl::cos(transforms_1[index]);
        const float sz = cl::sycl::sin(transforms_2[index]);
        const float cz = cl::sycl::cos(transforms_2[index]);

        transform[i][0].x() = cy * cz;
        transform[i][0].y() = sx * sy * cz - cx * sz;
        transform[i][0].z() = cx * sy * cz + sx * sz;
        transform[i][0].w() = transforms_3[index];
        transform[i][1].x() = cy * sz;
        transform[i][1].y() = sx * sy * sz + cx * cz;
        transform[i][1].z() = cx * sy * sz - sx * cz;
        transform[i][1].w() = transforms_4[index];
        transform[i][2].x() = -sy;
        transform[i][2].y() = sx * cy;
        transform[i][2].z() = cx * cy;
        transform[i][2].w() = transforms_5[index];

        etot[i] = ZERO;
    }

    item.barrier(access::fence_space::local_space);

    // Loop over ligand atoms
    size_t il = 0;
    do {
        // Load ligand atom data
        const Atom l_atom = ligand_molecule[il];
        const FFParams l_params = local_forcefield[l_atom.type];
        const bool lhphb_ltz = l_params.hphb < ZERO;
        const bool lhphb_gtz = l_params.hphb > ZERO;

        const cl::sycl::float4 linitpos(l_atom.x, l_atom.y, l_atom.z, ONE);
        for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
            // Transform ligand atom
            lpos[i].x() = transform[i][0].w() +
                linitpos.x() * transform[i][0].x() +
                linitpos.y() * transform[i][0].y() +
                linitpos.z() * transform[i][0].z();
            lpos[i].y() = transform[i][1].w() +
                linitpos.x() * transform[i][1].x() +
                linitpos.y() * transform[i][1].y() +
                linitpos.z() * transform[i][1].z();
            lpos[i].z() = transform[i][2].w() +
                linitpos.x() * transform[i][2].x() +
                linitpos.y() * transform[i][2].y() +
                linitpos.z() * transform[i][2].z();
        }

        // Loop over protein atoms
        size_t ip = 0;
        do {
            // Load protein atom data
            const Atom p_atom = protein_molecule[ip];
            const FFParams p_params = local_forcefield[p_atom.type];

            const float radij = p_params.radius + l_params.radius;
            const float r_radij = 1.f / (radij);

            const float elcdst = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? FOUR : TWO;
            const float elcdst1 = (p_params.hbtype == HBTYPE_F && l_params.hbtype == HBTYPE_F) ? QUARTER : HALF;
            const bool type_E = ((p_params.hbtype == HBTYPE_E || l_params.hbtype == HBTYPE_E));

            const bool phphb_ltz = p_params.hphb < ZERO;
            const bool phphb_gtz = p_params.hphb > ZERO;
            const bool phphb_nz = p_params.hphb != ZERO;
            const float p_hphb = p_params.hphb * (phphb_ltz && lhphb_gtz ? -ONE : ONE);
            const float l_hphb = l_params.hphb * (phphb_gtz && lhphb_ltz ? -ONE : ONE);
            const float distdslv = (phphb_ltz ? (lhphb_ltz ? NPNPDIST : NPPDIST) : (lhphb_ltz ? NPPDIST : -FloatMax));
            const float r_distdslv = 1.f / (distdslv);

            const float chrg_init = l_params.elsc * p_params.elsc;
            const float dslv_init = p_hphb + l_hphb;

            for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
                // Calculate distance between atoms
                const float x = lpos[i].x() - p_atom.x;
                const float y = lpos[i].y() - p_atom.y;
                const float z = lpos[i].z() - p_atom.z;

                const float distij = cl::sycl::sqrt(x * x + y * y + z * z);

                // Calculate the sum of the sphere radii
                const float distbb = distij - radij;
                const bool zone1 = (distbb < ZERO);

                // Calculate steric energy
                etot[i] += (ONE - (distij * r_radij)) * (zone1 ? 2 * HARDNESS : ZERO);

                // Calculate formal and dipole charge interactions
                float chrg_e = chrg_init * ((zone1 ? 1 : (ONE - distbb * elcdst1)) * (distbb < elcdst ? 1 : ZERO));
                const float neg_chrg_e = -cl::sycl::fabs(chrg_e);
                chrg_e = type_E ? neg_chrg_e : chrg_e;
                etot[i] += chrg_e * CNSTNT;

                // Calculate the two cases for Nonpolar-Polar repulsive interactions
                const float coeff = (ONE - (distbb * r_distdslv));
                float dslv_e = dslv_init * ((distbb < distdslv && phphb_nz) ? 1 : ZERO);
                dslv_e *= (zone1 ? 1 : coeff);
                etot[i] += dslv_e;
            }
        } while (++ip < natpro); // loop over protein atoms
    } while (++il < natlig); // loop over ligand atoms

    // Write results
    const size_t td_base = gid * lrange * NUM_TD_PER_THREAD + lid;

    if (td_base < nposes) {
        for (size_t i = 0; i < NUM_TD_PER_THREAD; i++) {
            etotals[td_base + i * lrange] = etot[i] * HALF;
        }
    }
    });
}

Wayne171171 commented 2 years ago

Hi zjin

Also, why is DEFAULT_WGSIZE defined as 4, and what does it mean? How can I change it? I guess this parameter is also related to the issues I mentioned above.

#ifndef DEFAULT_WGSIZE
#define DEFAULT_WGSIZE 4
#endif

thanks

zjin-lcf commented 2 years ago

size_t global = ceil((nposes) / static_cast<double> (NUM_TD_PER_THREAD));
global = wgSize * ceil(static_cast<double> (global) / wgSize);

accessor<FFParams, 1, sycl_read_write, sycl_lmem> local_forcefield(range<1>(ntypes), h);

h.parallel_for<class bude_kernel>(nd_range<1>(global, wgSize), [=](nd_item<1> item) {

For the global and local work sizes:

range<1> gws (global);
range<1> lws (wgSize);
h.parallel_for<class bude_kernel>(nd_range<1>(gws, lws), [=](nd_item<1> item) {

The work-group size can be set on the command line using "--wgsize". Do the above changes help? If not, please wait for answers from Intel.
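
The two lines that compute `global` round the number of work-items up to a whole number of work-groups. A host-only sketch of the same arithmetic in integer form (no SYCL needed); `ceil_div` and `round_up_global` are hypothetical names, and the value 4 used for NUM_TD_PER_THREAD in the note below is only an illustrative assumption:

```cpp
#include <cassert>
#include <cstddef>

// Integer ceiling division: the smallest q such that q * b >= a.
std::size_t ceil_div(std::size_t a, std::size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Hypothetical helper mirroring the two `global = ...` lines:
// first one work-item per `posesPerItem` poses, then round that count up
// to the next multiple of the work-group size.
std::size_t round_up_global(std::size_t nposes,
                            std::size_t posesPerItem,
                            std::size_t wgSize) {
    const std::size_t items = ceil_div(nposes, posesPerItem);
    return wgSize * ceil_div(items, wgSize);
}
```

With 65536 poses, 4 poses per work-item and a work-group size of 256, this gives 16384 work-items in 64 groups. The result is always a multiple of `wgSize`, which is the property a uniform nd_range requires, so this arithmetic alone would not be expected to produce a non-uniform range.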

Wayne171171 commented 2 years ago

Hi zjin

OK, I will try to modify wgSize. Can I change this definition directly in the code (#define DEFAULT_WGSIZE 4)? I ask because I see it is used as params.wgSize = DEFAULT_WGSIZE;

Also, do you know where I can find the original definition of the ceil function? I might need to refer to it when deciding how to change wgSize.

    size_t global = ceil((nposes) / static_cast<double> (NUM_TD_PER_THREAD));
    global = wgSize * ceil(static_cast<double> (global) / wgSize);
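
On changing the default: because the definition is wrapped in an `#ifndef` guard, the value 4 is only a fallback. It can be overridden at compile time (for example with a `-DDEFAULT_WGSIZE=64` compiler flag) or at run time through `--wgsize`. A minimal sketch of that pattern, where `parse_wgsize` is a hypothetical stand-in for the benchmark's real argument parsing, which is not shown in this thread:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fallback used only when the build does not define DEFAULT_WGSIZE itself.
#ifndef DEFAULT_WGSIZE
#define DEFAULT_WGSIZE 4
#endif

// Hypothetical parser: start from the compile-time default and let a
// "--wgsize <n>" pair on the command line override it.
std::size_t parse_wgsize(const std::vector<std::string>& args) {
    std::size_t wgSize = DEFAULT_WGSIZE;
    for (std::size_t i = 0; i + 1 < args.size(); ++i) {
        if (args[i] == "--wgsize") {
            wgSize = std::stoul(args[i + 1]);
        }
    }
    return wgSize;
}
```

Editing the `#define` in the source also works, but the command-line option avoids a rebuild for every value tried.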

Wayne171171 commented 2 years ago

I have sent the case to Intel, but they haven't responded yet. So maybe we can try to debug it ourselves first.

Wayne171171 commented 2 years ago

Another issue is CL_INVALID_BINARY. After checking some websites, it seems to be related to the program binary.

CL_INVALID_BINARY if an invalid program binary was encountered for any device. binary_status will return specific status for each device.

https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateProgramWithBinary.html

zjin-lcf commented 2 years ago

https://cplusplus.com/reference/cmath/ceil/

Wayne171171 commented 2 years ago

Hi zjin

If I want to experiment with decreasing the total local size, how can I change the parameters directly in the code? I am using Visual Studio to compile the code. Could you give me some suggestions on which parameter values to change?

zjin-lcf commented 2 years ago

Any updates from the Intel forum about the errors?

Wayne171171 commented 2 years ago

It seems they are on holiday... [image: image.png]

Wayne171171 commented 2 years ago

Hi zjin

Does it make sense to try wgsize values such as 256, 128, 64, 32, 16, ...? Or should I try other numbers?

zjin-lcf commented 2 years ago

Was the work-group size a power of two when the runtime reported the following errors?

what(): Native API failed. Native API returns: -42 (CL_INVALID_BINARY) -42 (CL_INVALID_BINARY)
make: *** [run_fpga_hardware] Aborted

for stratix10 error:
terminate called after throwing an instance of 'cl::sycl::nd_range_error'
what(): Non-uniform work-groups are not supported by the target device -54 (CL_INVALID_WORK_GROUP

Wayne171171 commented 2 years ago

Yes, it is a power of two. I tried 256 and 128 by changing the --wgsize option.
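
Both properties discussed in this exchange can be checked on the host before a kernel is launched. A small sketch with hypothetical helper names; it does not explain the FPGA errors above, it only helps rule out these two causes:

```cpp
#include <cstddef>

// True when n is a power of two (1, 2, 4, 8, ...). Zero is not.
bool is_power_of_two(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// True when a global range splits into whole work-groups of size wgSize,
// i.e. the nd_range would be uniform.
bool divides_evenly(std::size_t global, std::size_t wgSize) {
    return wgSize != 0 && global % wgSize == 0;
}
```

For the sizes mentioned here, 256 and 128 are both powers of two and both divide a global size of 16384 evenly, whereas a size such as 96 would satisfy neither check.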

Wayne171171 commented 2 years ago

Maybe I can try to reduce the complexity of the BUDE kernel, for example by simplifying the arithmetic operations, and then try again on DevCloud?

zjin-lcf commented 2 years ago

Debugging a simplified kernel is a good idea. My SSH access to DevCloud has been failing for a while. I will try to build the mini-bude example for an FPGA once that issue is solved.

Wayne171171 commented 2 years ago

Thanks for your support and replies! Because I am doing my project in Bristol, I have to complete as many of the tasks my professor needs as possible.

I will continue to try it.

zjin-lcf commented 2 years ago

You are welcome. I hope the course instructor can work with you to understand the cause of the errors.

UoB-HPC / miniBUDE Public/data problem #25
