fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

GTX 1050 Ti performs worse than GTX 770 #1707

Open Sevilla404 opened 6 years ago

Sevilla404 commented 6 years ago

I'm using xmr-stak 2.4.5 on Ubuntu 16.04 on a Core i5 with a GTX 770 and a GTX 1050 Ti. I'm mining with the cryptonight-fast algorithm (msr) and I'm getting worse performance on the 1050 Ti. I noticed the 1050 Ti is only using 1 GB of memory. The nvidia.txt shown is the default generated by xmr-stak.

"gpu_threads_conf" : [
  // gpu: GeForce GTX 1050 Ti architecture: 61
  //      memory: 3981/4040 MiB
  //      smx: 6
  { "index" : 0,
    "threads" : 28, "blocks" : 18,
    "bfactor" : 2, "bsleep" : 0,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
  // gpu: GeForce GTX 770 architecture: 30
  //      memory: 1945/1998 MiB
  //      smx: 8
  { "index" : 1,
    "threads" : 36, "blocks" : 24,
    "bfactor" : 0, "bsleep" : 0,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
],

HASHRATE REPORT - NVIDIA
| ID |   10s |   60s |  15m | ID |   10s |   60s |  15m |
|  0 | 547.5 | 549.5 | (na) |  1 | 656.6 | 656.7 | (na) |
Totals (NVIDIA): 1204.1 1206.1 0.0 H/s

Mon Jul 9 23:54:14 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     On   | 00000000:01:00.0 N/A |                  N/A |
| 69%   79C    P0    N/A /  N/A |   1782MiB /  1998MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 105...  On   | 00000000:04:00.0 Off |                  N/A |
| 40%   60C    P0    N/A /  74W |   1073MiB /  4040MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
|    1      9373      C   ./xmr-stak                                  1063MiB |
+-----------------------------------------------------------------------------+

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

cmake -LA:

-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Found CUDA: /usr/local/cuda (found suitable version "8.0", minimum required is "7.5")
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.0")
-- Found OpenSSL: /usr/lib/x86_64-linux-gnu/libssl.so;/usr/lib/x86_64-linux-gnu/libcrypto.so (found version "1.0.2g")
fatal: Not a git repository (or any of the parent directories): .git
fatal: Not a git repository (or any of the parent directories): .git
-- Configuring done
-- Generating done
-- Build files have been written to: /crypt/xmr-stak-2.4.5
-- Cache values
CMAKE_AR:FILEPATH=/usr/bin/ar
CMAKE_BUILD_TYPE:STRING=Release
CMAKE_COLOR_MAKEFILE:BOOL=ON
CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
CMAKE_CXX_FLAGS:STRING=
CMAKE_CXX_FLAGS_DEBUG:STRING=-g
CMAKE_CXX_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG
CMAKE_CXX_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
CMAKE_C_COMPILER:FILEPATH=/usr/bin/cc
CMAKE_C_FLAGS:STRING=
CMAKE_C_FLAGS_DEBUG:STRING=-g
CMAKE_C_FLAGS_MINSIZEREL:STRING=-Os -DNDEBUG
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG
CMAKE_EXE_LINKER_FLAGS:STRING=
CMAKE_EXE_LINKER_FLAGS_DEBUG:STRING=
CMAKE_EXE_LINKER_FLAGS_MINSIZEREL:STRING=
CMAKE_EXE_LINKER_FLAGS_RELEASE:STRING=
CMAKE_EXE_LINKER_FLAGS_RELWITHDEBINFO:STRING=
CMAKE_EXPORT_COMPILE_COMMANDS:BOOL=OFF
CMAKE_INSTALL_PREFIX:PATH=/crypt/xmr-stak-2.4.5
CMAKE_LINKER:FILEPATH=/usr/bin/ld
CMAKE_LINK_STATIC:BOOL=OFF
CMAKE_MAKE_PROGRAM:FILEPATH=/usr/bin/make
CMAKE_MODULE_LINKER_FLAGS:STRING=
CMAKE_MODULE_LINKER_FLAGS_DEBUG:STRING=
CMAKE_MODULE_LINKER_FLAGS_MINSIZEREL:STRING=
CMAKE_MODULE_LINKER_FLAGS_RELEASE:STRING=
CMAKE_MODULE_LINKER_FLAGS_RELWITHDEBINFO:STRING=
CMAKE_NM:FILEPATH=/usr/bin/nm
CMAKE_OBJCOPY:FILEPATH=/usr/bin/objcopy
CMAKE_OBJDUMP:FILEPATH=/usr/bin/objdump
CMAKE_RANLIB:FILEPATH=/usr/bin/ranlib
CMAKE_SHARED_LINKER_FLAGS:STRING=
CMAKE_SHARED_LINKER_FLAGS_DEBUG:STRING=
CMAKE_SHARED_LINKER_FLAGS_MINSIZEREL:STRING=
CMAKE_SHARED_LINKER_FLAGS_RELEASE:STRING=
CMAKE_SHARED_LINKER_FLAGS_RELWITHDEBINFO:STRING=
CMAKE_SKIP_INSTALL_RPATH:BOOL=NO
CMAKE_SKIP_RPATH:BOOL=NO
CMAKE_STATIC_LINKER_FLAGS:STRING=
CMAKE_STATIC_LINKER_FLAGS_DEBUG:STRING=
CMAKE_STATIC_LINKER_FLAGS_MINSIZEREL:STRING=
CMAKE_STATIC_LINKER_FLAGS_RELEASE:STRING=
CMAKE_STATIC_LINKER_FLAGS_RELWITHDEBINFO:STRING=
CMAKE_STRIP:FILEPATH=/usr/bin/strip
CMAKE_VERBOSE_MAKEFILE:BOOL=FALSE
CPU_ENABLE:BOOL=ON
CUDA_64_BIT_DEVICE_CODE:BOOL=ON
CUDA_ARCH:STRING=30;35;37;50;52;20;60;61;62
CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE:BOOL=ON
CUDA_BUILD_CUBIN:BOOL=OFF
CUDA_BUILD_EMULATION:BOOL=OFF
CUDA_COMPILER:STRING=nvcc
CUDA_CUDART_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcudart.so
CUDA_CUDA_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libcuda.so
CUDA_ENABLE:BOOL=ON
CUDA_GENERATED_OUTPUT_DIR:PATH=
CUDA_HOST_COMPILATION_CPP:BOOL=ON
CUDA_HOST_COMPILER:FILEPATH=/usr/bin/cc
CUDA_KEEP_FILES:BOOL=OFF
CUDA_NVCC_EXECUTABLE:FILEPATH=/usr/local/cuda/bin/nvcc
CUDA_NVCC_FLAGS:STRING=
CUDA_NVCC_FLAGS_DEBUG:STRING=
CUDA_NVCC_FLAGS_MINSIZEREL:STRING=
CUDA_NVCC_FLAGS_RELEASE:STRING=
CUDA_NVCC_FLAGS_RELWITHDEBINFO:STRING=
CUDA_PROPAGATE_HOST_FLAGS:BOOL=ON
CUDA_SDK_ROOT_DIR:PATH=CUDA_SDK_ROOT_DIR-NOTFOUND
CUDA_SEPARABLE_COMPILATION:BOOL=OFF
CUDA_SHOW_CODELINES:BOOL=OFF
CUDA_SHOW_REGISTER:BOOL=OFF
CUDA_TARGET_CPU_ARCH:STRING=
CUDA_TOOLKIT_INCLUDE:PATH=/usr/local/cuda/include
CUDA_TOOLKIT_ROOT_DIR:PATH=/usr/local/cuda
CUDA_TOOLKIT_TARGET_DIR:PATH=/usr/local/cuda
CUDA_USE_STATIC_CUDA_RUNTIME:BOOL=ON
CUDA_VERBOSE_BUILD:BOOL=OFF
CUDA_VERSION:STRING=8.0
CUDA_cublas_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcublas.so
CUDA_cudart_static_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcudart_static.a
CUDA_cufft_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcufft.so
CUDA_cupti_LIBRARY:FILEPATH=/usr/local/cuda/extras/CUPTI/lib64/libcupti.so
CUDA_curand_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcurand.so
CUDA_cusolver_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcusolver.so
CUDA_cusparse_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libcusparse.so
CUDA_nppc_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libnppc.so
CUDA_nppi_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libnppi.so
CUDA_npps_LIBRARY:FILEPATH=/usr/local/cuda/lib64/libnpps.so
CUDA_rt_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/librt.so
EXECUTABLE_OUTPUT_PATH:STRING=bin
HWLOC:FILEPATH=/usr/lib/x86_64-linux-gnu/libhwloc.so
HWLOC_ENABLE:BOOL=ON
HWLOC_INCLUDE_DIR:PATH=/usr/include
LIBRARY_OUTPUT_PATH:STRING=bin
MHTD:FILEPATH=/usr/lib/x86_64-linux-gnu/libmicrohttpd.so
MICROHTTPD_ENABLE:BOOL=ON
MTHD_INCLUDE_DIR:PATH=/usr/include
OPENSSL_CRYPTO_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libcrypto.so
OPENSSL_INCLUDE_DIR:PATH=/usr/include
OPENSSL_SSL_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libssl.so
OpenCL_ENABLE:BOOL=ON
OpenCL_INCLUDE_DIR:PATH=/usr/include
OpenCL_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libOpenCL.so
OpenSSL_ENABLE:BOOL=ON
PKG_CONFIG_EXECUTABLE:FILEPATH=/usr/bin/pkg-config
XMR-STAK_COMPILE:STRING=native
XMR-STAK_LARGEGRID:BOOL=ON
XMR-STAK_THREADS:STRING=0

./xmr-stak --version-long
Version: xmr-stak/2.4.5/b3f79de3/unknown/lin/nvidia-amd-cpu/aeon-cryptonight-monero/20

clinfo: clinfo: /usr/local/cuda-8.0/targets/x86_64-linux/lib/libOpenCL.so.1: no version information available (required by clinfo) Number of platforms 1 Platform Name NVIDIA CUDA Platform Vendor NVIDIA Corporation Platform Version OpenCL 1.2 CUDA 9.2.106 Platform Profile FULL_PROFILE Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer Platform Extensions function suffix NV

Platform Name NVIDIA CUDA Number of devices 2 Device Name GeForce GTX 1050 Ti Device Vendor NVIDIA Corporation Device Vendor ID 0x10de Device Version OpenCL 1.2 CUDA Driver Version 396.26 Device OpenCL C Version OpenCL C 1.2 Device Type GPU Device Profile FULL_PROFILE Device Topology (NV) PCI-E, 04:00.0 Max compute units 6 Max clock frequency 1392MHz Compute Capability (NV) 6.1 Device Partition (core) Max number of sub-devices 1 Supported partition types None Max work item dimensions 3 Max work item sizes 1024x1024x64 Max work group size 1024 Preferred work group size multiple 32 Warp size (NV) 32 Preferred / native vector sizes char 1 / 1 short 1 / 1 int 1 / 1 long 1 / 1 half 0 / 0 (n/a) float 1 / 1 double 1 / 1 (cl_khr_fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations No Address bits 64, Little-Endian Global memory size 4236312576 (3.945GiB) Error Correction support No Max memory allocation 1059078144 (1010MiB) Unified memory for Host and Device No Integrated memory (NV) No Minimum alignment for any data type 128 bytes Alignment of base address 4096 bits (512 bytes) Global Memory cache type Read/Write Global Memory cache size 98304 Global Memory cache line 128 bytes Image support Yes Max number of samplers per kernel 32 Max size for 1D images from buffer 134217728 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x32768 pixels Max 3D image size 16384x16384x16384 pixels Max number of read image args 256 Max number of 
write image args 16 Local memory type Local Local memory size 49152 (48KiB) Registers per block (NV) 65536 Max constant buffer size 65536 (64KiB) Max number of constant args 9 Max size of kernel argument 4352 (4.25KiB) Queue properties Out-of-order execution Yes Profiling Yes Prefer user sync for interop No Profiling timer resolution 1000ns Execution capabilities Run OpenCL kernels Yes Run native kernels No Kernel execution timeout (NV) No Concurrent copy and kernel execution (NV) Yes Number of async copy engines 2 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Available Yes Compiler Available Yes Linker Available Yes Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

Device Name GeForce GTX 770 Device Vendor NVIDIA Corporation Device Vendor ID 0x10de Device Version OpenCL 1.2 CUDA Driver Version 396.26 Device OpenCL C Version OpenCL C 1.2 Device Type GPU Device Profile FULL_PROFILE Device Topology (NV) PCI-E, 01:00.0 Max compute units 8 Max clock frequency 1202MHz Compute Capability (NV) 3.0 Device Partition (core) Max number of sub-devices 1 Supported partition types None Max work item dimensions 3 Max work item sizes 1024x1024x64 Max work group size 1024 Preferred work group size multiple 32 Warp size (NV) 32 Preferred / native vector sizes char 1 / 1 short 1 / 1 int 1 / 1 long 1 / 1 half 0 / 0 (n/a) float 1 / 1 double 1 / 1 (cl_khr_fp64) Half-precision Floating-point support (n/a) Single-precision Floating-point support (core) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations Yes Double-precision Floating-point support (cl_khr_fp64) Denormals Yes Infinity and NANs Yes Round to nearest Yes Round to zero Yes Round to infinity Yes IEEE754-2008 fused multiply-add Yes Support is emulated in software No Correctly-rounded divide and sqrt operations No Address bits 64, Little-Endian Global memory size 2095710208 (1.952GiB) Error Correction support No Max memory allocation 523927552 (499.7MiB) Unified memory for Host and Device No Integrated memory (NV) No Minimum alignment for any data type 128 bytes Alignment of base address 4096 bits (512 bytes) Global Memory cache type Read/Write Global Memory cache size 131072 Global Memory cache line 128 bytes Image support Yes Max number of samplers per kernel 32 Max size for 1D images from buffer 134217728 pixels Max 1D or 2D image array size 2048 images Max 2D image size 16384x16384 pixels Max 3D image size 4096x4096x4096 pixels Max number of read image args 256 Max number of write image args 16 Local memory type Local Local memory 
size 49152 (48KiB) Registers per block (NV) 65536 Max constant buffer size 65536 (64KiB) Max number of constant args 9 Max size of kernel argument 4352 (4.25KiB) Queue properties Out-of-order execution Yes Profiling Yes Prefer user sync for interop No Profiling timer resolution 1000ns Execution capabilities Run OpenCL kernels Yes Run native kernels No Kernel execution timeout (NV) No Concurrent copy and kernel execution (NV) Yes Number of async copy engines 1 printf() buffer size 1048576 (1024KiB) Built-in kernels Device Available Yes Compiler Available Yes Linker Available Yes Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer

NULL platform behavior clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) No platform clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) No platform clCreateContext(NULL, ...) [default] No platform clCreateContext(NULL, ...) [other] Success [NV] clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No platform clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) No platform

Neither card has been overclocked.

Spudz76 commented 6 years ago

Clone the 1050 section twice more so there are three host threads running on the single card and see what it does total. It will not combine the stats for you as the hashrate reports are by (host) thread, so it will show four nvidias (or three with the 770 commented out).

Try the 36/24 combo from the 770. The default is to use `3 * smx` for blocks, and it is no smarter than that, but when adjusting blocks you should use increments of smx. Then `threads * blocks * 2` is the approximate memory required in MB for 2 MB CryptoNight, so keep that below the 3981 MiB free. Also, threads can't be above 64 or CUDA will complain and fail at startup. At some point too many blocks will choke the smx, so you have to hunt for the happy point, but at least with the rules above there aren't that many combinations to try.
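The rules above narrow the search considerably. A minimal Python sketch that lists the candidates, assuming the approximate 2 MB-per-hash figure, the 64-thread cap, and the smx-multiple rule exactly as stated in this comment. The function name, the thread step of 4, and the upper bound of 8 × smx on blocks are illustrative choices, not part of xmr-stak:

```python
def candidate_configs(smx, free_mib, mem_per_hash_mib=2,
                      max_threads=64, thread_step=4, max_block_multiple=8):
    """Yield (threads, blocks, approx_mib) combos that satisfy the rules of thumb."""
    # blocks move in increments of smx; the upper bound is an arbitrary cut-off
    for blocks in range(smx, max_block_multiple * smx + 1, smx):
        for threads in range(thread_step, max_threads + 1, thread_step):
            approx_mib = threads * blocks * mem_per_hash_mib
            if approx_mib < free_mib:  # must fit in free GPU memory
                yield threads, blocks, approx_mib


# GTX 1050 Ti values from the autoconfig comment: smx 6, 3981 MiB free
configs = list(candidate_configs(smx=6, free_mib=3981))

for threads, blocks, approx_mib in configs:
    print(f"threads={threads:2d} blocks={blocks:2d} ~{approx_mib} MiB")
```

By this estimate the autoconfig default of 28 threads × 18 blocks needs about 1008 MiB, close to the 1063 MiB that nvidia-smi reports for xmr-stak on the 1050 Ti, and 36 × 24 would need about 1728 MiB. Each candidate still has to be benchmarked; the list only rules out combinations that break the stated limits.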

Autoconfig doesn't work too well when there is more memory than cores, and it doesn't support host thread multiplication. Host thread multiplication may also not work that well on certain CPU/motherboard setups (how things are routed electrically, or how many cores the CPU has that aren't also mining), which has much to do with why it's a manual process on the 10xx series: autoconfig could break just as many users by assuming their PCI routing can handle host threads when it can't. Note the 770 has 2 more cores (8 vs 6); ironically, it is a better card for cryptonight because of that (but, watts and features...). Parallel host threads can recover some of the performance and increase memory usage, even if the cores are somewhat choking on the workload.

Use `--benchmark 7 --benchwait 2 --benchwork 13` to run a 15-second total test and save time. The default `--benchmark 7` wastes a lot of time for no reason and runs much longer than needed for a rough hashrate test.

Spudz76 commented 6 years ago

CUDA 8 apps also work better on older drivers like 387.xx. They have some problems with the CUDA 9.2 runtime (in the latest drivers, >395), and there was no speed difference with the CUDA 9.1 runtime (between 390 and 395).

Either adjust driver (thus runtime) to match what the miner is compiled for, or compile for the runtime version (check driver readme/pdf, get whatever CUDA SDK for that version). Helps a lot when they match / at least gets rid of mysteries.
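For this system, both versions can be read from the output in the original post. A minimal Python sketch, assuming the CUDA version in the clinfo platform string reflects what driver 396.26 provides and the nvcc release is what the miner was built against; the helper name is hypothetical:

```python
import re


def cuda_major_minor(text):
    """Return (major, minor) from the first 'release X.Y' or 'CUDA X.Y' found, else None."""
    match = re.search(r"(?:release|CUDA)\s+(\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None


# Strings copied from the nvcc and clinfo output in the original post
nvcc_line = "Cuda compilation tools, release 8.0, V8.0.61"
clinfo_line = "Platform Version OpenCL 1.2 CUDA 9.2.106"

toolkit = cuda_major_minor(nvcc_line)    # toolkit the miner was compiled with
runtime = cuda_major_minor(clinfo_line)  # version reported by the installed driver

if toolkit != runtime:
    print(f"Mismatch: built with CUDA {toolkit[0]}.{toolkit[1]}, "
          f"driver provides CUDA {runtime[0]}.{runtime[1]}")
```

Here the build uses CUDA 8.0 while the driver reports 9.2, which is the mismatch described above.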

Sevilla404 commented 6 years ago

Thanks for your comments..

I tried cloning the 1050 Ti section twice and changing the threads and blocks. I'm very new to this topic:

// gpu: GeForce GTX 1050 Ti architecture: 61
//      memory: 3981/4040 MiB
//      smx: 6
{ "index" : 0,
  "threads" : 16, "blocks" : 30,
  "bfactor" : 2, "bsleep" : 0,
  "affine_to_cpu" : false, "sync_mode" : 3,
},

{ "index" : 0,
  "threads" : 16, "blocks" : 30,
  "bfactor" : 2, "bsleep" : 0,
  "affine_to_cpu" : false, "sync_mode" : 3,
},

{ "index" : 0,
  "threads" : 16, "blocks" : 30,
  "bfactor" : 2, "bsleep" : 0,
  "affine_to_cpu" : false, "sync_mode" : 3,
},

I'm getting spikes in the NVIDIA hashrate report: sometimes I get 200/500 H/s and sometimes only "(na)".


HASHRATE REPORT - NVIDIA
| ID |   10s |   60s |  15m | ID |   10s |   60s |  15m |
|  0 | 281.3 | 176.2 | (na) |  1 | 281.3 | 383.0 | (na) |
|  2 |  (na) |  (na) | (na) |  3 | 669.3 | 669.3 | (na) |
Totals (NVIDIA): 1231.9 1228.5 0.0 H/s

HASHRATE REPORT - NVIDIA
| ID |   10s |   60s |  15m | ID |   10s |   60s |  15m |
|  0 | 562.6 | 322.7 | (na) |  1 |  (na) | 314.3 | (na) |
|  2 |  (na) |  (na) | (na) |  3 | 669.3 | 669.3 | (na) |
Totals (NVIDIA): 1231.9 1306.3 0.0 H/s
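Because the report lists host threads rather than cards, the per-card figure has to be added up by hand. A minimal Python sketch using the 60s column of the second report, assuming IDs 0 to 2 are the three 1050 Ti host threads and ID 3 is the 770 (this mapping is inferred from the config order, not shown in the report itself):

```python
# 60s column of the second report; None stands for "(na)"
report_60s = {0: 322.7, 1: 314.3, 2: None, 3: 669.3}

# Assumed mapping of host-thread ID to card, following the config order
thread_to_card = {0: "GTX 1050 Ti", 1: "GTX 1050 Ti", 2: "GTX 1050 Ti", 3: "GTX 770"}


def per_card_totals(report, mapping):
    """Sum per-thread hashrates by card, counting "(na)" entries as zero."""
    totals = {}
    for thread_id, rate in report.items():
        card = mapping[thread_id]
        totals[card] = totals.get(card, 0.0) + (rate or 0.0)
    return {card: round(total, 1) for card, total in totals.items()}


totals = per_card_totals(report_60s, thread_to_card)
for card, total in totals.items():
    print(f"{card}: {total} H/s")
```

With these figures the 1050 Ti threads add up to about 637 H/s against 669.3 H/s for the 770, and the two together match the reported total of 1306.3 H/s.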

Sevilla404 commented 6 years ago

Maybe I was wrong to believe my 1050 Ti would get more H/s than my GTX 770. So I think all the configs are OK and I was wrong trying to get more H/s. Thanks anyway!