NVIDIA / CUDALibrarySamples

CUDA Library Samples

cg_example error #178

Closed jlxy11 closed 4 months ago

jlxy11 commented 6 months ago

matrix name: parabolic_fem.mtx
num. rows: 525825
num. cols: 525825
nnz: 4200450
structure: symmetric

Matrix parsing...
Testing CG
cuSPARSE API failed at line 517 with error: zero pivot (9)

essex-edwards commented 6 months ago

@jlxy11 It looks like this is not a universal problem, but depends on some combination of environmental factors. Please provide information about your environment. What version of the toolkit are you using, which compiler, compiler flags, what CUDA hardware, which OS, which drivers, etc.

jlxy11 commented 6 months ago

@essex-edwards Thank you for your reply. I built the program with VS2019 on Windows 11, using CUDA Toolkit 12.0. The hardware is an RTX 4080 laptop GPU. No other changes were made except linking the cuSPARSE and cuBLAS libraries.

In addition, I made a small change to the code: I moved the part shown in the picture outside the mtx_parsing function, because the original code reported an error there.

[Image: screenshot of the code moved outside mtx_parsing]

essex-edwards commented 6 months ago

@jlxy11 I have reproduced the error you are seeing. I get a zero pivot on line 532, not line 517, but it seems likely that we are encountering the same error. I don't have a fix or a workaround for you. Thank you for the bug report.

Assorted details below:

I'm using MSVC 2022 (Version 17.9.2), Windows 11, Toolkit version 12.3, and a laptop with an RTX 3500. This is a little different from your setup. It might explain why the reported line number is different.

I had to do the same change to IdxType and sort_by_row that you did. I also had to change the CMake:

 target_link_libraries(${ROUTINE}_example
-    PUBLIC cudart cusparse cublas
+    PUBLIC CUDA::cudart CUDA::cusparse CUDA::cublas
 )

and replace fseek with fgetc/ungetc:

@@ -96,6 +96,16 @@ typedef struct VecStruct {

 //==============================================================================

+int fpeek(FILE* stream)
+{
+    int c;
+    c = fgetc(stream);
+    ungetc(c, stream);
+    return c;
+}
 void mtx_header(const char* file_path,
                 int*        num_lines,
                 int*        num_rows,
@@ -123,14 +133,21 @@ void mtx_header(const char* file_path,
     }
     token = strtok(NULL, " \n"); // symmetric, unsymmetric
     *is_symmetric = (strcmp(token, "symmetric") == 0);
-    while (fgetc(file) == '%')
+    while (fpeek(file) == '%')
         fgets(buffer, 256, file); // skip % comments
-    fseek(file, -1, SEEK_CUR);
     fscanf(file, "%d %d %d", num_rows, num_cols, num_lines);
     *nnz = (*is_symmetric) ? *num_lines * 2 : *num_lines;
     fclose(file);
 }
jlxy11 commented 6 months ago

Thank you very much for your reply. As you mentioned, I made two changes to this code. I commented out the line fseek(file, -1, SEEK_CUR); in the mtx_header function so that mtx_header runs correctly, and this does not affect the mtx_parsing function. However, I don't understand why these two problems occur. I will try to solve this problem with the method you provided later. Finally, thank you again for your patient answer! @essex-edwards

essex-edwards commented 4 months ago

@jlxy11 We updated the cg_example in this commit https://github.com/NVIDIA/CUDALibrarySamples/commit/9a7897fb0c4f4a718178b310fa4f0034451e8a14 . The example should work now, without a zero pivot error.