Closed: QingleiCao closed this 4 months ago.
I still think that the right approach is to make `bsiz` a `size_t`. Also, the build failure is legit (stemming from the added assert):
```
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c: In function 'parsec_tiled_matrix_init':
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:98:31: error: 'INT_MAX' undeclared (first use in this function)
   98 |     assert((size_t)mb * nb <= INT_MAX);
      |                               ^~~~~~~
/tmp/parsec/parsec/parsec/data_dist/matrix/matrix.c:19:1: note: 'INT_MAX' is defined in header '<limits.h>'; did you forget to '#include <limits.h>'?
   18 | #include "parsec/data_dist/matrix/two_dim_tabular.h"
  +++ |+#include <limits.h>
```
I'm not sure whether there is potential side-effect as @abouteiller pointed out.
INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions. You really intend to create data that is over 2GB ?
> I'm not sure whether there is potential side-effect as @abouteiller pointed out.

Unless we pass a pointer to `bsiz` somewhere that expects an `int*`, there should be no side-effect. Compiling with `-Wall` should help find any place where that happens, or simply a grep.
> INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions.

The `int` of `INT_MAX` will be promoted to `size_t` as per the C integer conversion rules (a 64-bit integer has higher rank than a 32-bit integer). That assert is legit, albeit not very helpful unless applications run with a debug build of PaRSEC.
A quick grep showed no place where we pass a pointer to `bsiz`. There are places where we cast `bsiz` to `size_t` already:

```
parsec/data_dist/matrix/two_dim_rectangle_cyclic.c: pos *= (size_t)dc->super.bsiz;
parsec/data_dist/matrix/two_dim_rectangle_cyclic.c: pos *= (size_t)dc->super.bsiz;
```

and in `matrix.c` we store the result of `fread`/`fwrite` in an `int` and compare against `bsiz`. Those functions return `size_t` though, so that could be adjusted as well.
Rule 6.3.1.8 applies here:

> Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

The conversion will always work here because we are dealing with a constant with a positive value.
> You really intend to create data that is over 2GB?

Yeah, each timeslot includes L^2/2 ZGEMV; the size of each vector is L. Those L^2/2 vectors are stored in a file, so I'm loading them together. L can be 720 or 1440. If L = 720, then 720^2/2 * 720 * sizeof(complex double) = 2,985,984,000 bytes.
Yes, the other places directly related to `bsiz` are already protected by a cast to `size_t`.

Any suggestions? Should we change `bsiz` to `size_t`, or is this fine since mb * nb > INT_MAX is rare?
Rule 6.3.1.8 applies here:

> Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

The conversion will always work here because we are dealing with a constant with a positive value.
Let's run through this text: `INT_MAX` is a signed integer. `size_t` is an unsigned integer with rank higher than `int` (32 vs 64 bits on 64-bit archs).

> if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand

So yes, this rule applies: `size_t` is the unsigned integer type that has rank greater than `int`, which is the signed integer type with lower rank.
> then the operand with signed integer type is converted to the type of the operand with unsigned integer type.

This requires that `INT_MAX` (an `int`) is promoted to the type of the operand with the unsigned type (`size_t`).
Now, if we were to compare `size_t` against `INT_MIN`, we would be in trouble: promoting a negative number to an unsigned type of higher rank produces a huge positive number, because of the 2's complement representation. But that's not happening here, and it would be a dubious thing to do in the first place.
> Any suggestions? Changing bsiz to size_t or this is fine as mb * nb > INT_MAX is rare?

Things that encode counts should be `size_t` to avoid future overflows. The check we're talking about then becomes irrelevant and we can all go home happy :) There is of course the issue of sending such a tile via MPI, which can be solved via the big-count API that hopefully we can use soon. My understanding though is that in @QingleiCao's case the data remains local.
@devreal I'm confused by your comment: it started as a stark contradiction to my statement, yet in the end it basically states exactly the same thing. 😕
Anyway, this PR looks good as it is.
@bosilca You said:
> INT_MAX is not a size_t so even with the correct header the compiler will generate a warning and the comparison will be done in the lesser precisions.
I don't think that is what the standard says and I hope I did not convey that in my comment :)
If the right-hand side were of a different signedness and not a constant (a constant lets the compiler verify that the conversion is safe), then the compiler would have warned that the conversion is unsafe.
> Things that encode counts should be `size_t` to avoid future overflows. [...] My understanding though is that in @QingleiCao's case the data remains local.
Yeah, in my case, it's local without communication involved.
Well, `INT_MAX` instead of `INT_MIN`, right? @devreal
> there is of course the issue of sending such a tile via MPI, which can be solved via the big-count API that hopefully we can use soon.

Indeed, that was my first reaction reading all this. The drawback is that we have to use the large-count variants for everything, not just the types.
> You really intend to create data that is over 2GB?
>
> Yeah, each timeslot includes L^2/2 ZGEMV; the size of each vector is L. Those L^2/2 vectors are stored in a file, so I'm loading them together. L can be 720 or 1440. If L = 720, then 720^2/2 * 720 * sizeof(complex double) = 2,985,984,000 bytes.

I'm not sure how the patch fixes your issue (at best it delays it to when you want to load a matrix that is 8x larger?): `bsiz` is still an `int` and will still be truncated for some reasonable sizes (apparently so, since they are less than one order of magnitude above the one you use now). What is our reasoning for not promoting `bsiz` to a `size_t` overall?
I will change `bsiz` to `size_t`. I think we also need to change the matrix size as well; I spent quite a while debugging an issue because the matrix (1*N) was larger than INT_MAX (N > INT_MAX).
The matrix size is already cast in the high-level distribution API, e.g., `parsec_matrix_sym_block_cyclic_init`, so this PR is only about `bsiz`.