[FEA]: Allow default initialization for thrust vectors

bernhardmgruber commented 4 months ago

Is this a duplicate?

[X] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

Thrust

Is your feature request related to a problem? Please describe.

Thrust vectors perform value-initialization, like std::vector. That is, non-trivially constructible class types will have their constuctors run and all primitive types are zero-initialized. This pessimizes use cases where we want to create vectors which are inteded to be overwritten entirely before any read occurs. While smart compilers can optimize unnecessary initialization away for std::vector, this is much more difficult for e.g. thrust::device_vector, since it's constructor has to allocate device memory and run a kernel for initialization.

Describe the solution you'd like

I would like to have a constructor overload on all thrust vector's that performs default initialization. That is, non-trivially constructible class types will have their constuctors run but all primitive data types are left uninitialized. Since thrust vectors are often used with primitive types, there is great potential to save unnecessary kernel runs and discarded writes for vectors used to provide storage for outputs from thrust algorithms.

Note that I am not proposing to skip initialization. Default constructed elements contained in thrust vector do have their lifetimes begun and assigning/writing to them is defined behavior. Only reading from default initialized and only primitive elements is undefined behavior (before C++26) / erroneous behavior (since C++26) though, so this facility requires certain care.

Describe alternatives you've considered

Calling raw cudaMalloc and wrapping the result in a unique_ptr.

Additional context

No response

bernhardmgruber commented 4 months ago

The implementation takes about 32 lines.

jrhemstad commented 4 months ago

I think this would be a violation of principle of least surprise to have some types get value initialized and some types get default initialized.

You are correct that there is a need for the functionality you describe, but it should be a different type. Design and implementation of such a type is underway. See https://github.com/NVIDIA/cccl/issues/1502

bernhardmgruber commented 4 months ago

I think this would be a violation of principle of least surprise to have some types get value initialized and some types get default initialized.

I kind of disagree. thrust::device_vector<T>s appearing in random locations should retain the same promises as they did before. The escape hatch I am describing is intended for weakening the invariants in a local scope, from the point of construction of the vector to the point of writing to it. It is of course the expert programmer's responsibility to ensure this and there is potential for misuse.

I imagine the use of this facility like this. Old behavior:

thrust::device_vector<T> v(100); // first write
// full guarantees are given here
thrust::fill(v.begin(). v.end(), 42);// second write (overwrites)
// full guarantees are given here
v.resize(10000); // ordinary value initialization
// full guarantees are given here

New behavior:

thrust::device_vector<T> v(100, thrust::i_know_what_I_am_doing::default_init); // just default init
// partial guarantees are given here, no reading of primitive types allowed
thrust::fill(v.begin(). v.end(), 42);// first write
// full guarantees are given here
v.resize(10000); // ordinary value initialization
// full guarantees are given here

This behavior is btw. consistent with std::array<T, N> a;, T a[N], T t;, new T[n]. All of these operations construct non-trivially constructible class types but leave primitive types uninitialized. There is precedence.

This behavior is not equivalent to raw allocation using malloc or cudaMalloc (untyped and uninitialized), memory resources (same), or cuda::experimental::uninitialized_buffer<T> etc.

I further considered whether initialization should be a property of the allocator, but this will lead exactly to the problem you described. Suddenly a vector behaves totally different on all of it's operations.

You are correct that there is a need for the functionality you describe, but it should be a different type. Design and implementation of such a type is underway. See #1502

I also disagree. I don't want to have yet another vector type that only differs in the kind of initialization performed during construction, but is otherwise identical to any other thrust vectors. There is no use to make this difference in the type system:

void f(thrust::device_vector<T>& vec) { ... }
void f(thrust::default_initdevice_vector<T>& vec) { ... }

In both functions, I want vec to have the exact same behavior and offer the exact same guarantees.

AFAIU, the memory resources tracked in #1502 offer faciliities for uninitialized (and potentially untyped) memory, which is a different story and not the intent of my feature request.

jrhemstad commented 4 months ago

This behavior is btw. consistent with std::array<T, N> a;, T a[N], T t;, new T[n]. All of these operations construct non-trivially constructible class types but leave primitive types uninitialized. There is precedence.

All of those are distinct types from std::vector, and there is no precedent for the functionality you're describing with a std::vector.

Therefore, this tells me the precedence is to create a new type distinct from thrust::device_vector.

miscco commented 3 months ago

I am with @jrhemstad here. All examples above are static arrays where they follow the classical C initialization rules.

However, thrust::{device, host}_vector are containers that by default value initialize their elements.

That said, we are currently in the process of working on cuda::vector which will have the same issue. Currently we are leaning towards allowing a vector(size_t, uninitialized_t) constructor that would not initialize the elements at all, which is slightly different than default construction

bernhardmgruber commented 3 months ago

To add another example, boost::container::vector<T> has exactly the feature I would like to have in thrust: https://www.boost.org/doc/libs/1_85_0/doc/html/container/extended_functionality.html#container.extended_functionality.default_initialialization

bernhardmgruber commented 3 months ago

Currently we are leaning towards allowing a vector(size_t, uninitialized_t) constructor that would not initialize the elements at all, which is slightly different than default construction

This is quite dangerous to the inexperienced programmer, since you cannot, generically speaking, assign to elements of such a vector, but have to construct them first. I think that a tag to default initialize is the far safer option here, since it still omits initialization for primitive data types, but runs constructors when they exist.

miscco commented 3 months ago

I believe the important issue is that you want to be able to completely avoid launching any operation to initialize a memory range that you will be overwriting soon.

In my world those types where default initialization is actually a win are those types where assignment is also trivial, so the "danger" is not materialized

bernhardmgruber commented 3 months ago

We had a big discussion about this matter today and here is my summary:

We agreed on the following:

we want to have an API to allocate and maintain a typed piece of memory, bypassing the cost of initializing the elements contained in this piece of memory.
we like the API to be comparable in features to thrust::device_vector, but we can also live with a reduced API set and stick with thrust::device_vector where necessary (e.g. as destination for a remove_if followed by erase)
if we extended thrust::device_vector, we would extend the constructor and resize function with an overload taking a tag type (e.g. default_initialized_t)
such an API has been asked for by several users
this API is intended for advanced users who know what they do
the usual escape hatch to cudaMalloc is undesirable
we have strong use cases for this API ourselves in the unit test suite
the creating of this API is not of P0 priority

Alternatives:

The proposed cuda::experimental::uninitialized_buffer<T> was the most strongly endorsed alternative, mostly because of consistent behavior. The memory is always uninitialized independently of the element type. If T has a non-trivial constructor, the constructor is not run and accessing elements without placment new results in UB. This is fine, users need to know what they are doing. thrust::device_vector retains the guarantee to always be value initialized. No garbage values (default constructed ints or floats) exist. Both types, uninitialized_buffer<T> and device_vector<T>, maintain their behavior independently of their element type T. Furthermore, both types upheld their invariants immideately after construction.

By contrast, thrust::device_vector<T> v(n, default_init_t) would perform no initialization when T is trivially default-constructible, and would compile and run a kernel if T is not trivially default-constructible. Thus, the construction behavior is less visible in the type. Furthermore, the aforementioned constructur, when T is trivially default-constructible, leaves the vector v in a less usable state, since it cannot be read from (e.g., default constructed ints contain garbage values). A second step has to be performed (i.e., assigning to each element), before the usual guarantees of thrust::device_vector (any element can be fully read) are fully upheld. The two step initialization was generally perceived as unfulfilling. However, thrust::device_vector<T> guarantees that a non-trivial constructor will always be run, thus preferring safety over consistent behavior/performance.

Rapid's udevice_vector<T> was brought up for comparison, but determined to be safer, since it static_asserts if T is not trivially default-constructible, whereas cuda::experimental::uninitialized_buffer<T> will silently omit construction. It's the programmer's responsibility to handle element construction themselves.

It was noted that if T is a trivial type, cuda::experimental::uninitialized_buffer<T> and thrust::device_vector<T> v(n, default_init_t) behave the same, so the discussion shifted to what to with non-trivially-constructible Ts. Those types are less common, but examples would be complex<T> or a 4x4 matrix type that initializes to an identity matrix. It was argued that these types are often also trivially destructible and trivially copy-constructible, so it would not matter if the constructor is run or not, since values can be overwritten with memcpy (as permitted by trivial copy-constructability) anyway.

One person argued that the compile time of thrust::device_vectors is rather high, so we would want something simpler than thrust::device_vector<T> for the unit tests anyway.

Another argued that having a dedicated type to thrust::device_vector may require API duplication, since we would want to, after construction, use that type the same way and for the same things as thrust::device_vector (e.g. within Catch2).

Since we found use cases for both, default initialized and uninitialized memory, maybe both features are justified. If we provide a tag for thrust::device_vector for default initialization, we should also provide a tag for no initialization

Further design alternatives discussed:

using the allocator of thrust::device_vector to perform only default-initialization instead of value-initialization. This has the drawback that we cannot choose differently at each construction or resize how to initialize, since the behavior is fixed to the type. Also, we are uncertain how well custom allocators are supported in Thrust's existing interfaces.
using the new cuda::vector<T> with an appropriate memory resource
adding a constructor to thrust::device_vector from a cuda::experimental::uninitialized_buffer<T>

bernhardmgruber commented 4 days ago

I was recently reminded about this by a WG21 reflector discussion, pointing to this paper https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2021/p1072r10.html#tag, concluding that "LEWG should explore tags for allocator-aware containers". It also lists Boost as setting precedence for such a non-standard extension.

Interestingly, it also lists thrust::detail::temporary_array<T> as example of omitting value-initialization if is_trivially_copy_constructible<T> allows. This leaves me wondering why we are not using thrust::detail::temporary_array<T> EVERYWHERE?

NVIDIA / cccl