freevryheid / duckdb

fortran bindings to duckdb c api
MIT License
9 stars 2 forks source link

validity mask #16

Open ludnic opened 1 year ago

ludnic commented 1 year ago

the next steps in the data chunk test involve retrieving the validity mask of a vector:

do chunk_idx = 0, chunk_count - 1
        chunk = duckdb_result_get_chunk(result, chunk_idx)
        row_count = duckdb_data_chunk_get_size(chunk)
        do row_idx = 0, row_count - 1
          do col_idx = 0, col_count - 1
            ! Get the column
            vector = duckdb_data_chunk_get_vector(chunk, col_idx);
            validity = duckdb_vector_get_validity(vector)

and I was already looking earlier at the other vector functions. However from the api documentation I see:

If all values are valid, this function MIGHT return NULL!
The validity mask is a bitset that signifies null-ness within the data chunk. 
It is a series of uint64_t values, where each uint64_t value contains validity for 64 tuples. 
The bit is set to 1 if the value is valid (i.e. not NULL) or 0 if the value is invalid (i.e. NULL).

the validity mask is a bitset saved in int64 (possibly array?). I was looking for ways of getting it into fortran but not sure what is the right way of doing it. I was thinking it should be an allocatable int64 array but not sure how to get the size.

freevryheid commented 1 year ago

I suspect it is referring to the 64 bits in the c_int64_t value so we can use it as an integer rather than a bit array. If I remember there were some bit functions in the stdlib, which may be useful here as a quick check. I'll pull in the stdlib as a dev dep.

ludnic commented 1 year ago

great, I saw your changes. From the comment I thought the validity mask could be an array and wasn't sure how to handle it on the fortran side. but it seems to be a scalar (at least in the test)

ludnic commented 1 year ago

ah I see now. it is always a scalar and gets reassigned.

ludnic commented 1 year ago

actually reading this again. I'm still confused. validity seems to be an array in the general case. Looking at the alternative way that is suggested to check the validity of an entry:

idx_t entry_idx = row_idx / 64; 
idx_t idx_in_entry = row_idx % 64; 
bool is_valid = validity_mask[entry_idx] & (1 « idx_in_entry);

validity_mask should be an array and if the row index is greater than 64 entry_idx will point to the second element of the array (index 1). So it still seems we should get it in fortran as an array? In this test row_count is 40, so the entry_idx should be always 0.

freevryheid commented 1 year ago

latest fix correctly casts the validity pointer to a 64bit integer. The bit strings are now correct.

validity is a pointer to a variable in memory. In this case an int64, which in turn is an array of 64 bits.

In fortran, if you wanted to check the bit you could use: btest(validity,·row_idx+1)

I've added this to the code to demo

lnicotina-rms commented 1 year ago

it seems to me that btest(validity, row_idx+1) (btw probably the bit counting should start at 0 in btest?) does the same as validity & (1 << idx_in_entry), but what happens if row_idx > 64?

freevryheid commented 1 year ago

Noted a potential issue that I've not considered but which was highlighted in the

duckdb_validity_set_row_validity(validity,·0,·.false.)

function.

The helper function casts validity from int64 to c_int64_t and in doing so changes the reference of validity. This requires a recast to update validity - another c interop gotcha.

I've updated https://github.com/freevryheid/duckdb/blob/main/src/duckdb.f90#L2357 to illustrate.

ludnic commented 1 year ago

I see the issue! Thanks for the note, I hadn't realised!

freevryheid commented 1 year ago

Makes me wonder if we should make the _ functions public as well. This would require the use of iso_c_binding but would negate the overhead of the helper functions.

ludnic commented 1 year ago

I quite like the current look of the API with the helper functions, it seems quite clean and better than having to deal with the iso_c_binding types. maybe it is me not being used to it, but having to deal with the casting every time you want to deal with duckdb seems ugly and error prone so I think it's better to do the painful helpers just once.

freevryheid commented 1 year ago

latest fix returns validity as a c_ptr that then needs to be casted as a c_int64_t. Trick was to forego using a helper function for this as the reference created by the helper function for the returned validity is not pointing to the chunks reference - we had the same issue with the duckdb_validity_set_row_validity as indicated previously. This fix requires changes to the previous tests checking validity, I commented these out for testing. I'll address when I find time later.

freevryheid commented 1 year ago

fixed the validity tests - the drawback of this approach is that we need to work with pointers and import iso_c_binding but then this is a c library after all.

ludnic commented 1 year ago

I have seen the changes in the validity tests and have used the approach for the chunk varchar test.

However I think there might be an issue with the validity mask still, since in this test the index of duckdb_validity_row_is_valid(vector_validity, i) goes higher than 64 so I'm not sure what happens now that the validity mask is an int64 pointer.

freevryheid commented 1 year ago

@ludnic you were right from the start all along. So I guess validity is better handled as a c_ptr. It is an int64 array but not sure how many elements it comprises. It should still be possible to convert to a int64 array as in c_f_pointer(c_ptr, f_int64, shape=[]).

ludnic commented 1 year ago

thanks for fixing it. just passing it on as a c_ptr and getting the right result from duckdb_validity_row_is_valid works very well and you don't really need to access the validity mask directly I guess.

I'm still not clear how to deal with the pointers to vectors since we don't know the size. There is another case that is not as straightforward in the data_chunk_varchar_result test I'm looking at. I will make a separate issue in case you have suggestions.