Open ludnic opened 1 year ago
I suspect it is referring to the 64 bits in the c_int64_t value so we can use it as an integer rather than a bit array. If I remember there were some bit functions in the stdlib, which may be useful here as a quick check. I'll pull in the stdlib as a dev dep.
great, I saw your changes. From the comment I thought the validity mask could be an array and wasn't sure how to handle it on the fortran side. but it seems to be a scalar (at least in the test)
ah I see now. it is always a scalar and gets reassigned.
actually reading this again. I'm still confused. validity seems to be an array in the general case. Looking at the alternative way that is suggested to check the validity of an entry:
idx_t entry_idx = row_idx / 64;
idx_t idx_in_entry = row_idx % 64;
bool is_valid = validity_mask[entry_idx] & (1 « idx_in_entry);
validity_mask
should be an array and if the row index is greater than 64 entry_idx
will point to the second element of the array (index 1). So it still seems we should get it in fortran as an array? In this test row_count is 40, so the entry_idx should be always 0.
latest fix correctly casts the validity pointer to a 64bit integer. The bit strings are now correct.
validity is a pointer to a variable in memory. In this case an int64, which in turn is an array of 64 bits.
In fortran, if you wanted to check the bit you could use: btest(validity,·row_idx+1)
I've added this to the code to demo
it seems to me that btest(validity, row_idx+1)
(btw probably the bit counting should start at 0 in btest?) does the same as validity & (1 << idx_in_entry)
, but what happens if row_idx > 64?
Noted a potential issue that I've not considered but which was highlighted in the
duckdb_validity_set_row_validity(validity,·0,·.false.)
function.
The helper function casts validity from int64 to c_int64_t and in doing so changes the reference of validity. This requires a recast to update validity - another c interop gotcha.
I've updated https://github.com/freevryheid/duckdb/blob/main/src/duckdb.f90#L2357 to illustrate.
I see the issue! Thanks for the note, I hadn't realised!
Makes me wonder if we should make the _ functions public as well. This would require the use of iso_c_binding but would negate the overhead of the helper functions.
I quite like the current look of the API with the helper functions, it seems quite clean and better than having to deal with the iso_c_binding types. maybe it is me not being used to it, but having to deal with the casting every time you want to deal with duckdb seems ugly and error prone so I think it's better to do the painful helpers just once.
latest fix returns validity as a c_ptr that then needs to be casted as a c_int64_t. Trick was to forego using a helper function for this as the reference created by the helper function for the returned validity is not pointing to the chunks reference - we had the same issue with the duckdb_validity_set_row_validity as indicated previously. This fix requires changes to the previous tests checking validity, I commented these out for testing. I'll address when I find time later.
fixed the validity tests - the drawback of this approach is that we need to work with pointers and import iso_c_binding but then this is a c library after all.
I have seen the changes in the validity tests and have used the approach for the chunk varchar test.
However I think there might be an issue with the validity mask still, since in this test the index of duckdb_validity_row_is_valid(vector_validity, i)
goes higher than 64 so I'm not sure what happens now that the validity mask is an int64
pointer.
@ludnic you were right from the start all along. So I guess validity is better handled as a c_ptr. It is an int64 array but not sure how many elements it comprises. It should still be possible to convert to a int64 array as in c_f_pointer(c_ptr, f_int64, shape=[]).
thanks for fixing it. just passing it on as a c_ptr and getting the right result from duckdb_validity_row_is_valid
works very well and you don't really need to access the validity mask directly I guess.
I'm still not clear how to deal with the pointers to vectors since we don't know the size. There is another case that is not as straightforward in the data_chunk_varchar_result test I'm looking at. I will make a separate issue in case you have suggestions.
the next steps in the data chunk test involve retrieving the validity mask of a vector:
and I was already looking earlier at the other vector functions. However from the api documentation I see:
the validity mask is a bitset saved in int64 (possibly array?). I was looking for ways of getting it into fortran but not sure what is the right way of doing it. I was thinking it should be an allocatable int64 array but not sure how to get the size.