Open MichaelChirico opened 5 months ago
there is a web page which lists all of the API calls, grouped into 3 categories https://yutannihilation.github.io/R-fun-API/ see also the R-devel thread which is linked from that page
You should probably be proactive and contact R-core regarding your need of an API to SETLENGTH
and SET_TRUELENGTH
. I would hate seeing slower performance in data.table. It could cause some difficult decisions at my day job.
if this is important for you (or anyone else reading) please take the time to construct a performance comparison.
I would but unfortunately I think you need at least basic knowledge of C to do that.
AFAIU we can replace SETLENGTH
calls with Rf_lengthgets
calls and reassignment. But there is no safe way to handle SET_TRUELENGTH
right?
Where is SET_TRUELENGTH
used? https://github.com/search?q=repo%3ARdatatable%2Fdata.table%20SET_TRUELENGTH&type=code
one place is in rbindlist which is a pretty important function for performance.
@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH
-> Rf_lengthgets
in rbindlist along with an atime performance test?
Where is
SET_TRUELENGTH
used?
We use it as part of memrecycle and subsetting so its basically in most operations :D
But we are not alone with this problem https://github.com/r-lib/vctrs/issues/1933
@Anirban166 can you please construct a PR with a single change
SET_TRUELENGTH
->Rf_lengthgets
in rbindlist along with an atime performance test?
Sure, but I'm wondering what the tests will be focused on (since this affects multiple functions) and then what about SETLENGTH
?
AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment.
I don't see lengthgets
mentioned in WRE.
By the same definition, touching S4 objects from C is also non-API! :thinking:
I see, lengthgets
is just what length<-
calls under the hood. But that induces a copy:
a = 1:10
address(a)
# [1] "0x5651b8564b30"
length(a) <- 20
address(a)
# [1] "0x5651b8ab5c68"
It looks like more of our usage has been marked as non-API in the meantime:
Result: NOTE
File ‘data.table/libs/data_table.so’:
Found non-API calls to R: ‘LEVELS’, ‘NAMED’, ‘SETLENGTH’,
‘SET_GROWABLE_BIT’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’,
‘UNSET_S4_OBJECT’
new:
LEVELS()
:
https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L30-L32 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/bmerge.c#L20
NAMED()
:
https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L58-L63 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/assign.c#L548-L552
SET_GROWABLE_BIT()
:
https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L10-L12 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/freadR.c#L535
there is a new section in WRE https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Moving-into-C-API-compliance
@tdhock there is nothing about SETLENGTH
and SET_TRUELENGTH
in this new section. I'm in trouble with that one too. I can remove the feature that uses it but I'm wondering how data.table
team will handle as it is a key feature for data.table. See also https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html
I can imagine group by, when not gforce optimized, is setting true length as well. It allocates memory for a biggest group and then reuses that space for smaller groups as well. Similarly frollapply is doing the same for adaptive=TRUE (possibly not yet in master).
This issue also makes our CI fail since we get now a new 3rd NOTE: Compiled code should not call non-API entry points in R.
Should we alter the CI to allow 3 notes on dev?
Yes, for the moment till we address this issue. CI is used to spot new issues, this one is already known.
After #6312 and #6313 I'll bump the milestone here to the next release, I think we've put in good effort and definitely applied the fixes they recommend in R-exts. The others remain more difficult to satisfy & can be handled later.
Afair growable vectors are part of the public api, and those calls which apparently should be removed now are essential tools to work with growable vectors. That may be some argument to request official support for those.
Would be worth clarifying on r-devel then. Currently, SET_GROWABLE_BIT
is under "Stuff that is not API and probably should not be but is getting used."
And there's no mention in WRE of "growable"
Uh then it is possibly not a public api then. I had that impression when recalling how well it was announced when got implemented.
Have you guys been in touch with R-core about the growable vectors issue? Or do you know of anyone who has been in touch with them regarding this? Seems to me :=
would be impossible without it, you'd have to reimplement chmatch()
using a proper hash table, and also rbindlist()
uses it, and probably other functions as well.
I don't think you mean growable vectors. AFAIU all three functions you mentioned were implemented in DT before growable vector happened in R.
Ok, yes I mean the manipulation of the visible and true vector length with SETLENGTH
and SET_TRUELENGTH
- you use that to overallocate data.tables and make them behave like normal data.frames at the R-level which is effectively creating a growable vector - the vector is fixed but the overallocated part is hidden. In chmatch()
you use the same functions to use the CHARSXP table as a hash table. Anyway, it seems to me :=
wouldn't be possible in an API conform way without this length manipulation - thus my question about R-core engagement.
Over-allocated vectors (including lists since R-4.3.0) can be implemented using ALTREP in something like 60-100 lines of code (separate implementation for each primitive type, unfortunately). I expect it to be hard but not impossible to (1) refactor data.table
's use of SETLENGTH
, TRUELENGTH
, SET_GROWABLE_BIT
into an abstract "over-allocated growable vector" interface and (2) reimplement it in ALTREP #if R_VERSION >= R_Version(4,3,0)
.
Use of SET_TRUELENGTH
in string matching functions as a free integer-sized field in every CHARSXP
will be much harder to get rid of. I'm not seeing any other fields in a CHARSXP
that could be used in API-compliant way for this purpose, which leaves hash tables, which have different performance characteristics. Also, the C standard doesn't guarantee that if (void*)a == (void*)b
, (uintptr_t)a == (uintptr_t)b
will also hold.
It's a nice idea, although it would significantly reduce backwards compatibility by one major version (data.table currently depends R >= 3.3.0) and I'm not sure what consequences ALTREP would have for the interopertability of data.table, particularly for package C code meant to manipulate standard data frames (and for storage as rds and reading if data.table is not loaded)? In any case, if maintainers are unsure how to proceed, perhaps invite R-core to the conversation (e.g. Thomas Kalibera (kalibera) or Luke Tierney (ltierney)). Seems also the vctrs team has not found a solution yet (https://github.com/r-lib/vctrs/issues/1933) so it seems this is not a small issue (and we're talking about probably a third or quarter of CRAN packages that depends on data.table and vctrs, so it must be of concern to R-core that a good solution is found here)
Part of the idea is to keep both implementations and use the original one for R ∈ [3.3.0; 4.3.0) and the ALTREP one for R ≥ 4.3.0. Maintenance burden, yes, but the old code is already quite stable, and the internals of old R versions aren't going to change.
The C code, even if it requests raw data pointers, shouldn't notice anything amiss: since the ALTREP class already carries a full-blown over-allocated object of the same type inside it, it can return its (writeable) data pointer when requested.
Serialization will have similar properties to what we have right now: if an ALTREP class does not implement a Serialize
method, R will serialize the contents of the object as if it was an ordinary value, requiring a setDT()
when it's reloaded. If we really want to, we can implement Serialize
later, so that the ALTREP objects would come out already over-allocated.
An ALTREP data.table could also take care of duplicating itself properly, obviating a whole class of bugs resulting from it being shallowly duplicated by attr<-
and many other R operations.
Sounds Good. It would be ideal of course if other frameworks like collapse don't need to take additional actions to interoperate with data.table. I will stay updated with this (https://github.com/SebKrantz/collapse/issues/653). Happy to assist with the implementation as well, although I haven't used ALTREP yet.
We are getting these NOTEs on CRAN, e.g.
Found non-API calls to R: ‘SETLENGTH’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’, ‘UNSET_S4_OBJECT’
I'm really skeptical about the SETLENGTH and SET_TRUELENGTH parts, which have long been a key underlying feature of data.table internals. I propose we try and keep using those unless CRAN really forces our hands. It be really great if someone wanted to do a performance comparison of the package with/without those calls.
Anyway, it should be easier to remove the non-API S4 calls.
The list does keep changing as this is an actively evolving piece of r-devel. Here's the current list of things we should (eventually) address: