Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.63k stars 987 forks source link

Remove non-API calls in C #6180

Open MichaelChirico opened 5 months ago

MichaelChirico commented 5 months ago

We are getting these NOTEs on CRAN, e.g.

Found non-API calls to R: ‘SETLENGTH’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’, ‘UNSET_S4_OBJECT’

I'm really skeptical about the SETLENGTH and SET_TRUELENGTH parts, which have long been a key underlying feature of data.table internals. I propose we try and keep using those unless CRAN really forces our hands. It be really great if someone wanted to do a performance comparison of the package with/without those calls.

Anyway, it should be easier to remove the non-API S4 calls.

The list does keep changing as this is an actively evolving piece of r-devel. Here's the current list of things we should (eventually) address:

tdhock commented 5 months ago

there is a web page which lists all of the API calls, grouped into 3 categories https://yutannihilation.github.io/R-fun-API/ see also the R-devel thread which is linked from that page

ecoRoland2 commented 5 months ago

You should probably be proactive and contact R-core regarding your need of an API to SETLENGTH and SET_TRUELENGTH. I would hate seeing slower performance in data.table. It could cause some difficult decisions at my day job.

MichaelChirico commented 5 months ago

if this is important for you (or anyone else reading) please take the time to construct a performance comparison.

ecoRoland2 commented 5 months ago

I would but unfortunately I think you need at least basic knowledge of C to do that.

ben-schwen commented 5 months ago

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment. But there is no safe way to handle SET_TRUELENGTH right?

tdhock commented 5 months ago

Where is SET_TRUELENGTH used? https://github.com/search?q=repo%3ARdatatable%2Fdata.table%20SET_TRUELENGTH&type=code one place is in rbindlist which is a pretty important function for performance. @Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

ben-schwen commented 5 months ago

Where is SET_TRUELENGTH used?

We use it as part of memrecycle and subsetting so its basically in most operations :D

But we are not alone with this problem https://github.com/r-lib/vctrs/issues/1933

Anirban166 commented 5 months ago

@Anirban166 can you please construct a PR with a single change SET_TRUELENGTH -> Rf_lengthgets in rbindlist along with an atime performance test?

Sure, but I'm wondering what the tests will be focused on (since this affects multiple functions) and then what about SETLENGTH?

MichaelChirico commented 5 months ago

AFAIU we can replace SETLENGTH calls with Rf_lengthgets calls and reassignment.

I don't see lengthgets mentioned in WRE.

By the same definition, touching S4 objects from C is also non-API! :thinking:

MichaelChirico commented 5 months ago

I see, lengthgets is just what length<- calls under the hood. But that induces a copy:

a = 1:10
address(a)
# [1] "0x5651b8564b30"
length(a) <- 20
address(a)
# [1] "0x5651b8ab5c68"
MichaelChirico commented 5 months ago

It looks like more of our usage has been marked as non-API in the meantime:

Result: NOTE 
  File ‘data.table/libs/data_table.so’:
    Found non-API calls to R: ‘LEVELS’, ‘NAMED’, ‘SETLENGTH’,
      ‘SET_GROWABLE_BIT’, ‘SET_S4_OBJECT’, ‘SET_TRUELENGTH’,
      ‘UNSET_S4_OBJECT’

new:

LEVELS():

https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L30-L32 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/bmerge.c#L20

NAMED():

https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L58-L63 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/assign.c#L548-L552

SET_GROWABLE_BIT():

https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/data.table.h#L10-L12 https://github.com/Rdatatable/data.table/blob/fab93a93979997aae6066a219e6f3f2c7168e707/src/freadR.c#L535

tdhock commented 5 months ago

there is a new section in WRE https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Moving-into-C-API-compliance

Jean-Romain commented 5 months ago

@tdhock there is nothing about SETLENGTH and SET_TRUELENGTH in this new section. I'm in trouble with that one too. I can remove the feature that uses it but I'm wondering how data.table team will handle as it is a key feature for data.table. See also https://stat.ethz.ch/pipermail/r-devel/2024-June/083449.html

jangorecki commented 5 months ago

I can imagine group by, when not gforce optimized, is setting true length as well. It allocates memory for a biggest group and then reuses that space for smaller groups as well. Similarly frollapply is doing the same for adaptive=TRUE (possibly not yet in master).

ben-schwen commented 5 months ago

This issue also makes our CI fail since we get now a new 3rd NOTE: Compiled code should not call non-API entry points in R. Should we alter the CI to allow 3 notes on dev?

jangorecki commented 5 months ago

Yes, for the moment till we address this issue. CI is used to spot new issues, this one is already known.

MichaelChirico commented 4 months ago

After #6312 and #6313 I'll bump the milestone here to the next release, I think we've put in good effort and definitely applied the fixes they recommend in R-exts. The others remain more difficult to satisfy & can be handled later.

jangorecki commented 3 months ago

Afair growable vectors are part of the public api, and those calls which apparently should be removed now are essential tools to work with growable vectors. That may be some argument to request official support for those.

MichaelChirico commented 3 months ago

Would be worth clarifying on r-devel then. Currently, SET_GROWABLE_BIT is under "Stuff that is not API and probably should not be but is getting used."

https://github.com/r-devel/r-svn/blob/f4e98ffcff71d5d04f60ab3201d58c32b4b6e944/src/include/Rinternals.h#L1236

And there's no mention in WRE of "growable"

jangorecki commented 3 months ago

Uh then it is possibly not a public api then. I had that impression when recalling how well it was announced when got implemented.

SebKrantz commented 1 month ago

Have you guys been in touch with R-core about the growable vectors issue? Or do you know of anyone who has been in touch with them regarding this? Seems to me := would be impossible without it, you'd have to reimplement chmatch() using a proper hash table, and also rbindlist() uses it, and probably other functions as well.

jangorecki commented 1 month ago

I don't think you mean growable vectors. AFAIU all three functions you mentioned were implemented in DT before growable vector happened in R.

SebKrantz commented 1 month ago

Ok, yes I mean the manipulation of the visible and true vector length with SETLENGTH and SET_TRUELENGTH - you use that to overallocate data.tables and make them behave like normal data.frames at the R-level which is effectively creating a growable vector - the vector is fixed but the overallocated part is hidden. In chmatch() you use the same functions to use the CHARSXP table as a hash table. Anyway, it seems to me := wouldn't be possible in an API conform way without this length manipulation - thus my question about R-core engagement.

aitap commented 1 month ago

Over-allocated vectors (including lists since R-4.3.0) can be implemented using ALTREP in something like 60-100 lines of code (separate implementation for each primitive type, unfortunately). I expect it to be hard but not impossible to (1) refactor data.table's use of SETLENGTH, TRUELENGTH, SET_GROWABLE_BIT into an abstract "over-allocated growable vector" interface and (2) reimplement it in ALTREP #if R_VERSION >= R_Version(4,3,0).

Use of SET_TRUELENGTH in string matching functions as a free integer-sized field in every CHARSXP will be much harder to get rid of. I'm not seeing any other fields in a CHARSXP that could be used in API-compliant way for this purpose, which leaves hash tables, which have different performance characteristics. Also, the C standard doesn't guarantee that if (void*)a == (void*)b, (uintptr_t)a == (uintptr_t)b will also hold.

SebKrantz commented 1 month ago

It's a nice idea, although it would significantly reduce backwards compatibility by one major version (data.table currently depends R >= 3.3.0) and I'm not sure what consequences ALTREP would have for the interopertability of data.table, particularly for package C code meant to manipulate standard data frames (and for storage as rds and reading if data.table is not loaded)? In any case, if maintainers are unsure how to proceed, perhaps invite R-core to the conversation (e.g. Thomas Kalibera (kalibera) or Luke Tierney (ltierney)). Seems also the vctrs team has not found a solution yet (https://github.com/r-lib/vctrs/issues/1933) so it seems this is not a small issue (and we're talking about probably a third or quarter of CRAN packages that depends on data.table and vctrs, so it must be of concern to R-core that a good solution is found here)

aitap commented 1 month ago

Part of the idea is to keep both implementations and use the original one for R ∈ [3.3.0; 4.3.0) and the ALTREP one for R ≥ 4.3.0. Maintenance burden, yes, but the old code is already quite stable, and the internals of old R versions aren't going to change.

The C code, even if it requests raw data pointers, shouldn't notice anything amiss: since the ALTREP class already carries a full-blown over-allocated object of the same type inside it, it can return its (writeable) data pointer when requested.

Serialization will have similar properties to what we have right now: if an ALTREP class does not implement a Serialize method, R will serialize the contents of the object as if it was an ordinary value, requiring a setDT() when it's reloaded. If we really want to, we can implement Serialize later, so that the ALTREP objects would come out already over-allocated.

An ALTREP data.table could also take care of duplicating itself properly, obviating a whole class of bugs resulting from it being shallowly duplicated by attr<- and many other R operations.

SebKrantz commented 1 month ago

Sounds Good. It would be ideal of course if other frameworks like collapse don't need to take additional actions to interoperate with data.table. I will stay updated with this (https://github.com/SebKrantz/collapse/issues/653). Happy to assist with the implementation as well, although I haven't used ALTREP yet.