RcppCore / Rcpp

Seamless R and C++ Integration
https://www.rcpp.org
GNU General Public License v2.0
735 stars 211 forks source link

DataFrame.push_back() fails for zero-row columns #1232

Closed jeroen closed 2 years ago

jeroen commented 2 years ago

Downstream bug report: https://github.com/ropensci/pdftools/issues/108 I think the bug is that DataFrame.push_back() gives an error when trying to add a column to a data frame with 0 rows:

DataFrame test (int n) {

  IntegerVector foo(n);
  CharacterVector bar(n);
  CharacterVector baz(n);

  Rcpp::DataFrame df = DataFrame::create(
    _["foo"] = foo,
    _["bar"] = bar
  );
  df.push_back(baz, "baz");
  return df;
}
test(0);
# Error in FUN(X[[i]], ...) : is.data.frame(df) is not TRUE
# In addition: Warning messages:
# 1: In poppler_pdf_data(loadfile(pdf), font_info, opw, upw) :
#   Column sizes are not equal in DataFrame::push_back, object degrading to List
eddelbuettel commented 2 years ago

Thanks, I'll try to take a look, should be possible to cover that case.

On the other hand ... I think that case is known and is simply a 15-year old design want meaning you should

eddelbuettel commented 2 years ago

This (actually complete) example works:

Code


#include <Rcpp/Lightest>

// [[Rcpp::export]]
Rcpp::List listtest (int n) {

  Rcpp::IntegerVector foo(n);
  Rcpp::CharacterVector bar(n);
  Rcpp::CharacterVector baz(n);

  Rcpp::List ll = Rcpp::List::create(Rcpp::Named("foo") = foo,
                                     Rcpp::Named("bar") = bar);
  ll.push_back(baz);
  return ll;
}

/*** R
listtest(0)
*/

Output

> Rcpp::sourceCpp("/tmp/issue1232.cpp")

> listtest(0)
$foo
integer(0)

$bar
character(0)

[[3]]
character(0)

> 

There is a convenience converter (that IIRC actually calls back into R) that can turn that object into a df too:

Added code

// [[Rcpp::export]]
Rcpp::DataFrame dftest (int n) {

  Rcpp::IntegerVector foo(n);
  Rcpp::CharacterVector bar(n);
  Rcpp::CharacterVector baz(n);

  Rcpp::List ll = Rcpp::List::create(Rcpp::Named("foo") = foo,
                                     Rcpp::Named("bar") = bar);
  ll.push_back(baz);
  return Rcpp::DataFrame(ll);
}

Output

> Rcpp::sourceCpp("/tmp/issue1232.cpp")

> #listtest(0)
> dftest(0)
[1] foo          bar          character.0.
<0 rows> (or 0-length row.names)
> 

So maybe the matter is 'merely' the missing name?

Can you live with this 'list til I return' variant?

jeroen commented 2 years ago

Can you live with this 'list til I return' variant?

Yes I can do that, but what is the best way to pass stringsAsFactors = false for this case? Previously I had this:

https://github.com/ropensci/pdftools/blob/cf2099a56b873b43e2f542d9b05185baad5eda99/src/bindings.cpp#L237-L245

But then if we first store a list, the _["stringsAsFactors"] = false part needs to move to Rcpp::DataFrame(ll). Should I use Rcpp::DataFrame::create(ll, _["stringsAsFactors"] = false) ?

eddelbuettel commented 2 years ago

Not having tested this, the second should work.

That echos what I wrote earlier about collecting your dimensions and calling the instantiation at the last step. I am aware of a few packages doing just that, and setting stringsAsFactors that way.

eddelbuettel commented 2 years ago

I am not fully convinced we want to or need to support the use case you have here but the following diff achieves it:

modified   inst/include/Rcpp/DataFrame.h
@@ -140,10 +140,12 @@ namespace Rcpp{
                     max_rows = Rf_xlength(*it);
                 }
             }
-            for (it = Parent::begin(); it != Parent::end(); ++it) {
-                if (Rf_xlength(*it) == 0 || ( Rf_xlength(*it) > 1 && max_rows % Rf_xlength(*it) != 0 )) {
-                    // We have a column that is not an integer fraction of the largest
-                    invalid_column_size = true;
+            if (max_rows > 0) {
+                for (it = Parent::begin(); it != Parent::end(); ++it) {
+                    if (Rf_xlength(*it) == 0 || ( Rf_xlength(*it) > 1 && max_rows % Rf_xlength(*it) != 0 )) {
+                        // We have a column that is not an integer fraction of the largest
+                        invalid_column_size = true;
+                    }
                 }
             }
             if (invalid_column_size) {
Enchufa2 commented 2 years ago

Seems like a reasonable change to me to resemble what happens in R, and therefore to apply the principle of least surprise:

> foo <- integer(0)
> bar <- character(0)
> baz <- character(0)
> df <- data.frame(foo, bar)
> cbind(df, baz)
[1] foo bar baz
<0 rows> (or 0-length row.names)
eddelbuettel commented 2 years ago

I agree. I am inclined to test it. And when you run the example by @jeroen under the diff above you even get a reasonable filled-in name (as push_back does not set one):

edd@rob:~/git/rcpp(master)$ Rscript -e 'Rcpp::sourceCpp("/tmp/issue1232.cpp")'                           

> origissue(0)                                                                                           
[1] foo          bar          character.0.                                                               
<0 rows> (or 0-length row.names)                                                                         
edd@rob:~/git/rcpp(master)$
eddelbuettel commented 2 years ago

As expected, no issues in a reverse-dependency check. I just pushed the branch. I will a new unit test for good measure.