read.rdf cannot handle rdfs with scalar slots

rabutler commented 6 years ago

Reopening b/c previous fix added a ton of overhead to read.rdf()

rabutler commented 6 years ago

It is the following new statement that adds the overhead:

rdf_tmp <- read_rdf_header(rdf.mat, rdf.obj$position, "END_COLUMN")

Current solution

I think we can improve further by dropping all of the already-passed positions from end_col_i each time this is called, since the lookup below still takes the majority of the time in the function call.

ec_i <- end_col_i[Position(function(x) x > rdf.obj$position, end_col_i)] + 1
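A minimal sketch of the pruning idea described above, on toy data. The helper name next_end_col, the toy rdf_mat vector, and the driving loop are assumptions made for illustration; they are not taken from the package source.

```r
# Toy stand-in for the parsed RDF lines; real files are far longer.
rdf_mat <- c("slot: a", "1", "2", "END_COLUMN",
             "slot: b", "3", "END_COLUMN",
             "slot: c", "4", "5", "END_COLUMN")

# Single call to which(): positions of every END_COLUMN keyword.
end_col_i <- which(rdf_mat == "END_COLUMN")

# Hypothetical helper: return the line after the next END_COLUMN together
# with the pruned index vector, so later searches skip used markers.
next_end_col <- function(end_col_i, position) {
  k <- Position(function(x) x > position, end_col_i)
  if (is.na(k)) {
    # No END_COLUMN left after `position`; leave the indices untouched.
    return(list(ec_i = NA_real_, end_col_i = end_col_i))
  }
  list(ec_i = end_col_i[k] + 1, end_col_i = end_col_i[-seq_len(k)])
}

# Walk through the toy file, recording where each column ends.
position <- 1
visited <- numeric(0)
while (length(end_col_i) > 0) {
  res <- next_end_col(end_col_i, position)
  visited <- c(visited, res$ec_i)
  end_col_i <- res$end_col_i
  position <- res$ec_i
}
visited
```

Because used indices are dropped, Position() normally succeeds on the first element it inspects, which is the effect the pruning is meant to achieve.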

Older tries

Tried the following instead, but it's actually much slower:

find_next_keyword <- function(rdf_mat, cur_pos, keyword)
{
  match(keyword, rdf_mat[cur_pos:length(rdf_mat)]) + cur_pos - 1
}

find_next_keyword(rdf.mat, rdf.obj$position, "END_COLUMN")
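One plausible reason for the slowdown (an assumption, not something profiled in this thread): every call builds a fresh copy of the tail of the vector and match() then has to process that whole copy. The sketch below, on toy data, only checks that this approach and a precomputed-index lookup return the same positions. find_next_indexed is a hypothetical name, and it uses >= rather than > because find_next_keyword() also considers cur_pos itself.

```r
# Toy stand-in for the parsed RDF lines.
rdf_mat <- c("a", "END_COLUMN", "b", "c", "END_COLUMN", "d", "END_COLUMN")

# Approach from the thread: copy the tail, then match() within it.
find_next_keyword <- function(rdf_mat, cur_pos, keyword) {
  match(keyword, rdf_mat[cur_pos:length(rdf_mat)]) + cur_pos - 1
}

# Precomputed alternative: locate every keyword once, then search only
# the short index vector on each lookup.
end_col_i <- which(rdf_mat == "END_COLUMN")
find_next_indexed <- function(end_col_i, cur_pos) {
  end_col_i[Position(function(x) x >= cur_pos, end_col_i)]
}

starts <- c(1, 3, 6)
by_match <- vapply(
  starts,
  function(p) find_next_keyword(rdf_mat, p, "END_COLUMN"),
  numeric(1)
)
by_index <- vapply(
  starts,
  function(p) find_next_indexed(end_col_i, p),
  numeric(1)
)

# Both approaches should agree on every starting position.
stopifnot(all(by_match == by_index))
```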

rabutler commented 6 years ago

microbenchmark::microbenchmark() results:

New Laptop:

expr                min         lq       mean     median         uq        max  neval
read.rdf2      1.267439   1.405701   1.532922   1.494721   1.615299    2.13648     20
read_rdf         23.995   27.09833   28.59642   28.50726   29.75776   33.81968     20
read_rdf(v2)   10.70988   11.10502   11.47762   11.49724   11.69881   12.95315     20
read_rdf(v3)   3.368835   3.421129    3.60164    3.51044   3.713032   4.203365     20
read_rdf(v4)  0.7242288  0.7505287  0.8127688  0.8031222  0.8654116  0.9458311     20

read.rdf2() = version from RWDataPlyr v0.5.0
read_rdf() = first version for addressing this issue
read_rdf(v2) = added fixed = TRUE to strsplit() and using new RStudio
read_rdf(v3) = added a single call to which() and use Position() to find the first match
read_rdf(v4) = now removes the END_COLUMN indices after they are used, so Position() has to search less before finding the next index

BoulderCodeHub / RWDataPlyr

read.rdf cannot handle rdfs with scalar slots #52