Closed rabutler closed 6 years ago
It is the following new statement that adds the overhead:
rdf_tmp <- read_rdf_header(rdf.mat, rdf.obj$position, "END_COLUMN")
Current solution
I think we can improve further by dropping the already-used positions from end_col_i each time this is called, since this lookup still takes the majority of the time in the function call.
ec_i <- end_col_i[Position(function(x) x > rdf.obj$position, end_col_i)] + 1
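A minimal sketch of that lookup on toy data, assuming end_col_i is built once with a single which() call (the v3 change listed in the results). The vector contents, the position value, and the name first_i are made up for illustration and are not taken from the package source.

```r
# Toy stand-in for rdf.mat: one element per line of the rdf file (hypothetical data)
rdf_mat <- c("object_name: A", "1", "2", "END_COLUMN",
             "object_name: B", "3", "4", "END_COLUMN")

# Locate every END_COLUMN once, instead of rescanning the file on each call
end_col_i <- which(rdf_mat == "END_COLUMN")

# Hypothetical current read position (start of the second block)
position <- 5

# Position() returns the index *into end_col_i* of the first element that
# satisfies the predicate, so it is used to subset end_col_i
first_i <- Position(function(x) x > position, end_col_i)

# The original line adds 1 to the matched index; that offset is kept as-is here
ec_i <- end_col_i[first_i] + 1
```

Position() stops at the first match, so the cost of each call depends on how many already-used indices sit in front of the one being looked for.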
Older tries
Tried the following instead, but it's actually much slower:
find_next_keyword <- function(rdf_mat, cur_pos, keyword) {
  match(keyword, rdf_mat[cur_pos:length(rdf_mat)]) + cur_pos - 1
}
find_next_keyword(rdf.mat, rdf.obj$position, "END_COLUMN")
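For comparison, a small sketch showing that the match()-based try and a precomputed index lookup return the same index on toy data. One plausible reason for the slowdown (an assumption, not something measured here) is that rdf_mat[cur_pos:length(rdf_mat)] allocates a new copy of the tail of the vector on every call. The data and the names older and newer are hypothetical.

```r
# Hypothetical toy data: one element per line of the rdf file
rdf_mat <- c("a", "END_COLUMN", "b", "c", "END_COLUMN", "d")
cur_pos <- 3

# Older try: subset the tail of the vector, then match within the copy
find_next_keyword <- function(rdf_mat, cur_pos, keyword) {
  match(keyword, rdf_mat[cur_pos:length(rdf_mat)]) + cur_pos - 1
}
older <- find_next_keyword(rdf_mat, cur_pos, "END_COLUMN")

# Precomputed alternative: scan once, then search only the short index vector.
# Uses >= so that, like the match() version, a keyword at cur_pos itself counts.
end_col_i <- which(rdf_mat == "END_COLUMN")
newer <- end_col_i[Position(function(x) x >= cur_pos, end_col_i)]
```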
microbenchmark::microbenchmark() results:
New Laptop:
expr          min        lq         mean       median     uq         max        neval
read.rdf2     1.267439   1.405701   1.532922   1.494721   1.615299   2.13648    20
read_rdf      23.995     27.09833   28.59642   28.50726   29.75776   33.81968   20
read_rdf(v2)  10.70988   11.10502   11.47762   11.49724   11.69881   12.95315   20
read_rdf(v3)  3.368835   3.421129   3.60164    3.51044    3.713032   4.203365   20
read_rdf(v4)  0.7242288  0.7505287  0.8127688  0.8031222  0.8654116  0.9458311  20
read.rdf2() = version from RWDataPlyr v0.5.0
read_rdf() = first version for addressing this issue
read_rdf(v2) = added fixed = TRUE to strsplit and using new RStudio
read_rdf(v3) = added a single call to which and use Position() to find the first match
read_rdf(v4) = now remove the END_COLUMN indices after they are used, so Position has to search less before finding the next index
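The v4 idea can be sketched roughly as follows, assuming the used indices are dropped from the front of end_col_i after each lookup. The helper next_end_col, its return structure, and the toy data are hypothetical and only illustrate the technique; they are not the package's actual code.

```r
# Hypothetical toy data: one element per line of the rdf file
rdf_mat <- c("a", "END_COLUMN", "b", "c", "END_COLUMN", "d", "END_COLUMN")
end_col_i <- which(rdf_mat == "END_COLUMN")

# Hypothetical helper: find the next END_COLUMN index after `position` and
# return it together with the indices that are still unused
next_end_col <- function(end_col_i, position) {
  first_i <- Position(function(x) x > position, end_col_i)
  if (is.na(first_i)) {
    # Position() returns NA when nothing matches
    return(list(index = NA_integer_, remaining = integer(0)))
  }
  list(
    index = end_col_i[first_i],
    # drop everything up to and including the match
    remaining = end_col_i[-seq_len(first_i)]
  )
}

position <- 1
found <- integer(0)
while (length(end_col_i) > 0) {
  res <- next_end_col(end_col_i, position)
  if (is.na(res$index)) break
  found <- c(found, res$index)
  position <- res$index
  end_col_i <- res$remaining   # later searches start from a shorter vector
}
```

Because the matched index is always near the front of the remaining vector, each Position() call in this sketch inspects only an element or two instead of walking past every earlier END_COLUMN.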
Reopening b/c the previous fix added a ton of overhead to read.rdf().