ARS-toscana / CreateSpells

Improve int/char list transformation to date format #4

Closed DavideMessinaARS closed 3 years ago

DavideMessinaARS commented 3 years ago

I ran the function on a dataset created by combining 10000 copies of OBSERVATION_PERSON (about 30M records in total).

The overall results are:

| expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time |
|---|---|---|---|---|---|---|---|---|
| Modified_createspell() | 1.86m | 1.92m | 0.008679 | 29.7GB | 0.085054 | 5 | 49 | 9.6m |

The results from profiling the code are:

| function | time |
|---|---|
| total | 78.45s |
| lubridate::ymd | 54.44s |
| `[.data.table` | 11.49s |
| order | 2.45s |
| others | ... |

Since ymd() is the function with the most impact right now, it makes sense to explore alternatives. (For now at least, since I haven't run CreateSpell on a real dataset yet.)
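
For reference, the conversion under discussion can be sketched in base R without lubridate. This is only an illustration under assumptions: the issue does not show the real column layout, so the `yyyymmdd` integer encoding, the ISO-style strings, and the variable names below are hypothetical.

```r
# Hypothetical inputs; the issue does not show the real data.
num_dates <- c(20200131L, 20211205L)        # assumed yyyymmdd integers
chr_dates <- c("2020-01-31", "2021-12-05")  # assumed ISO 8601 strings

# Base R parses dates from character, so integers are converted to text first.
from_int <- as.Date(as.character(num_dates), format = "%Y%m%d")

# Passing the format explicitly means as.Date() does not have to guess it.
from_chr <- as.Date(chr_dates, format = "%Y-%m-%d")

# Check that both routes produce the same Date vector.
same_result <- identical(from_int, from_chr)
```

Whether this is faster than `lubridate::ymd()` on the real data is not something this sketch can show; it only makes the two input types and the target type concrete.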

DavideMessinaARS commented 3 years ago

Running different functions on two vectors of 1M elements, one integer and the other character, gives:

```r
# Benchmarked expressions. chr_dates is the character vector, num_dates the
# integer vector, and fmt the date format string passed to the parsers.
base_str = base::as.Date(chr_dates),
basef_str = base::as.Date(chr_dates, fmt),
lub1_str = lubridate::as_date(chr_dates),
lub2_str = lubridate::ymd(chr_dates),
lub2_int = lubridate::ymd(num_dates),
idat_str = data.table::as.IDate(chr_dates),
idatf_str = data.table::as.IDate(chr_dates, fmt),
fast_ = fasttime::fastPOSIXct(chr_dates),
fastd = as.Date(fasttime::fastPOSIXct(chr_dates))
```

| expression | min | median | itr/sec | mem_alloc | gc/sec | n_itr | n_gc | total_time |
|---|---|---|---|---|---|---|---|---|
| base_str | 11.16s | 11.51s | 0.0865606 | 118.3MB | 0.1731212 | 10 | 20 | 1.92m |
| basef_str | 1.03s | 1.12s | 0.8881133 | 83.9MB | 0.8881133 | 10 | 10 | 11.26s |
| lub1_str | 386.46ms | 425.67ms | 2.3032683 | 128.3MB | 3.4549025 | 10 | 15 | 4.34s |
| lub2_str | 398.74ms | 490.69ms | 1.9532972 | 134.6MB | 3.3206053 | 10 | 17 | 5.12s |
| lub2_int | 1.7s | 1.82s | 0.5359279 | 241.4MB | 1.5005982 | 10 | 28 | 18.66s |
| idat_str | 10.94s | 11.2s | 0.0892033 | 122.1MB | 0.3568132 | 3 | 12 | 33.63s |
| idatf_str | 965.49ms | 992.24ms | 0.9820532 | 87.7MB | 1.4730798 | 4 | 6 | 4.07s |
| fast_ | 88.63ms | 102.25ms | 9.5453584 | 15.3MB | 0.0000000 | 10 | 0 | 1.05s |
| fastd | 108.85ms | 120.77ms | 8.1918708 | 30.5MB | 2.0479677 | 8 | 2 | 976.58ms |

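
In the figures above, the base R and IDate calls that receive an explicit format (`basef_str`, `idatf_str`) have medians roughly ten times lower than their counterparts without one. A comparison of that kind could be set up with base R alone, as in the rough sketch below. The generated test data, the vector size, and the value of `fmt` are assumptions, since the issue does not state them, and timings on a small vector will not match the table.

```r
# Hypothetical test data: 10,000 ISO-formatted date strings (the issue used 1M).
set.seed(1)
pool <- seq(as.Date("2000-01-01"), as.Date("2020-12-31"), by = "day")
chr_dates <- format(sample(pool, 1e4, replace = TRUE))
fmt <- "%Y-%m-%d"  # assumed value; the issue does not say what fmt was

# Time the call that lets as.Date() pick the format and the call that is told.
t_guess <- system.time(res_guess <- as.Date(chr_dates))[["elapsed"]]
t_fmt   <- system.time(res_fmt   <- as.Date(chr_dates, fmt))[["elapsed"]]

# Both calls should return the same dates; only the elapsed time may differ.
stopifnot(identical(res_guess, res_fmt))
timings <- c(guess = t_guess, explicit = t_fmt)
```

A single `system.time()` call is far noisier than the repeated runs summarised in the table, so a sketch like this is only useful as a quick sanity check before a proper benchmark.
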
DavideMessinaARS commented 3 years ago

As I expected, running on a real dataset gives a very different result.

For improvements that target the real bottlenecks, see https://github.com/ARS-toscana/CreateSpells/pull/5/

This topic is no longer relevant, so I'll close it.