Tech specs:

| OS Name | OS Version | Processor | RAM | Run time | Memory used |
|---|---|---|---|---|---|
| Microsoft Windows 10 Pro | 10.0.18363 | Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz | 8 GB | Wall time: 10min 43s (pandas); 8min 36s (dask) | pandas: peak memory: 491.90 MiB, increment: 0.12 MiB; dask: peak memory: 3164.26 MiB, increment: 1989.64 MiB |
`Dask` is relatively faster than `Pandas` for loading and writing data because it reads the data in chunks and in parallel. Overall, compiling such big data into one CSV file and loading it every single time is not recommended; alternative file types such as `feather` or `parquet` should be considered instead.
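A minimal sketch of that suggestion, assuming the combined file is named `combined_data.csv` (a hypothetical name): convert it once to Parquet or Feather (both need `pyarrow` installed) and reload the binary file in later sessions instead of re-parsing the CSV.

```python
import pandas as pd

# Read the combined CSV once (file name is an assumption).
df = pd.read_csv("combined_data.csv")

# Save to columnar formats; both require pyarrow.
df.to_parquet("combined_data.parquet")
df.to_feather("combined_data.feather")

# Later sessions reload the binary file much faster than re-parsing the CSV.
df = pd.read_parquet("combined_data.parquet")
```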
OS Name and Version: Windows 10 Education 10.0.18363
Processor: AMD Ryzen 7 4800H 2.9 GHz, Octa Core
RAM: 8 GB DDR4 2666 MHz
Observations:
Wall time to get the file named data.zip: 2min 10s
Wall time to extract the file: 21.5s
Wall time to join the data together using pandas: 5min 53s
Pandas memory usage: peak memory: 432.39 MiB, increment: 0.23 MiB
Wall time to read the combined file: 1min 19s
Wall time to join the data together using dask: 6min 56s
Dask memory usage: peak memory: 5625.91 MiB, increment: 2447.35 MiB
Wall time to load the column(s) of interest (i.e., 'model'): 38.7s
Memory usage to load the column(s) of interest: peak memory: 1056.12 MiB, increment: 953.85 MiB
Wall time for loading in chunks: 1min 4s
Memory usage for loading in chunks: peak memory: 3367.20 MiB, increment: 2157.85 MiB
Wall time for loading using dask: 43.1s
Memory usage for loading using dask: peak memory: 3802.29 MiB, increment: 2507.14 MiB
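For reference, a rough sketch of the kind of pipeline being timed above; the URL, folder layout, and file names are assumptions, not the exact code used here.

```python
import glob
import zipfile

import pandas as pd
import requests

# Download the archive (URL is a placeholder).
url = "https://example.com/data.zip"
with open("data.zip", "wb") as f:
    f.write(requests.get(url).content)

# Extract the individual CSV files.
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("data")

# Join the data together with pandas and write one combined CSV.
files = glob.glob("data/*.csv")
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("combined_data.csv", index=False)
```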
Downloading and combining data
My OS specifications:
MacOS Catalina Version 10.15.6
Processor 1.8 GHz Dual-Core Intel Core i5
RAM 8 GB 1600 MHz DDR3
Operating system
I think available RAM influences the downloading process more than the operating system. I was able to download the data faster on a 16 GB RAM Windows system than on my 8 GB macOS machine.
dask vs pandas
Dask performed more efficiently in terms of memory and speed. It was faster to load the data in chunks.
Wall time (pandas): 5min 29sec
Wall time (dask): 4min 13sec
Memory (pandas): peak memory: 561.40 MiB, increment: 0.35 MiB
Memory (dask): peak memory: 4906.52 MiB, increment: 2452.80 MiB
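The wall-time and peak-memory/increment figures in this thread look like the output of IPython's `%time` magic and `memory_profiler`'s `%memit`; a small sketch of how such measurements can be taken (the file path is an assumption):

```python
# In a Jupyter/IPython session, with memory_profiler installed.
%load_ext memory_profiler
import pandas as pd

# Prints "Wall time: ..." for the statement.
%time df = pd.read_csv("combined_data.csv")

# Prints "peak memory: ... MiB, increment: ... MiB".
%memit pd.read_csv("combined_data.csv")
```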
Downloading and combining data
My OS specifications:
MacOS Catalina Version 10.15.7
Processor 1.4 GHz Quad-Core Intel Core i5
RAM 8 GB 2133 MHz LPDDR3
Operating system
It seems that the operating system and RAM can have a large influence on the time spent downloading and combining the data. Generally, macOS tended to run faster than the Windows systems here, and RAM appears to be an important factor affecting peak memory usage.
`dask` vs `pandas`
When the datasets are small, there is no significant difference in time or memory usage between `dask` and `pandas`. However, when we worked on much larger datasets, `dask` was more efficient, taking less time and using less memory while writing or reading.
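A hedged sketch of the same combine/read step done with `dask` instead of `pandas` (the paths and glob pattern are assumptions); `dask` reads the CSV parts lazily and in parallel.

```python
import dask.dataframe as dd

# Read all extracted CSV parts lazily and in parallel (pattern is an assumption).
ddf = dd.read_csv("data/*.csv")

# Write everything back out as a single combined CSV.
ddf.to_csv("combined_data_dask.csv", single_file=True)

# Reading back stays lazy; len() triggers the actual row-count computation.
combined = dd.read_csv("combined_data_dask.csv")
print(len(combined))
```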
EDA
`float64` vs `float32`
As compared to `float64`, `float32` was able to reduce the memory usage to around half, because it stores numbers in a 32-bit format.
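A small sketch of that downcasting step, using a hypothetical DataFrame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data (all float64 columns).
df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list("abcd"))

before = df.memory_usage(deep=True).sum() / 1024**2
df32 = df.astype("float32")          # half the storage per value
after = df32.memory_usage(deep=True).sum() / 1024**2

print(f"float64: {before:.1f} MiB, float32: {after:.1f} MiB")
```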
Loading in chunks & `dask`
Loading in chunks and loading via `dask` produced exactly the same output, and both appeared to be efficient enough. The time spent by the `dask` method is slightly shorter compared to loading in chunks, and it tends to be more efficient in memory usage as well.
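To make the "exactly the same output" point concrete, a sketch that aggregates the `model` column both ways and checks the results agree; the file path and column name are assumptions.

```python
import pandas as pd
import dask.dataframe as dd

path = "combined_data.csv"  # assumed path to the combined file

# Chunked pandas: accumulate value counts chunk by chunk.
chunk_counts = pd.Series(dtype="float64")
for chunk in pd.read_csv(path, usecols=["model"], chunksize=1_000_000):
    chunk_counts = chunk_counts.add(chunk["model"].value_counts(), fill_value=0)

# Dask: the same aggregation, evaluated in parallel.
dask_counts = dd.read_csv(path, usecols=["model"])["model"].value_counts().compute()

# Both routes should give identical counts per model.
assert (chunk_counts.sort_index().astype(int).to_dict()
        == dask_counts.sort_index().astype(int).to_dict())
```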