Tech specs:

| OS Name | OS Version | Processor | RAM | Run time | Memory used |
|---|---|---|---|---|---|
| Microsoft Windows 10 Pro | 10.0.18363 | Intel(R) Core(TM) i5-10210U CPU @ 1.60GHz | 8 GB | Wall time: 10min 43s (pandas); 8min 36s (dask) | pandas: peak memory: 491.90 MiB, increment: 0.12 MiB; dask: peak memory: 3164.26 MiB, increment: 1989.64 MiB |
`Dask` is relatively faster than `Pandas` for loading and writing data because it reads the data in chunks and in parallel. Overall, compiling such big data into one CSV file and loading it every single time is not recommended; alternative file types such as `feather` or `parquet` should be considered instead.
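A minimal sketch of that suggestion, assuming the combined file is named `combined_data.csv` (a hypothetical name): convert it once to Parquet or Feather (both need `pyarrow` installed) and reload the binary file in later sessions instead of re-parsing the CSV.

```python
import pandas as pd

# Read the combined CSV once (file name is an assumption).
df = pd.read_csv("combined_data.csv")

# Save to columnar formats; both require pyarrow.
df.to_parquet("combined_data.parquet")
df.to_feather("combined_data.feather")

# Later sessions reload the binary file much faster than re-parsing the CSV.
df = pd.read_parquet("combined_data.parquet")
```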
OS Name and Version: Windows 10 Education 10.0.18363
Processor: AMD Ryzen 7 4800H 2.9 GHz, Octa Core
RAM: 8 GB DDR4 2666 MHz
Observations:
Wall time to get the file named data.zip: 2min 10s
Wall time to extract the file: 21.5s
Wall time to join the data together using pandas: 5min 53s
Pandas memory usage: peak memory: 432.39 MiB, increment: 0.23 MiB
Wall time to read the combined file: 1min 19s
Wall time to join the data together using dask: 6min 56s
Dask memory usage: peak memory: 5625.91 MiB, increment: 2447.35 MiB
Wall time to load the column(s) of interest (i.e., 'model'): 38.7s
Memory usage to load the column(s) of interest: peak memory: 1056.12 MiB, increment: 953.85 MiB
Wall time for loading in chunks: 1min 4s
Memory usage for loading in chunks: peak memory: 3367.20 MiB, increment: 2157.85 MiB
Wall time for loading using dask: 43.1s
Memory usage for loading using dask: peak memory: 3802.29 MiB, increment: 2507.14 MiB
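For reference, a rough sketch of the kind of pipeline being timed above; the URL, folder layout, and file names are assumptions, not the exact code used here.

```python
import glob
import zipfile

import pandas as pd
import requests

# Download the archive (URL is a placeholder).
url = "https://example.com/data.zip"
with open("data.zip", "wb") as f:
    f.write(requests.get(url).content)

# Extract the individual CSV files.
with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("data")

# Join the data together with pandas and write one combined CSV.
files = glob.glob("data/*.csv")
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv("combined_data.csv", index=False)
```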
Downloading and combining data
My OS specifications:
MacOS Catalina Version 10.15.6
Processor 1.8 GHz Dual-Core Intel Core i5
RAM 8 GB 1600 MHz DDR3
Operating system
I think available RAM influences the downloading process more than the operating system. I was able to download the data faster on a 16 GB RAM Windows system than on my 8 GB macOS machine.
dask vs pandas
Dask performed more efficiently in terms of memory and speed. It was faster to load the data in chunks.
Wall time (pandas): 5min 29sec
Wall time (dask): 4min 13sec
Memory (pandas): peak memory: 561.40 MiB, increment: 0.35 MiB
Memory (dask): peak memory: 4906.52 MiB, increment: 2452.80 MiB
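The wall-time and peak-memory/increment figures in this thread look like the output of IPython's `%time` magic and `memory_profiler`'s `%memit`; a small sketch of how such measurements can be taken (the file path is an assumption):

```python
# In a Jupyter/IPython session, with memory_profiler installed.
%load_ext memory_profiler
import pandas as pd

# Prints "Wall time: ..." for the statement.
%time df = pd.read_csv("combined_data.csv")

# Prints "peak memory: ... MiB, increment: ... MiB".
%memit pd.read_csv("combined_data.csv")
```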
Downloading and combining data
My OS specifications:
MacOS Catalina Version 10.15.7
Processor 1.4 GHz Quad-Core Intel Core i5
RAM 8 GB 2133 MHz LPDDR3
Operating system
It seems that the operating system and RAM can have a large influence on the time spent downloading and combining the data. Generally, macOS tended to run faster than the Windows systems here, and RAM appears to be an important factor affecting peak memory usage.
`dask` vs `pandas`
When the datasets are small, there is no significant difference in time or memory usage between `dask` and `pandas`. However, when we worked on much larger datasets, `dask` was more efficient, taking less time and using less memory while writing or reading.
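A hedged sketch of the same combine/read step done with `dask` instead of `pandas` (the paths and glob pattern are assumptions); `dask` reads the CSV parts lazily and in parallel.

```python
import dask.dataframe as dd

# Read all extracted CSV parts lazily and in parallel (pattern is an assumption).
ddf = dd.read_csv("data/*.csv")

# Write everything back out as a single combined CSV.
ddf.to_csv("combined_data_dask.csv", single_file=True)

# Reading back stays lazy; len() triggers the actual row-count computation.
combined = dd.read_csv("combined_data_dask.csv")
print(len(combined))
```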
EDA
`float64` vs `float32`
As compared to `float64`, `float32` was able to reduce the memory usage to around half, because it stores numbers in a 32-bit format.
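A small sketch of that downcasting step, using a hypothetical DataFrame in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the real data (all float64 columns).
df = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list("abcd"))

before = df.memory_usage(deep=True).sum() / 1024**2
df32 = df.astype("float32")          # half the storage per value
after = df32.memory_usage(deep=True).sum() / 1024**2

print(f"float64: {before:.1f} MiB, float32: {after:.1f} MiB")
```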
Loading in chunks & `dask`
Loading in chunks and loading via `dask` produced exactly the same output, and both appeared to be efficient enough. The time spent by the `dask` method is slightly shorter compared to loading in chunks, and it tends to be more efficient in memory usage as well.
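To make the "exactly the same output" point concrete, a sketch that aggregates the `model` column both ways and checks the results agree; the file path and column name are assumptions.

```python
import pandas as pd
import dask.dataframe as dd

path = "combined_data.csv"  # assumed path to the combined file

# Chunked pandas: accumulate value counts chunk by chunk.
chunk_counts = pd.Series(dtype="float64")
for chunk in pd.read_csv(path, usecols=["model"], chunksize=1_000_000):
    chunk_counts = chunk_counts.add(chunk["model"].value_counts(), fill_value=0)

# Dask: the same aggregation, evaluated in parallel.
dask_counts = dd.read_csv(path, usecols=["model"])["model"].value_counts().compute()

# Both routes should give identical counts per model.
assert (chunk_counts.sort_index().astype(int).to_dict()
        == dask_counts.sort_index().astype(int).to_dict())
```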