dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Excessive memory usage in logistic regression #717

Open Plozano94 opened 4 years ago

Plozano94 commented 4 years ago
from dask_yarn import YarnCluster
from dask.distributed import Client

# Create a cluster where each worker has one vcore and 50 GiB of memory
cluster = YarnCluster(environment='s3://openbank-ds-playground/environments/conda/gru13-07.tar.gz',
                      worker_vcores=1,
                      worker_memory="50GiB",
                      deploy_mode='local',
                      dashboard_address=':6689',
                     )

cluster.adapt(minimum=4, maximum=10)
client = Client(cluster)

Hi guys, I've noticed some weird behaviour in my Dask application. I'm running a logistic regression with dask-ml on the YarnCluster created above (on top of EMR), and each worker uses roughly 15 times the memory of the dataset, even though I'm giving each worker only one vcore. I've tested with different dataset sizes and memory usage always ends up at 10-20 times the size of the dataset. The data is loaded from S3 with pandas and s3fs. I can't figure out why this is happening. Could you help me?
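For context, the loading pattern looks roughly like the sketch below (the bucket path, column names, and partition count are placeholders, not the real ones): the full dataset is read into pandas on the client and then handed to dask-ml.

import pandas as pd
import dask.dataframe as dd

# Hypothetical path and target column, for illustration only.
# Reading with pandas materialises the whole dataset in client memory first;
# dd.from_pandas then splits it into partitions that get scattered to workers.
pdf = pd.read_parquet("s3://my-bucket/training-data.parquet")  # uses s3fs under the hood
ddf = dd.from_pandas(pdf, npartitions=8)

X = ddf[[c for c in ddf.columns if c != "target"]].to_dask_array(lengths=True)
y = ddf["target"].to_dask_array(lengths=True)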

Environment:

jcrist commented 4 years ago

Hi @Plozano94, I suspect this has more to do with the application (dask-ml logistic regression) than where it's running (dask-yarn). Dask workloads are generally agnostic to the backing cluster manager. I've moved this issue to the dask-ml repo to discuss further (cc @TomAugspurger).

TomAugspurger commented 4 years ago

@Plozano94 can you provide a performance report? https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports
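For reference, a report can be captured by wrapping the training call in the performance_report context manager from distributed (the filename and the lr/X/y names below are just examples, standing in for your own script):

from dask.distributed import performance_report

# Writes an HTML file with the task stream, worker memory, and profiling
# data for everything executed inside the block.
with performance_report(filename="dask-lr-report.html"):
    lr.fit(X, y)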

adriankastrau-kinesso commented 4 years ago

@TomAugspurger I've noticed that as well. When I run logistic regression with random search using the scikit-learn estimator, it runs fine on workers with 14 GB of RAM. On the other hand, when I run the same process/code with the dask-ml implementation, 14 GB per worker isn't sufficient. I didn't run that on top of YARN, though, and I used a dummy dataset with 500 features and 600,000 observations of type int8.
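A minimal reproduction along those lines might look like the sketch below (chunk sizes and solver settings are guesses, not the exact setup described above):

import dask.array as da
from dask_ml.linear_model import LogisticRegression

# Dummy int8 dataset: 600,000 observations x 500 features, split into
# row-wise chunks so dask-ml can fit across partitions.
X = da.random.randint(0, 2, size=(600_000, 500), chunks=(50_000, 500)).astype("int8")
y = da.random.randint(0, 2, size=(600_000,), chunks=(50_000,))

# X is ~300 MB in memory (600,000 * 500 * 1 byte), so per-worker usage far
# beyond a few GB would point at intermediate copies made during fitting.
lr = LogisticRegression(max_iter=10)
lr.fit(X, y)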

adriankastrau-kinesso commented 4 years ago

@TomAugspurger any updates on that?

TomAugspurger commented 4 years ago

Nope.
