[Open] Plozano94 opened this issue 4 years ago
Hi @Plozano94, I suspect this has more to do with the application (dask-ml logistic regression) than where it's running (dask-yarn). Dask workloads are generally agnostic to the backing cluster manager. I've moved this issue to the dask-ml repo to discuss further (cc @TomAugspurger).
@Plozano94 can you provide a performance report? https://distributed.dask.org/en/latest/diagnosing-performance.html#performance-reports
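For anyone following along, a minimal sketch of how such a report can be produced with `distributed`'s `performance_report` context manager (the in-process client and the trivial submitted task are placeholders for the actual dask-ml fit):

```python
# Hypothetical sketch: wrap the workload in performance_report so the
# scheduler records task timings, transfers, and worker memory into an
# HTML file that can be shared on the issue.
from dask.distributed import Client, performance_report

client = Client(processes=False)  # in-process cluster, for illustration only

with performance_report(filename="lr-report.html"):
    # The real dask-ml LogisticRegression fit would go here, e.g.:
    # model.fit(X, y)
    client.submit(sum, [1, 2, 3]).result()  # placeholder task

client.close()
```

On exit from the `with` block, `lr-report.html` is written to the working directory and can be attached to the issue.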
@TomAugspurger I've noticed that as well. When I run logistic regression with random search using the scikit-learn estimator, it runs well on workers with 14 GB of RAM. When I run the same process/code with the dask-ml implementation, 14 GB per worker was not sufficient. I didn't run that on top of YARN, though, and I used a dummy dataset containing 500 features and 600,000 observations of type int8.
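For reference, that dummy dataset is small in its raw int8 form; a back-of-the-envelope sketch (the float64 upcast factor is an assumption about what the solver does internally, not something confirmed in this thread):

```python
import numpy as np

# Hypothetical reproduction of the dummy dataset described above:
# 600,000 observations x 500 features stored as int8 (1 byte each).
n_obs, n_features = 600_000, 500
X = np.zeros((n_obs, n_features), dtype=np.int8)

int8_mb = X.nbytes / 2**20   # raw size of the int8 array in MiB
float64_mb = int8_mb * 8     # assumed upcast to float64 during fitting

print(round(int8_mb), round(float64_mb))  # 286 2289
```

So even with an 8x upcast the data itself is only a couple of GiB, well under 14 GB per worker, which suggests the blow-up comes from intermediate copies rather than the input alone.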
@TomAugspurger any updates on that?
Nope.
On Mon, Sep 7, 2020 at 5:35 AM Adrian Kastrau notifications@github.com wrote:
Hi guys, I've noticed weird behaviour in my dask application. I'm running a logistic regression with dask-ml on the YarnCluster I created above, on top of an EMR architecture, and each worker takes roughly 15 times the memory of the dataset, even though I'm specifying only 1 vcore per worker. I've tested with different dataset sizes and always end up at 10-20 times the size of the dataset. The data is loaded from S3 through pandas and s3fs. I can't figure out why this is happening. Could you help me?
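One thing worth ruling out when a 10-20x multiplier shows up: part of it can come from dtype upcasting alone, before any algorithmic copies. A minimal sketch (the small frame here is a stand-in for the real S3-loaded data):

```python
import numpy as np
import pandas as pd

# Hypothetical check: compare the DataFrame's reported memory before and
# after an upcast to float64, which linear solvers commonly require.
df = pd.DataFrame(np.zeros((1000, 50), dtype=np.int8))

raw_mb = df.memory_usage(index=False, deep=True).sum() / 2**20
upcast_mb = df.astype("float64").memory_usage(index=False, deep=True).sum() / 2**20

print(round(upcast_mb / raw_mb))  # 8 -- the upcast alone is an 8x factor
```

An 8x dtype factor plus one or two intermediate copies during fitting would land in the observed 10-20x range, though that remains a hypothesis until a performance report confirms where the memory goes.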
Environment: