dmlc / xgboost

Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow
https://xgboost.readthedocs.io/en/stable/
Apache License 2.0

Non deterministic results with dask when using version 1.6.2 #8701

Closed: KaeganCasey closed this issue 1 year ago

KaeganCasey commented 1 year ago

Hello, when using xgboost with dask I am getting non-deterministic results even when I set the random_state parameter. This is very similar to #7927, which says it was fixed by a merge on Aug 11, 2022. Since version 1.6.2 was released on Aug 23, 2022, I assumed the fix would be included in that version, but every time I fit xgboost to the same dask array I get different feature importances and AUC scores. I am using distributed computation on a Kubernetes cluster for this experimentation. Due to other required packages I can only go up to xgboost version 1.6.2, but please let me know if this has been fixed in any later version, or if I am mistaken about anything I have said.

It is difficult for me to provide an example because we use an outside company's software to start up the dask cluster through an API, but any distributed dask fit should show the same behaviour.

Summary information:

xgboost version: 1.6.2
dask version: 2022.2.0
distributed version: 2021.11.2

trivialfis commented 1 year ago

It takes some care to get deterministic results with dask. The issue you referred to is specific to the local GPU cluster, where there's a single host and multiple GPUs sharing the same host address. With or without that fix, xgboost should be deterministic on other setups, provided the data input is deterministic.

In essence, one needs to ensure the data partitioning is exactly the same for each and every run.
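
As a rough sketch, with hypothetical data and a fixed, explicit partition count (rather than size-based heuristics), this could look like:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import xgboost as xgb
from distributed import Client, LocalCluster

client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

# Hypothetical stand-in for the real training data.
rng = np.random.default_rng(0)
pdf = pd.DataFrame({"x0": rng.normal(size=10_000),
                    "x1": rng.normal(size=10_000),
                    "y": rng.integers(0, 2, size=10_000)})

# Fix the partition layout explicitly instead of relying on size-based
# heuristics, so every run sees identical partitions.
ddf = dd.from_pandas(pdf, npartitions=8)

clf = xgb.dask.DaskXGBClassifier(n_estimators=50, random_state=0, tree_method="hist")
clf.client = client
clf.fit(ddf[["x0", "x1"]], ddf["y"])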

KaeganCasey commented 1 year ago

Hi @trivialfis
Thank you for the explanation!! I am using CPU, so you are correct that it is a different issue, but your explanation has at least helped me understand that deterministic results are possible with care.

Would you have any recommendations about how to ensure the partitioning is exactly the same? I am repartitioning the dask dataframe before I fit my model, but I am having it automatically partition the data into 500 MB chunks. I could change this to a fixed number of partitions if that would help ensure consistency.

Any help you can offer is much appreciated!! I can also post on the dask forum. Thank you!

KaeganCasey commented 1 year ago

Hi @trivialfis, Are you sure that if the partitions are exactly the same then xgboost will be deterministic? I was able to print out the number of rows in each partition for two separate executions of my process ("X_train partition summary" in each of the screenshots):

run 1: [screenshot of the X_train partition summary and AUC]
run 2: [screenshot of the X_train partition summary and AUC]

Here you can see that the AUC for the model is different even though the input rows for each of the partitions are exactly the same. These partitions have been sorted on a datetime index and then repartitioned, and they produced the same number of output rows, so I would expect their content to be identical.

I am hoping that there is still something I am doing wrong, since I think reproducible results are very important for dask/xgboost to be practical for machine learning.

Thank you so much for any help or clarification you can offer! Please let me know if I am mistaken in anything that I have said.

trivialfis commented 1 year ago

Let me try to reproduce it later; I'm currently on holiday. Could you please share a reproducible example that I can run? If not, could you please check that the models generated by each training session are the same by:

clf.save_model(f"model-{i}.json")  # Python: i is the index of the training run
sha256sum *.json                   # shell: compare the resulting digests
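
A rough equivalent entirely in Python, using hashlib instead of the sha256sum CLI (assuming two models saved as above):

import hashlib

def model_digest(path):
    # Hash the saved model file so two training runs can be compared byte-for-byte.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

print(model_digest("model-0.json") == model_digest("model-1.json"))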

Are you sure that if the partitions are exactly the same then xgboost will be deterministic?

So far, yes. We run benchmarks and tests. But you have opened an issue on this topic, so we will have to double-check. Also, 0.49 AUC seems wrong.

KaeganCasey commented 1 year ago

Hi @trivialfis, Thank you for taking the time to respond even while on your holiday; I really appreciate it! I have made significant improvements in my understanding of what is happening and would like to leave some of my findings here for when you return, or for anyone else who might be able to offer insight into what I am seeing.

TLDR: So far what I am seeing is that shutting down the dask cluster in between runs helps ensure reproducibility when reading from a parquet file and doing no preprocessing on the data. Once I start querying data directly or including more complicated preprocessing, the results keep changing (results = xgboost feature importance and AUC score), even though each partition of the dask dataframe is exactly the same from run to run (same order of rows, etc.).

Some Theories: Could this be caused by data/partitions being allocated to different workers by the querying/preprocessing step every time? Or something to do with work stealing?
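
If work stealing turns out to matter, one way to test that theory might be to disable it on the scheduler; a sketch, assuming the setting can be applied before the cluster starts (on a hosted cluster it would presumably have to be set on the scheduler, e.g. via an environment variable):

import dask
from distributed import Client, LocalCluster

# Disable work stealing so tasks stay on the worker they were first assigned to.
dask.config.set({"distributed.scheduler.work-stealing": False})

client = Client(LocalCluster(n_workers=4, threads_per_worker=1))

# Rough environment-variable equivalent for a scheduler started elsewhere:
#   DASK_DISTRIBUTED__SCHEDULER__WORK_STEALING="False"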

Reproducible Example and Responses: It is difficult for me to make a reproducible example for you because I cannot share the exact code, and it involves querying data from Trino as delayed pandas objects and then loading them into a dask dataframe using from_delayed(). This process is essentially what they mention here. We are also using a company that hosts dask as a service, so it is difficult for me to know exactly what its configuration is on top of Kubernetes.

could you please check the models generated by each training session are the same

When I compare the model files using your sha256sum method, they are indeed different even though the partitions of the dataframe are exactly the same (see below for more info).

Also, 0.49 AUC seems wrong.

I am running the process with a small dataset while I try to understand what is causing the variation from run to run, and this dataset is very unbalanced, which is likely leading to the poor performance.

More Information: Over the past week I have been saving the dask dataframe to parquet files at each step in my process and comparing the partitions to one another to ensure they are exactly the same. The order of rows within each partition is exactly the same, and I am still getting varying results.

I was able to make a small reproducible example where, instead of querying the data, preprocessing it, and then fitting the model, I read the preprocessed data from a parquet file and then fit the model. This helped me understand one source of variance in the xgboost results: if I didn't shut down the dask cluster in between runs, the results were different every time. Once I started shutting down the cluster between runs, the results for this small example became deterministic.

I then revisited my more complicated process, where I query the data and preprocess it. I tried shutting the cluster down between runs and there is still variation in the results, so I am now trying to isolate which step of my process creates this variation compared to the simpler example. To start, I stopped querying the data and instead saved the query output to a parquet file and read from that. This did reduce the variation, in the sense that the results now oscillate between a handful of options. I am now trying to isolate or remove parts of the preprocessing to find out where the rest of it could be coming from.
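
For reference, the small deterministic example follows roughly this pattern (hypothetical path, column name, and parameters, with a local cluster standing in for the hosted one):

import dask.dataframe as dd
import xgboost as xgb
from distributed import Client, LocalCluster

def one_run():
    # A fresh cluster per run; shutting it down afterwards is what made
    # the parquet-based example deterministic.
    client = Client(LocalCluster(n_workers=4, threads_per_worker=1))
    try:
        ddf = dd.read_parquet("preprocessed_train.parquet")  # hypothetical path
        clf = xgb.dask.DaskXGBClassifier(n_estimators=100, random_state=0, tree_method="hist")
        clf.client = client
        clf.fit(ddf.drop(columns=["label"]), ddf["label"])
        return clf.get_booster().get_score(importance_type="gain")
    finally:
        client.shutdown()

print(one_run() == one_run())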

In Summary: I'm sorry that I cannot provide a reproducible example at this time; I know that makes it almost impossible for you to help diagnose the problem. I will continue to brainstorm a way to create an example if possible. In the meantime, I am wondering what you think of my theories above about why this might be happening.

Thank you so much for all of your help and I hope you are enjoying your holiday!!

trivialfis commented 1 year ago

Thank you for sharing the detailed information! That provides lots of insight into the issue.

Indeed, getting deterministic results can sometimes be quite difficult, if not impossible. I have some observations myself from other workflows:

In the end, it's a delicate process to ensure reproducibility on distributed/parallel systems.

KaeganCasey commented 1 year ago

Hi @trivialfis, Thank you so much for your response and these pointers!!! I will give this all a try. Would you be able to elaborate a little on what you mean by caching the input to xgboost? Do you just mean writing your data to disk and then reading it back in to fit the model? Or is there an internal xgboost mechanism for caching data like the one you describe?

I am currently persisting the dask array into cluster memory before fitting xgboost, if that is helpful information.
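
Concretely, the persist step looks roughly like this (with hypothetical stand-ins for the real arrays):

import dask.array as da
from distributed import Client, LocalCluster, wait

client = Client(LocalCluster(n_workers=2))

# Hypothetical stand-ins for the real training data.
X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = da.random.randint(0, 2, size=10_000, chunks=1_000)

# Materialize the data in cluster memory before fitting, so the upstream
# graph (query, preprocessing) only executes once.
X, y = client.persist([X, y])
wait([X, y])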

I cannot thank you enough for giving me more options to at least try out! If I can't get it to work then it might just be the reality for this process.

trivialfis commented 1 year ago

Do you just mean writing your data to disk and then reading it back in to fit the model?

Yup, sometimes I cache the ETL result to avoid repeating potentially non-deterministic operations.
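
A rough sketch of that pattern (hypothetical paths, with a small stand-in for the real ETL output):

import pandas as pd
import dask.dataframe as dd

# Stand-in for the real ETL output (in practice: the Trino query plus preprocessing).
features = dd.from_pandas(pd.DataFrame({"x": range(1_000), "label": [0, 1] * 500}),
                          npartitions=4)

# Freeze the ETL result on disk so any potentially non-deterministic operations
# run only once; every training run then starts from the same bytes.
features.to_parquet("etl_cache.parquet")
train = dd.read_parquet("etl_cache.parquet")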

trivialfis commented 1 year ago

Closing as this is a feature request for dask instead of XGBoost. The non-deterministic behaviour comes from data distribution.