vla6 opened this issue 2 years ago · Status: Open
As an update to this issue: I have opened support case 2203020040005244.
I am being told that there is no guarantee against getting extra rows. That is acceptable, but it should be documented. It also requires after-the-fact de-duplication, which is not always simple on a large dataset.
I am also hoping that the run metrics will reflect the number of mini batches actually run. The metrics currently reflect the mini batches that were supposed to run, so the reported count is lower than the actual count.
Using AzureML, ParallelRunStep with a CSV file input, I run Shapley explanations (the SHAP package) in parallel on a cluster. I see that the number of mini batches actually run is greater than expected, leading to extra rows in the output file.
The cluster is a 10-node Standard_D64_v3, created 2/28/2022 (I have destroyed and recreated the cluster and seen the same issue).
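To make "more mini batches than expected" concrete, here is a minimal sketch of how the expected count is derived. The row count and `mini_batch_size` below are hypothetical placeholders, not the actual values from my run:

```python
import math

# Hypothetical numbers for illustration only (not from the actual run):
total_rows = 100_000        # rows in the input CSV
mini_batch_size = 1_000     # ParallelRunStep mini_batch_size setting

# If each mini batch runs exactly once, this is how many should appear.
expected_batches = math.ceil(total_rows / mini_batch_size)
print(expected_batches)  # 100
```

Any count logged above this expected value indicates mini batches that were run more than once, which is what produces the duplicate output rows.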
The metric Total Mini Batches is present by default. I also log the batch size in my script, so I get one logged metric per mini batch.
Here are my observations:
I work around this by forcing uniqueness in the output file; however, this is not ideal for large datasets.
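The workaround amounts to de-duplicating the output after the fact. A minimal sketch with pandas, assuming the output has some key column that uniquely identifies each input row (the column names here are hypothetical, not from the actual pipeline):

```python
import pandas as pd

# Toy stand-in for the ParallelRunStep output file; row 2 is duplicated
# as if its mini batch had run twice.
out = pd.DataFrame({
    "row_id": [1, 2, 2, 3],                 # hypothetical unique key per input row
    "shap_value": [0.10, 0.25, 0.25, -0.05],
})

# Force uniqueness, keeping the first copy of each row.
deduped = out.drop_duplicates(subset=["row_id"], keep="first")
print(len(deduped))  # 3
```

On a large dataset this full-file de-duplication pass is exactly the extra cost described above, which is why documenting the duplicate-row behavior up front would help.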