vla6 opened this issue 2 years ago · Status: Open
As an update to this issue: I have opened support case 2203020040005244.
I am being told that there is no guarantee against getting extra rows. That is acceptable, but it should be documented. It also requires after-the-fact de-duplication, which is not always simple on a large dataset.
I am also hoping that the run metrics will reflect the number of mini batches actually run. The metrics currently reflect the mini batches that were supposed to run, so the reported count is lower than the actual count.
Using AzureML, ParallelRunStep with a CSV file input, I run Shapley explanations (the SHAP package) in parallel on a cluster. I see that the number of mini batches actually run is greater than expected, leading to extra rows in the output file.
The cluster is a 10-node Standard_D64_v3, created 2/28/2022 (I have destroyed and recreated the cluster and seen the same issue).
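To make "more mini batches than expected" concrete, here is a minimal sketch of how the expected count is derived. The row count and `mini_batch_size` below are hypothetical placeholders, not the actual values from my run:

```python
import math

# Hypothetical numbers for illustration only (not from the actual run):
total_rows = 100_000        # rows in the input CSV
mini_batch_size = 1_000     # ParallelRunStep mini_batch_size setting

# If each mini batch runs exactly once, this is how many should appear.
expected_batches = math.ceil(total_rows / mini_batch_size)
print(expected_batches)  # 100
```

Any count logged above this expected value indicates mini batches that were run more than once, which is what produces the duplicate output rows.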
The metric Total Mini Batches is present by default. I also log the batch size in my script, so I get one logged metric per mini batch.
Here are my observations:
I work around this by forcing uniqueness in the output file; however, this is not ideal for large datasets.
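The workaround amounts to de-duplicating the output after the fact. A minimal sketch with pandas, assuming the output has some key column that uniquely identifies each input row (the column names here are hypothetical, not from the actual pipeline):

```python
import pandas as pd

# Toy stand-in for the ParallelRunStep output file; row 2 is duplicated
# as if its mini batch had run twice.
out = pd.DataFrame({
    "row_id": [1, 2, 2, 3],                 # hypothetical unique key per input row
    "shap_value": [0.10, 0.25, 0.25, -0.05],
})

# Force uniqueness, keeping the first copy of each row.
deduped = out.drop_duplicates(subset=["row_id"], keep="first")
print(len(deduped))  # 3
```

On a large dataset this full-file de-duplication pass is exactly the extra cost described above, which is why documenting the duplicate-row behavior up front would help.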