We discussed this and refined it further: add a parameter
int transientFailureRetries
whose default is 3; setting it to 0 would result in the existing behavior. Before we do this, we would need to investigate which kinds of exceptions might be thrown, to identify which are transient.
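As a rough sketch of what the proposed retry behavior could look like (the helper name, its parameters, and the backoff values here are illustrative and not part of the connector today; the isTransient predicate is deliberately left to the caller, pending that investigation):

```scala
import scala.util.{Failure, Success, Try}

// Sketch only: retry an operation up to `transientFailureRetries` times when the
// failure is classified as transient; transientFailureRetries = 0 keeps today's
// fail-fast behavior. The classification predicate is supplied by the caller.
def withTransientRetries[T](transientFailureRetries: Int, delayMs: Long = 1000L)
                           (isTransient: Throwable => Boolean)
                           (op: => T): T =
  Try(op) match {
    case Success(result) => result
    case Failure(e) if transientFailureRetries > 0 && isTransient(e) =>
      Thread.sleep(delayMs)                 // simple exponential backoff
      withTransientRetries(transientFailureRetries - 1, delayMs * 2)(isTransient)(op)
    case Failure(e) => throw e              // permanent error, or retries exhausted
  }
```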
This bug is not high priority because the effect of the bug is:
We should also document best practices for users: catching exceptions from the write and retrying appropriately.
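For example, until the connector retries on its own, users can wrap the write in a small retry loop. This is only a sketch; the format and option names are assumed from the connector's documentation, and the cluster/database/table values are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// User-side workaround sketch: catch failures from the Kusto write and retry a
// few times before giving up. Adjust the options to whatever your job already uses.
def writeWithRetry(df: DataFrame, maxRetries: Int = 3): Unit = {
  var attempt = 0
  var done = false
  while (!done) {
    try {
      df.write
        .format("com.microsoft.kusto.spark.datasource") // assumed format name
        .option("kustoCluster", "<cluster>")            // placeholders
        .option("kustoDatabase", "<database>")
        .option("kustoTable", "<table>")
        .mode(SaveMode.Append)
        .save()
      done = true
    } catch {
      case e: Exception if attempt < maxRetries =>
        attempt += 1
        Thread.sleep(1000L * attempt) // back off a little before retrying
      // once attempt == maxRetries, the exception propagates to the caller
    }
  }
}
```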
Reenactment/Timeline
Issues
There are 2 issues here, at least 1 of which has something of a workaround:
Expected behavior
Instead, the Connector should:
Substantiation and more details
This happens rarely on most clusters but frequently on a few, and the amount of Spark ingestion a cluster does doesn't seem to correlate with how many of these issues it encounters. This confirms that the proximate cause is the service having health issues and sending an error response to the Connector, causing it to give up. Comparing the clusters facing this same issue:
let sources = Usage | where KustoClient has_cs "Kusto.Spark.Connector" | project Source;
KustoLogs
| where Source in (sources) and Level == "Error" and EventText startswith_cs "Table 'sparkTempTable" and EventText has_cs "could not be found" and Timestamp > ago(7d)
| project Source, Timestamp, Directory, ActivityType, EventText=substring(EventText, 0, 50)
| summarize count(), min(Timestamp), max(Timestamp), take_any(EventText) by Source
| top 15 by count_
To the clusters that ingest via the Spark connector:
Usage
| where KustoClient has_cs "Kusto.Spark.Connector" and Text startswith_cs ".move " and Text contains_cs "sparkTempTable"
| summarize count(), Text=take_any(substring(Text, 0, 50)) by Source
| top 30 by count_
From:
KustoLogs
| where (Source in ("ADXTPMSDEVADXCLUSTER", "INGEST-ADXTPMSDEVADXCLUSTER") and Timestamp between (datetime(2022-04-06 09:34:03.9231199) .. datetime(2022-04-06 11:37:19.2406867)))
    or (Source in ("HOSTAXISKUSTOPDNS", "INGEST-HOSTAXISKUSTOPDNS") and Timestamp between (datetime(2022-04-06 03:17:40.6338220) .. datetime(2022-04-07 07:03:03.4836465)))
| where Level == "Error" and EventText !contains "TableNotFoundException" and EventText !contains "FabricManager" and EventText !contains "Cannot access a disposed object." and SourceId !in ("5F76F3E6", "9EE66924")
| summarize count(), EventText=take_any(EventText) by SourceId, Source
| order by count_ desc
// | top 4 by count_
We see that the issues these clusters are facing are all sorts of network, throttling, and storage exceptions. So the Connector gets such a response from the service, and then immediately gives up and deletes the destination table.
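Any retry logic would therefore need to treat these as transient rather than permanent failures. A minimal classification sketch, assuming we key off exception types and message fragments (the patterns below are illustrative, not a confirmed or exhaustive list):

```scala
import java.io.IOException

// Illustrative only: deciding which exceptions are truly transient is exactly the
// investigation this issue calls for. This sketch keys off the kinds of signals
// seen in the logs above (network, throttling, storage), by type and by message.
def isTransient(e: Throwable): Boolean = {
  val msg = Option(e.getMessage).getOrElse("").toLowerCase
  e.isInstanceOf[IOException] ||   // network-level failures
    msg.contains("throttl")   ||   // throttling responses
    msg.contains("storage")   ||   // transient storage errors
    msg.contains("timeout")
}
```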