databricks / terraform-provider-databricks

Databricks Terraform Provider
https://registry.terraform.io/providers/databricks/databricks/latest

[ISSUE] Issue with databricks cluster creation in the workspace. #2325

Open sivaprasad-cs opened 1 year ago

sivaprasad-cs commented 1 year ago

Configuration

data "databricks_node_type" "m4xlarge-general-purpose" {
  category      = "General Purpose"
  min_cores     = 4
  min_memory_gb = 16
}

# Creating Immuta Testing Cluster
resource "databricks_cluster" "immuta-testing-cluster-2" {
  cluster_name            = "Immuta Testing Cluster 2"
  spark_version           = data.databricks_spark_version.latest-lts.id
  node_type_id            = data.databricks_node_type.m4xlarge-general-purpose.id
  autotermination_minutes = 30
  is_pinned               = true
  custom_tags             = { Terraform = "true", ResourceClass = "Serverless", Environment = var.env_name }
  enable_elastic_disk     = true

  autoscale {
    min_workers = 1
    max_workers = 2
  }

  aws_attributes {
    instance_profile_arn   = "arn:aws:iam::293149407579:instance-profile/LoopioDatabricksPassRole/LoopioClusterProfile-WorkflowsTesting"
    availability           = "SPOT_WITH_FALLBACK"
    zone_id                = "auto"
    first_on_demand        = 1
    spot_bid_price_percent = 100
  }

  spark_conf = {
    "spark.databricks.hive.metastore.glueCatalog.enabled" : true,
    "spark.databricks.repl.allowedLanguages" : "python,sql,r",
    "spark.databricks.cluster.profile" : "serverless",
  }
}

Expected Behavior

The instance type should be m4.xlarge

Actual Behavior

The customer was running the above Terraform code for deployment. Two weeks ago they were getting the m4.xlarge instance type in the deployment, but now they are getting m-fleet. The customer is asking:

why this changed in the deployment, what changes were made during this period, and how this selection happens.

Steps to Reproduce

Run terraform plan using the above code

Terraform and provider versions

version = "1.16.0"

Debug Output

\"is_io_cache_enabled\": false,\n      \"memory_mb\": 131072,\n      \"node_instance_type\": {\n        \"instance_family\": \"EC2 c7g Graviton Family vCPUs\",\n        \"instance_type_id\": \"c7g.16xlarge\",\n        \"is_encrypted_in_transit\": false,\n        \"is_graviton\": true,\n        \"is_virtual\": false,\n        \"local_disk_size_gb\": 0,\n        \"local_disks\": 0,\n        \"swap_size\": \"10g\"\n      },\n      \"node_type_id\": \"c7g.16xlarge\",\n      \"num_cores\": 64,\n      \"num_gpus\": 0,\n      \"photon_driver_capable\": false,\n      \"photon_worker_capable\": false,\n      \"require_fabric_manager\": false,\n      \"support_cluster_tags\": true,\n      \"support_ebs_volumes\": true,\n      \"support_port_forwarding\": true\n    },\n    {\n      \"category\": \"Compute Optimized\",\n      \"description\": \"c6i.xlarge\",\n      \"display_order\": 0,\n      \"instance_type_id\": \"c6i.xlarge\",\n      \"is_deprecated\": false,\n      \"is_encrypted_in_transit\": true,\n      \"is_graviton\": false,\n      \"is_hidden\": false,\n      \"is_io_cache_enabled\": false,\n      \"memory_mb\": 8192,\n      \"node_instance_type\": {\n        \"instance_family\": \"EC2 c6i Family vCPUs\",\n        \"instance_type_id\": \"c6i.xlarge\",\n        \"is_encrypted_in_transit\": true,\n        \"is_graviton\": false,... (105978 more bytes) \u003c- GET /api/2.0/clusters/list-node-types","@timestamp":"2023-05-17T11:56:11.270061-04:00"}
**data.databricks_node_type.m4xlarge-general-purpose: Read complete after 0s [id=m-fleet.xlarge]**
2023-05-17T11:56:11.285-0400 [INFO]  ReferenceTransformer: reference not found: "local.workflows_testing_instance_profile_arn"
2023-05-17T11:56:11.285-0400 [DEBUG] ReferenceTransformer: "databricks_cluster.immuta-testing-cluster" references: []
databricks_cluster.immuta-testing-cluster: Refreshing state... [id=0516-152135-55rdihr3]
2023-05-17T11:56:11.291-0400 [DEBUG] provider.terraform-provider-databricks_v0.5.2: GET /api/2.0/clusters/get: 
alexott commented 1 year ago

Hmmm, I can't reproduce it:

data "databricks_node_type" "this" {
  category = "General Purpose"
  min_cores = 4
  min_memory_gb = 16
}

output "node" {
  value = data.databricks_node_type.this.id
}

gives the correct result:

Outputs:

node = "m4.xlarge"

Can you post a bigger log snippet, collected as follows:

TF_LOG=DEBUG DATABRICKS_DEBUG_TRUNCATE_BYTES=250000 terraform apply -auto-approve -no-color 2>&1 |tee 1.log
iandexter commented 1 year ago

@sivaprasad-cs -- Can you ask them to use the latest provider? From the log snippet above, it seems they're using an old version:

provider.terraform-provider-databricks_v0.5.2

alexott commented 1 year ago

ah, yes - they need to be on at least version 1.9.2
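To make sure that fix is actually picked up, the provider version can be pinned in the `required_providers` block. A minimal sketch (adjust the constraint to your needs):

```hcl
terraform {
  required_providers {
    databricks = {
      source  = "databricks/databricks"
      version = ">= 1.9.2" # minimum version with the fleet-instance fix
    }
  }
}
```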

sivaprasad-cs commented 1 year ago

Hi Ian / Alex,

I have asked the customer to test it with the new provider version. The customer has a few questions; could you please help with them?

As mentioned, "The fleet instance type is in GA now. That is the reason you are getting the m-fleet instance type during the terraform plan."

Are you implying that it is now being imposed on users by default to get fleet instances when using databricks_node_type to create clusters?

Are you saying users cannot opt out of using fleet instances if they want?

As a reminder, my original inquiry is, why are we getting fleet instances by default? Is there a setting to ensure we do not get fleet instances if we don't want it?

Regarding the observation of using an older version of the databricks provider, we have not been able to switch the provider version as we encountered errors in installing the provider (switching from databrickslab/databricks to databricks/databricks).

Is your diagnosis that the older provider is causing us to get fleet instances? If anything, the older version of the provider should not affect new changes to how the Databricks provider works. Is that an acceptable assumption?

alexott commented 1 year ago

The data source works by pulling all available instance types via the REST API and filtering them by the provided arguments. The fleet instances are part of the REST API output and are also compared against the provided arguments. The latest versions of the Databricks provider have a special flag for explicit selection of fleet instances, but old versions of the provider don't support that flag, so whichever instance best matches the provided parameters is selected. The same situation occurred with Graviton instances for provider versions < 0.5.0.
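On recent provider versions, that flag is the `fleet` argument on `databricks_node_type`: when it is unset, fleet instance types are excluded from the search, and setting it opts in. A minimal sketch, assuming the `fleet` argument as documented for recent provider versions (verify against the version you are on):

```hcl
data "databricks_node_type" "fleet-general-purpose" {
  category      = "General Purpose"
  min_cores     = 4
  min_memory_gb = 16
  fleet         = true # opt in to AWS fleet instance types; omit to exclude them
}
```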

If they don't want to upgrade, they just need to use the string names of instances instead of relying on the data source. But they really do need to upgrade: officially, only versions starting with 1.0 are supported. If there are problems with the upgrade, the account team needs to involve a specialist to help with it.
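Pinning the instance type by string name means hardcoding `node_type_id` instead of referencing the data source. A sketch based on the original configuration:

```hcl
resource "databricks_cluster" "immuta-testing-cluster-2" {
  cluster_name  = "Immuta Testing Cluster 2"
  spark_version = data.databricks_spark_version.latest-lts.id
  # Hardcoded instead of data.databricks_node_type.m4xlarge-general-purpose.id
  node_type_id  = "m4.xlarge"
  # ... remaining arguments unchanged ...
}
```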

iandexter commented 1 year ago

@sivaprasad-cs -- does @alexott's explanation suffice? Has the customer tried using the latest version?

As Alex noted, anything below 1.0 isn't supported, and even then, we highly suggest always using the latest version. If they still encounter any issues with the latest version, then we can certainly have another look.
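Regarding the errors mentioned earlier when switching from databrickslabs/databricks to databricks/databricks: besides updating the `source` in `required_providers`, existing state must be rewritten to point at the new provider namespace. A sketch of the usual migration steps (verify against the provider's own troubleshooting guide):

```shell
# Rewrite existing state entries to the new provider namespace
terraform state replace-provider databrickslabs/databricks databricks/databricks

# Re-initialize so the new provider is downloaded
terraform init -upgrade
```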