NetApp / netapp-dataops-toolkit

The NetApp DataOps Toolkit is a Python library that makes it simple for developers, data scientists, DevOps engineers, and data engineers to perform various data management tasks, such as near-instantaneously provisioning, cloning, or snapshotting a data volume or JupyterLab workspace.
BSD 3-Clause "New" or "Revised" License

Question about TR-4798 Airflow sample code execution #4

Closed. sakaia closed this issue 3 years ago.

sakaia commented 3 years ago

Question about TR-4798 Airflow sample code

Apologies, this is not a NetApp Data Science Toolkit problem; it is an Airflow configuration problem involving the NetApp SDK. However, I have been struggling to solve it, and any suggestions would be appreciated.

I simplified the AI training code from TR-4798 down to just the snapshot step. It works fine on an Airflow instance installed via pip, but not on one deployed via Helm. Is there any idea how to solve this problem? The following commands were executed on the Airflow web server of the Helm deployment. In this environment, the Airflow sample DAG tutorial.py works fine.

airflow test ai_training_run3 model-snapshot 2021-01-24
airflow backfill ai_training_run3 -s 2021-01-24

The test code is below:

# Airflow DAG Definition: Snapshot for Airflow Helm
#
# Steps:
# 4. Model snapshot (for versioning/baselining)

from airflow.utils.dates import days_ago
from airflow.secrets import get_connections
from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

##### DEFINE PARAMETERS: Modify parameter values in this section to match your environment #####
## Define default args for DAG
ai_training_run_dag_default_args = {
    'owner': 'NetApp'
}

## Define DAG details
ai_training_run_dag = DAG(
    dag_id='ai_training_run3',
    default_args=ai_training_run_dag_default_args,
    schedule_interval=None,
    start_date=days_ago(2),
    tags=['training']
)
## Define volume details (change values as necessary to match your environment)

# ONTAP system details
airflowConnectionName = 'ontap_ai' # Name of the Airflow connection that contains connection details for your ONTAP system's cluster admin account
verifySSLCert = False # Denotes whether or not to verify the SSL cert when calling the ONTAP API

# Model volume
## test for snapshot
model_volume_pv_name = 'pvc-ab8c7f2b-1e8b-4758-acfd-6f698782efd8'

################################################################################################
# Define function that triggers the creation of a NetApp snapshot
def netappSnapshot(**kwargs) -> str :
    # Parse args
    for key, value in kwargs.items() :
        if key == 'pvName' :
            pvName = value
        elif key == 'verifySSLCert' :
            verifySSLCert = value
        elif key == 'airflowConnectionName' :
            airflowConnectionName = value

    # Install netapp_ontap package
    import sys, subprocess
    result = subprocess.check_output([sys.executable, '-m', 'pip', 'install', '--user', 'netapp-ontap'])
    print(str(result).replace('\\n', '\n'))

    # Import needed functions/classes
    from netapp_ontap import config as netappConfig
    from netapp_ontap.host_connection import HostConnection as NetAppHostConnection
    from netapp_ontap.resources import Volume, Snapshot
    from datetime import datetime
    import json

    # Retrieve ONTAP cluster admin account details from Airflow connection
    connections = get_connections(conn_id = airflowConnectionName)
    ontapConnection = connections[0] # Assumes that you only have one connection with the specified conn_id configured in Airflow
    ontapClusterAdminUsername = ontapConnection.login
    ontapClusterAdminPassword = ontapConnection.password
    ontapClusterMgmtHostname = ontapConnection.host
    # Configure connection to ONTAP cluster/instance
    netappConfig.CONNECTION = NetAppHostConnection(
        host = ontapClusterMgmtHostname,
        username = ontapClusterAdminUsername,
        password = ontapClusterAdminPassword,
        verify = verifySSLCert
    )

    # Convert pv name to ONTAP volume name
    # The following will not work if you specified a custom storagePrefix when creating your
    # Trident backend. If you specified a custom storagePrefix, you will need to update this
    # code to match your prefix.
    volumeName = 'trident_%s' % pvName.replace("-", "_")
    print('\npv name: ', pvName)
    print('ONTAP volume name: ', volumeName)

    # Create snapshot; print API response
    volume = Volume.find(name = volumeName)
    timestamp = datetime.today().strftime("%Y%m%d_%H%M%S")
    snapshot = Snapshot.from_dict({
        'name': 'airflow_%s' % timestamp,
        'comment': 'Snapshot created by a Apache Airflow DAG',
        'volume': volume.to_dict()
    })
    response = snapshot.post()
    print("\nAPI Response:")
    print(response.http_response.text)

    # Retrieve snapshot details
    snapshot.get()
    # Convert snapshot details to JSON string and print
    snapshotDetails = snapshot.to_dict()
    print("\nSnapshot Details:")
    print(json.dumps(snapshotDetails, indent=2))

    # Return name of newly created snapshot
    return snapshotDetails['name']

# Define DAG steps/workflow
with ai_training_run_dag as dag :
    model_snapshot = PythonOperator(
        task_id='model-snapshot',
        python_callable=netappSnapshot,
        op_kwargs={
            'airflowConnectionName': airflowConnectionName,
            'pvName': model_volume_pv_name,
            'verifySSLCert': verifySSLCert
        },
        dag=dag
    )

The successful log from airflow test is as follows:

/home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_generator.py:39: DeprecationWarning: This module is deprecated. Please use `airflow.kubernetes.pod`.
  from airflow.contrib.kubernetes.pod import _extract_volume_mounts
[2021-01-29 09:24:06,680] {__init__.py:50} INFO - Using executor KubernetesExecutor
[2021-01-29 09:24:06,685] {dagbag.py:417} INFO - Filling up the DagBag from /opt/airflow/dags
[2021-01-29 09:24:06,728] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: ai_training_run3.model-snapshot 2021-01-24T00:00:00+00:00 [failed]>
[2021-01-29 09:24:06,757] {taskinstance.py:670} INFO - Dependencies all met for <TaskInstance: ai_training_run3.model-snapshot 2021-01-24T00:00:00+00:00 [failed]>
[2021-01-29 09:24:06,757] {taskinstance.py:880} INFO -
--------------------------------------------------------------------------------
[2021-01-29 09:24:06,757] {taskinstance.py:881} INFO - Starting attempt 1 of 1
[2021-01-29 09:24:06,757] {taskinstance.py:882} INFO -
--------------------------------------------------------------------------------
[2021-01-29 09:24:06,759] {taskinstance.py:901} INFO - Executing <Task(PythonOperator): model-snapshot> on 2021-01-24T00:00:00+00:00
You are using pip version 19.0.2, however version 21.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
b'Requirement already satisfied: netapp-ontap in /home/airflow/.local/lib/python3.6/site-packages (9.8.0)
Requirement already satisfied: marshmallow>=3.2.1 in /home/airflow/.local/lib/python3.6/site-packages (from netapp-ontap) (3.10.0)
Requirement already satisfied: requests-toolbelt>=0.9.1 in /home/airflow/.local/lib/python3.6/site-packages (from netapp-ontap) (0.9.1)
Requirement already satisfied: requests>=2.21.0 in /home/airflow/.local/lib/python3.6/site-packages (from netapp-ontap) (2.24.0)
Requirement already satisfied: idna<3,>=2.5 in /home/airflow/.local/lib/python3.6/site-packages (from requests>=2.21.0->netapp-ontap) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /home/airflow/.local/lib/python3.6/site-packages (from requests>=2.21.0->netapp-ontap) (2020.6.20)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /home/airflow/.local/lib/python3.6/site-packages (from requests>=2.21.0->netapp-ontap) (1.25.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /home/airflow/.local/lib/python3.6/site-packages (from requests>=2.21.0->netapp-ontap) (3.0.4)
'

pv name:  pvc-ab8c7f2b-1e8b-4758-acfd-6f698782efd8
ONTAP volume name:  trident_pvc_ab8c7f2b_1e8b_4758_acfd_6f698782efd8
[2021-01-29 09:24:15,304] {utils.py:183} INFO - Job (success): success. Timeout remaining: 25.

API Response:
{
  "uuid": "be15648f-6213-11eb-873b-000c29b49d36",
  "state": "success",
  "message": "success",
  "_links": {
    "self": {
      "href": "/api/cluster/jobs/be15648f-6213-11eb-873b-000c29b49d36"
    }
  }
}

Snapshot Details:
{
  "_links": {
    "self": {
      "href": "/api/storage/volumes/6482f102-5bc0-11eb-873b-000c29b49d36/snapshots/be1555c7-6213-11eb-873b-000c29b49d36"
    }
  },
  "svm": {
    "_links": {
      "self": {
        "href": "/api/svm/svms/f2afbbc8-5bba-11eb-873b-000c29b49d36"
      }
    },
    "name": "svm0",
    "uuid": "f2afbbc8-5bba-11eb-873b-000c29b49d36"
  },
  "name": "airflow_20210129_092410",
  "volume": {
    "_links": {
      "self": {
        "href": "/api/storage/volumes/6482f102-5bc0-11eb-873b-000c29b49d36"
      }
    },
    "name": "trident_pvc_ab8c7f2b_1e8b_4758_acfd_6f698782efd8",
    "uuid": "6482f102-5bc0-11eb-873b-000c29b49d36"
  },
  "create_time": "2021-01-29T09:24:09+00:00",
  "comment": "Snapshot created by a Apache Airflow DAG",
  "uuid": "be1555c7-6213-11eb-873b-000c29b49d36"
}
[2021-01-29 09:24:15,871] {python_operator.py:114} INFO - Done. Returned value was: airflow_20210129_092410
[2021-01-29 09:24:15,889] {taskinstance.py:1070} INFO - Marking task as SUCCESS.dag_id=ai_training_run3, task_id=model-snapshot, execution_date=20210124T000000, start_date=20210129T090528, end_date=20210129T092415

The failed log from airflow backfill is as follows:

/home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_generator.py:39: DeprecationWarning: This module is deprecated. Please use `airflow.kubernetes.pod`.
  from airflow.contrib.kubernetes.pod import _extract_volume_mounts
[2021-01-29 09:26:08,601] {__init__.py:50} INFO - Using executor KubernetesExecutor
[2021-01-29 09:26:08,603] {dagbag.py:417} INFO - Filling up the DagBag from /opt/airflow/dags
[2021-01-29 09:26:08,650] {kubernetes_executor.py:770} INFO - Start Kubernetes executor
[2021-01-29 09:26:08,787] {kubernetes_executor.py:302} INFO - Event: and now my watch begins starting at resource_version: 0
[2021-01-29 09:26:08,798] {kubernetes_executor.py:698} INFO - When executor started up, found 0 queued task instances
[2021-01-29 09:26:08,823] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-78cbebdb8df042d1b9b1464ccfc20323 had an event of type ADDED
[2021-01-29 09:26:08,823] {kubernetes_executor.py:371} INFO - Event: aitrainingrun3modelsnapshot-78cbebdb8df042d1b9b1464ccfc20323 Failed
[2021-01-29 09:26:08,832] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-7888c3346dda402f885266545979eff2 had an event of type ADDED
[2021-01-29 09:26:08,832] {kubernetes_executor.py:371} INFO - Event: aitrainingrun3modelsnapshot-7888c3346dda402f885266545979eff2 Failed
[2021-01-29 09:26:08,838] {kubernetes_executor.py:327} INFO - Event: aitrainingrun2modelsnapshot-ff91e3c9351d41d6a959233e9b62507f had an event of type ADDED
[2021-01-29 09:26:08,838] {kubernetes_executor.py:371} INFO - Event: aitrainingrun2modelsnapshot-ff91e3c9351d41d6a959233e9b62507f Failed
[2021-01-29 09:26:08,912] {base_executor.py:58} INFO - Adding to queue: ['airflow', 'run', 'ai_training_run3', 'model-snapshot', '2021-01-22T00:00:00+00:00', '--pickle', '23', '--local', '--pool', 'default_pool']
[2021-01-29 09:26:13,632] {kubernetes_executor.py:792} INFO - Add task ('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 22, 0, 0, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1) with command ['airflow', 'run', 'ai_training_run3', 'model-snapshot', '2021-01-22T00:00:00+00:00', '--pickle', '23', '--local', '--pool', 'default_pool'] with executor_config {}
[2021-01-29 09:26:13,637] {kubernetes_executor.py:499} INFO - Attempting to finish pod; pod_id: aitrainingrun3modelsnapshot-78cbebdb8df042d1b9b1464ccfc20323; state: failed; labels: {'airflow-worker': '74162e61-0097-4dd7-921d-dd290889ba8c', 'airflow_version': '1.10.12', 'dag_id': 'ai_training_run3', 'execution_date': '2021-01-23T00_00_00_plus_00_00', 'kubernetes_executor': 'True', 'task_id': 'model-snapshot', 'try_number': '1'}
[2021-01-29 09:26:13,645] {kubernetes_executor.py:599} INFO - Found matching task ai_training_run3-model-snapshot (2021-01-23 00:00:00+00:00) with current state of failed
[2021-01-29 09:26:13,646] {kubernetes_executor.py:499} INFO - Attempting to finish pod; pod_id: aitrainingrun3modelsnapshot-7888c3346dda402f885266545979eff2; state: failed; labels: {'airflow-worker': '74162e61-0097-4dd7-921d-dd290889ba8c', 'airflow_version': '1.10.12', 'dag_id': 'ai_training_run3', 'execution_date': '2021-01-24T00_00_00_plus_00_00', 'kubernetes_executor': 'True', 'task_id': 'model-snapshot', 'try_number': '1'}
[2021-01-29 09:26:13,652] {kubernetes_executor.py:599} INFO - Found matching task ai_training_run3-model-snapshot (2021-01-24 00:00:00+00:00) with current state of failed
[2021-01-29 09:26:13,654] {kubernetes_executor.py:499} INFO - Attempting to finish pod; pod_id: aitrainingrun2modelsnapshot-ff91e3c9351d41d6a959233e9b62507f; state: failed; labels: {'airflow-worker': '74162e61-0097-4dd7-921d-dd290889ba8c', 'airflow_version': '1.10.12', 'dag_id': 'ai_training_run2', 'execution_date': '2021-01-23T00_00_00_plus_00_00', 'kubernetes_executor': 'True', 'task_id': 'model-snapshot', 'try_number': '1'}
[2021-01-29 09:26:13,659] {kubernetes_executor.py:599} INFO - Found matching task ai_training_run2-model-snapshot (2021-01-23 00:00:00+00:00) with current state of failed
[2021-01-29 09:26:13,661] {kubernetes_executor.py:813} INFO - Changing state of (('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 23, 0, 0, tzinfo=tzlocal()), 1), 'failed', 'aitrainingrun3modelsnapshot-78cbebdb8df042d1b9b1464ccfc20323', 'af0', '6787287') to failed
[2021-01-29 09:26:13,662] {kubernetes_executor.py:813} INFO - Changing state of (('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 24, 0, 0, tzinfo=tzlocal()), 1), 'failed', 'aitrainingrun3modelsnapshot-7888c3346dda402f885266545979eff2', 'af0', '6786958') to failed
[2021-01-29 09:26:13,663] {kubernetes_executor.py:813} INFO - Changing state of (('ai_training_run2', 'model-snapshot', datetime.datetime(2021, 1, 23, 0, 0, tzinfo=tzlocal()), 1), 'failed', 'aitrainingrun2modelsnapshot-ff91e3c9351d41d6a959233e9b62507f', 'af0', '6784547') to failed
[2021-01-29 09:26:13,668] {kubernetes_executor.py:429} INFO - Kubernetes job is (('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 22, 0, 0, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>), 1), ['airflow', 'run', 'ai_training_run3', 'model-snapshot', '2021-01-22T00:00:00+00:00', '--pickle', '23', '--local', '--pool', 'default_pool'], None)
/home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_launcher.py:330: DeprecationWarning: Using `airflow.contrib.kubernetes.pod.Pod` is deprecated. Please use `k8s.V1Pod`.
  security_context=_extract_security_context(pod.spec.security_context)
/home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_launcher.py:77: DeprecationWarning: Using `airflow.contrib.kubernetes.pod.Pod` is deprecated. Please use `k8s.V1Pod` instead.
  pod = self._mutate_pod_backcompat(pod)
[2021-01-29 09:26:13,714] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type ADDED
[2021-01-29 09:26:13,714] {kubernetes_executor.py:369} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 Pending
[2021-01-29 09:26:13,716] {backfill_job.py:247} WARNING - ('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 23, 0, 0, tzinfo=tzlocal()), 1) state failed not in running=dict_values([<TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [queued]>])
[2021-01-29 09:26:13,717] {backfill_job.py:247} WARNING - ('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 24, 0, 0, tzinfo=tzlocal()), 1) state failed not in running=dict_values([<TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [queued]>])
[2021-01-29 09:26:13,717] {backfill_job.py:247} WARNING - ('ai_training_run2', 'model-snapshot', datetime.datetime(2021, 1, 23, 0, 0, tzinfo=tzlocal()), 1) state failed not in running=dict_values([<TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [queued]>])
[2021-01-29 09:26:13,719] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type MODIFIED
[2021-01-29 09:26:13,720] {kubernetes_executor.py:369} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 Pending
[2021-01-29 09:26:13,746] {backfill_job.py:364} INFO - [backfill progress] | finished run 0 of 1 | tasks waiting: 0 | succeeded: 0 | running: 1 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 0
[2021-01-29 09:26:14,706] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type MODIFIED
[2021-01-29 09:26:14,706] {kubernetes_executor.py:369} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 Pending
[2021-01-29 09:26:18,662] {backfill_job.py:364} INFO - [backfill progress] | finished run 0 of 1 | tasks waiting: 0 | succeeded: 0 | running: 1 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 0
[2021-01-29 09:26:21,461] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type MODIFIED
[2021-01-29 09:26:21,461] {kubernetes_executor.py:369} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 Pending
[2021-01-29 09:26:22,907] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type MODIFIED
[2021-01-29 09:26:22,907] {kubernetes_executor.py:377} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 is Running
[2021-01-29 09:26:23,673] {backfill_job.py:364} INFO - [backfill progress] | finished run 0 of 1 | tasks waiting: 0 | succeeded: 0 | running: 1 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 0
[2021-01-29 09:26:38,081] {backfill_job.py:364} INFO - [backfill progress] | finished run 0 of 1 | tasks waiting: 0 | succeeded: 0 | running: 1 | failed: 0 | skipped: 0 | deadlocked: 0 | not ready: 0
[2021-01-29 09:26:41,675] {kubernetes_executor.py:327} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 had an event of type MODIFIED
[2021-01-29 09:26:41,676] {kubernetes_executor.py:371} INFO - Event: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3 Failed
[2021-01-29 09:26:57,524] {kubernetes_executor.py:499} INFO - Attempting to finish pod; pod_id: aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3; state: failed; labels: {'airflow-worker': '74162e61-0097-4dd7-921d-dd290889ba8c', 'airflow_version': '1.10.12', 'dag_id': 'ai_training_run3', 'execution_date': '2021-01-22T00_00_00_plus_00_00', 'kubernetes_executor': 'True', 'task_id': 'model-snapshot', 'try_number': '1'}
[2021-01-29 09:26:57,531] {kubernetes_executor.py:599} INFO - Found matching task ai_training_run3-model-snapshot (2021-01-22 00:00:00+00:00) with current state of queued
[2021-01-29 09:26:57,533] {kubernetes_executor.py:813} INFO - Changing state of (('ai_training_run3', 'model-snapshot', datetime.datetime(2021, 1, 22, 0, 0, tzinfo=tzlocal()), 1), 'failed', 'aitrainingrun3modelsnapshot-97c7cc7133504edca58f449d21557ca3', 'af0', '6789836') to failed
[2021-01-29 09:26:57,543] {backfill_job.py:261} ERROR - Executor reports task instance <TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
[2021-01-29 09:26:57,550] {taskinstance.py:1150} ERROR - Executor reports task instance <TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. Was the task killed externally?
NoneType: None
[2021-01-29 09:26:57,550] {taskinstance.py:1194} INFO - Marking task as FAILED. dag_id=ai_training_run3, task_id=model-snapshot, execution_date=20210122T000000, start_date=20210129T092608, end_date=20210129T092657
[2021-01-29 09:26:57,564] {backfill_job.py:205} ERROR - Task instance <TaskInstance: ai_training_run3.model-snapshot 2021-01-22 00:00:00+00:00 [failed]> failed
[2021-01-29 09:26:57,571] {dagrun.py:311} INFO - Marking run <DagRun ai_training_run3 @ 2021-01-22T00:00:00+00:00: backfill_2021-01-22T00:00:00+00:00, externally triggered: False> failed
[2021-01-29 09:26:57,575] {backfill_job.py:364} INFO - [backfill progress] | finished run 1 of 1 | tasks waiting: 0 | succeeded: 0 | running: 0 | failed: 1 | skipped: 0 | deadlocked: 0 | not ready: 0
[2021-01-29 09:26:57,577] {kubernetes_executor.py:892} INFO - Shutting down Kubernetes executor
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 37, in <module>
    args.func(args)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/cli.py", line 76, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 236, in backfill
    run_backwards=args.run_backwards
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/models/dag.py", line 1432, in run
    job.run()
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/jobs/base_job.py", line 218, in run
    self._execute()
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/db.py", line 74, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/jobs/backfill_job.py", line 794, in _execute
    raise AirflowException(err)
airflow.exceptions.AirflowException: Some task instances failed:
DAG ID            Task ID         Execution date               Try number
----------------  --------------  -------------------------  ------------
ai_training_run3  model-snapshot  2021-01-22 00:00:00+00:00             1

The failed airflow run log is as follows:

/home/airflow/.local/lib/python3.6/site-packages/airflow/kubernetes/pod_generator.py:39: DeprecationWarning: This module is deprecated. Please use `airflow.kubernetes.pod`.
  from airflow.contrib.kubernetes.pod import _extract_volume_mounts
[2021-01-29 09:31:21,766] {__init__.py:50} INFO - Using executor KubernetesExecutor
[2021-01-29 09:31:21,770] {dagbag.py:417} INFO - Filling up the DagBag from /opt/airflow/dags
Running %s on host %s <TaskInstance: ai_training_run3.model-snapshot 2021-01-22T00:00:00+00:00 [failed]> airflow-web-766b564fb4-bm6x9
mboglesby commented 3 years ago

Closing since this is not a data science toolkit issue. I believe that this is the same issue that is being discussed here: https://github.com/NetAppDocs/netapp-solutions/issues/28. Please let me know if that is not correct.