Closed Sabutobi closed 4 years ago
@Sabutobi Could you share your DAG code? I suspect it's related to the fact that LocalExecutor runs the code using multiprocessing, and the producer can't be shared between processes. You need to make sure you instantiate the Producer inside your operator, not in the plain DAG definition file. I have not investigated how the executor actually loads those DAG files, so I may be wrong. Also, could you try SequentialExecutor with your code? It should not produce this behaviour if I am right.
Hi @tvoinarovskyi From airflow.cfg:

```
executor = SequentialExecutor
```

The code (imports added for completeness):

```python
from datetime import date, datetime
import json

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from kafka import KafkaProducer

kafka_host = 'kafka:9092'
producer = KafkaProducer(bootstrap_servers=kafka_host, acks='all', retries=0)

default_args = {
    'owner': 'airflow',
    'depends_on_past': True,
    'wait_for_downstream': True,
    'max_active_runs': 1,
    'start_date': datetime(date.today().year, date.today().month, date.today().day),
}

dag = DAG(
    dag_id='some_id',
    default_args=default_args,
    concurrency=1,
    catchup=False,
    schedule_interval="7 2-22/4 * * *")

def social_daily_load():
    print("Webtraffic upload started...")
    payload = {
        'topic': 'Webtraffic', "msg": 'load_webtraffic', "domainName": 'cleaf.it'
    }
    where_to_send = payload['topic']
    encoded_payload = json.dumps(payload).encode('utf-8')
    producer.send(where_to_send, encoded_payload)
    print('message sent')

main_load_op = PythonOperator(
    python_callable=social_daily_load, task_id="some_id", dag=dag)
```
The result:

```
webserver_1 | [2020-03-20 09:25:08,810] {{cluster.py:325}} DEBUG - Updated cluster metadata to ClusterMetadata(brokers: 1, topics: 3, groups: 0)
webserver_1 | Running %s on host %s <TaskInstance: hourly_webtraffic_dag.webtraffic_load 2020-03-20T02:07:00+00:00 [queued]> 09c5ff63d865
webserver_1 | Sending (key=None value=b'{"topic": "Webtraffic", "msg": "load_webtraffic", "domainName": "cleaf.it"}' headers=[]) to TopicPartition(topic='Webtraffic', partition=0)
webserver_1 | Allocating a new 16384 byte message buffer for TopicPartition(topic='Webtraffic', partition=0)
webserver_1 | Waking up the sender since TopicPartition(topic='Webtraffic', partition=0) is either full or getting a new batch
webserver_1 | [2020-03-20 09:25:18,881] {{kafka.py:471}} INFO - Closing the Kafka producer with 0 secs timeout.
webserver_1 | [2020-03-20 09:25:18,881] {{kafka.py:489}} INFO - Proceeding to force close the producer since pending requests could not be completed within timeout 0.
webserver_1 | [2020-03-20 09:25:18,882] {{kafka.py:502}} DEBUG - The Kafka producer has closed.
webserver_1 | [2020-03-20 09:25:18,972] {{kafka.py:461}} INFO - Kafka producer closed
webserver_1 | [2020-03-20 09:25:19,035] {{scheduler_job.py:1311}} INFO - Executor reports execution of hourly_webtraffic_dag.webtraffic_load execution_date=2020-03-20 02:07:00+00:00 exited with status success for try_number 1
```
@tvoinarovskyi thanks a lot for helping me. I've found a solution. I don't know whether it's perfect, but: in the DAG execution function I was only sending the message from the "global" producer. Instead, I now initialize the producer inside the DAG executor scope:

```python
def social_daily_load():
    # the line below was the solution for me
    producer = KafkaProducer(bootstrap_servers=kafka_host, acks='all', retries=0)
    payload = {
        'topic': 'Webtraffic', "msg": 'load_webtraffic', "domainName": 'cleaf.it'
    }
    where_to_send = payload['topic']
    encoded_payload = json.dumps(payload).encode('utf-8')
    producer.send(where_to_send, encoded_payload)
    print('message sent')
```
Problem: a new KafkaProducer is initialized for every message, and there can be performance issues with that. Thanks once more @tvoinarovskyi for the fresh ideas.
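One way to avoid paying for a new KafkaProducer on every message, while still keeping it out of the module scope, is to create it lazily and cache it per process. This is a sketch under the thread's assumptions (broker at `kafka:9092`, kafka-python installed); `get_producer` is a hypothetical helper, not an Airflow or kafka-python API:

```python
import json
from functools import lru_cache

KAFKA_HOST = 'kafka:9092'  # example broker address from this thread

@lru_cache(maxsize=1)
def get_producer():
    """Create the producer lazily, once per worker process.

    Nothing calls this at import time, so each forked worker builds its own
    producer on first use, and repeated tasks in the same process reuse it.
    """
    from kafka import KafkaProducer  # requires kafka-python and a reachable broker
    return KafkaProducer(bootstrap_servers=KAFKA_HOST, acks='all', retries=0)

def social_daily_load():
    payload = {'topic': 'Webtraffic', 'msg': 'load_webtraffic', 'domainName': 'cleaf.it'}
    producer = get_producer()
    future = producer.send(payload['topic'], json.dumps(payload).encode('utf-8'))
    future.get(timeout=10)  # block until the broker acknowledges the record
    producer.flush()
```

The deferred import also keeps the DAG file cheap to parse, which matters because Airflow re-parses DAG files frequently.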
@Sabutobi Great that it worked for you. A few points:

- Make sure the batch actually leaves the process: call `producer.flush()`, or wait on the futures returned by `send()`, or pass a timeout to `close()`. If the process finishes before the batch is flushed, you will lose the message. (See the log line `webserver_1 | Proceeding to force close the producer since pending requests could not be completed within timeout 0.` — that is not good.)
- Sadly, Airflow is meant to be run that way: it recreates the DAG for each run, and depending on the executor model it is not possible to share state. My suggestion about a global Producer may work for Local/Sequential, but will not work for Celery or Kubernetes, as they spawn a process per run. I would recommend sending big chunks of data per DAGRun.

I hope you will find your best configuration!
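The options above can be wrapped in a small helper. This is a sketch, not part of kafka-python itself: `send_reliably` is a hypothetical name, and it uses only documented calls (`send()`, the returned future's `get(timeout)`, and `flush()`):

```python
def send_reliably(producer, topic, payload_bytes, timeout=10):
    """Send one record and do not return until it is safely handed off.

    `producer` is an already-created KafkaProducer (or anything exposing the
    same send/flush interface).
    """
    future = producer.send(topic, payload_bytes)
    # Option 1: block on this record; raises on delivery failure instead
    # of silently dropping the message when the process exits.
    record_metadata = future.get(timeout=timeout)
    # Option 2: drain anything else still buffered before the task returns.
    producer.flush()
    return record_metadata

# Option 3, at shutdown: give close() a timeout instead of the default
# force-close, e.g. producer.close(timeout=5)
```

Any one of the three is enough to avoid the "force close with 0 secs timeout" message loss seen in the log above; combining `get()` with `flush()` is simply belt-and-braces.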
@tvoinarovskyi I have added `flush()`; I just forgot to copy it into the GitHub post. Thanks once more for helping me.
Oh, and just a side note:

```python
'start_date': datetime(date.today().year, date.today().month, date.today().day),
```

Set the start date to a specific, static date; you will have a bunch of problems with the Admin interface if it is not static. You already set `catchup=False`, so it will not backfill DAGRuns if they are missed.
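For reference, a static `start_date` could look like this (the date itself is just an example). A `start_date` recomputed from `date.today()` changes on every DAG-file parse, while a fixed one is evaluated identically every time:

```python
from datetime import datetime

default_args = {
    'owner': 'airflow',
    # Fixed date in the past (example value): stable across every
    # DAG-file parse and scheduler restart.
    'start_date': datetime(2020, 3, 1),
}
```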
Hi, all. This is the kafka-docker.
That is the airflow container:
How does that work if I just connect via `docker exec -it containerId bash`? It works fine and no problems are met. But if I add identical code to the airflow DAG:
Nothing happens. The message is not sent. No errors. Nothing. Maybe you'll have some fresh ideas about what I've done wrong?