airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Sync logs every record to console, makes error debugging impossible on large ingestion #16324

Closed lewisdawson closed 2 years ago

lewisdawson commented 2 years ago
## Environment

- **Airbyte version**: 0.35.28-alpha
- **OS Version / Instance**: Debian 10, Docker
- **Deployment**: Docker
- **Source Connector and version**: Custom CDK Python connector
- **Destination Connector and version**: Postgres 0.3.13
- **Step where error happened**: Sync job

## Current Behavior

I have a Python `HttpStream` source connector `yield`ing over 20 million rows, connected to a PostgreSQL destination. Every row that's ingested is output to the logs/console. For example:

```json
{"type": "LOG", "log": {"level": "INFO", "message": "Starting syncing mystream"}}
{"type": "LOG", "log": {"level": "INFO", "message": "Successfully checked connect parameters for mystream."}}
{"type": "LOG", "log": {"level": "INFO", "message": "Syncing stream: mystream"}}
{"type": "RECORD", "record": {"stream": "mystream", "data": {}, "emitted_at": 1662260358495}}
{"type": "RECORD", "record": {"stream": "mystream", "data": {}, "emitted_at": 1662260358495}}
{"type": "RECORD", "record": {"stream": "mystream", "data": {}, "emitted_at": 1662260358496}}
...
```

The logs are so large that they fail to load in the UI, which makes it impossible to debug the issue I'm running into during ingestion. In addition, this is likely slowing ingestion down and consuming vast amounts of disk space (the job reports 30 GB of space consumed).

## Expected Behavior

I would expect a count of records processed to be logged every 1,000 entries, not a log line for every single record. **Is there any switch/mechanism that can be used to suppress this behavior today?**

## Logs

Log example provided above.

## Steps to Reproduce

1. Create a `HttpStream` connector.
2. Connect the `HttpStream` connector to a Postgres destination.
3. Watch it log every record consumed to the console.

## Are you willing to submit a PR?

With someone to point me to where to look, yes.
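For illustration, the batched progress logging described under Expected Behavior might look something like the sketch below. The `relay_messages` helper, logger name, and batching threshold are all invented here, not existing Airbyte code:

```python
# Hypothetical sketch (not Airbyte code): emit one progress log line per
# 1,000 records instead of logging each record individually. The records
# themselves must still be printed, because in the Airbyte protocol stdout
# is how the source hands records to the platform; only the human-readable
# progress logging is throttled.
import logging
from typing import Iterable

logger = logging.getLogger("airbyte")


def relay_messages(messages: Iterable[str], batch_size: int = 1000) -> None:
    count = 0
    for message in messages:
        print(message)  # protocol output, consumed downstream
        count += 1
        if count % batch_size == 0:
            logger.info("Processed %d records", count)
```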
lewisdawson commented 2 years ago

The issue may be with the print() statement in entrypoint.py:

```python
def launch(source: Source, args: List[str]):
    source_entrypoint = AirbyteEntrypoint(source)
    parsed_args = source_entrypoint.parse_args(args)
    for message in source_entrypoint.run(parsed_args):
        print(message)
```

It prints every single row consumed. Perhaps this could be gated behind a flag, or demoted to a debug-level logging statement?
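A hypothetical variant of `launch()` with such a flag might look like this (the `AIRBYTE_LOG_RECORDS` environment variable is invented for illustration; note that suppressing RECORD messages entirely would break the sync, since stdout is the protocol transport to the destination):

```python
# Hypothetical sketch only: gate the per-message print behind an invented
# AIRBYTE_LOG_RECORDS environment variable. In reality, messages on stdout
# are the transport between connector and platform and cannot simply be
# dropped; this only illustrates the shape of the suggested flag.
import os
from typing import List

from airbyte_cdk.entrypoint import AirbyteEntrypoint
from airbyte_cdk.sources import Source


def launch(source: Source, args: List[str]) -> None:
    source_entrypoint = AirbyteEntrypoint(source)
    parsed_args = source_entrypoint.parse_args(args)
    log_records = os.environ.get("AIRBYTE_LOG_RECORDS", "true") == "true"
    for message in source_entrypoint.run(parsed_args):
        if log_records:
            print(message)
```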

natalyjazzviolin commented 2 years ago

@lewisdawson would you be able to upgrade Airbyte to the newest version to see if the issue persists?

lewisdawson commented 2 years ago

Did a bit more digging into this and figured out the issue: it was related to how my source connector was written. In short, when the `HttpStream.parse_response()` function uses `return` statements to return the data, every record gets printed to stdout. When I modified the function to use `yield` statements instead, each record is no longer printed.
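For anyone hitting the same thing, here is a minimal sketch of the two styles described above. The stream name, URL, and `results` response field are made up; the `HttpStream` methods shown follow the CDK's usual subclassing pattern:

```python
# Minimal sketch of the return-vs-yield difference in parse_response().
# Stream name, endpoint, and response shape are hypothetical.
from typing import Any, Iterable, Mapping, Optional

import requests
from airbyte_cdk.sources.streams.http import HttpStream


class MyStream(HttpStream):
    url_base = "https://api.example.com/"
    primary_key = "id"

    def path(self, **kwargs) -> str:
        return "records"

    def next_page_token(self, response: requests.Response) -> Optional[Mapping[str, Any]]:
        return None  # single page, for brevity

    # Original version, which triggered per-record console output:
    # def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
    #     return response.json()["results"]

    # Fixed version, yielding records one at a time:
    def parse_response(self, response: requests.Response, **kwargs) -> Iterable[Mapping]:
        for record in response.json()["results"]:
            yield record
```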

This issue is good to close out. However, if anyone with more knowledge of Airbyte understands why this happens, I would love to get an explanation of this behavior before it's closed.

natalyjazzviolin commented 2 years ago

I'm glad you were able to figure it out, and thank you for the detailed writeup! I will close the issue as it's been solved, but feel free to post on the forum or in Slack to continue this discussion!