databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for tests, POCs, and other uses in Databricks environments, including in Delta Live Tables pipelines.
https://databrickslabs.github.io/dbldatagen

DblDatagenerator causes global logger to issue messages twice in some circumstances #167

Closed. MarvinSchenkel closed this issue 1 year ago.

MarvinSchenkel commented 1 year ago

Expected Behavior

The logging within the dbldatagen package should respect the logger that is being used in the package that imports dbldatagen.

Current Behavior

dbldatagen sometimes calls the module-level logging.info and logging.debug functions without calling getLogger() first, which reconfigures the root logger and interferes with whatever logger the calling package has set up.


Since this happens inside _version.py, which is imported from __init__.py, this behaviour occurs whenever we import any class from dbldatagen.
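For illustration, the pattern being described amounts to roughly the following in a module that runs at import time (a hypothetical sketch, not the actual _version.py source):

import logging

# Hypothetical stand-in for the package's version information
version_info = ("0", "3", "1")

# Module-level call: if the root logger has no handlers yet, this implicitly
# invokes logging.basicConfig(), attaching a default StreamHandler to the
# root logger as a side effect of importing the package.
logging.info("Version : %s", version_info)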

Steps to Reproduce (for bugs)

import logging
date_format = "%Y-%m-%d %H:%M:%S"
log_format = "%(asctime)s %(levelname)-8s  %(message)s"
formatter = logging.Formatter(log_format, date_format)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger(__name__)
logger.setLevel(level=logging.INFO)
logger.addHandler(handler)
logger.info("Info message")
# Prints: 2023-03-08 14:18:59 INFO      Info message
from dbldatagen import DataGenerator
# Prints: INFO: Version : VersionInfo(major='0', minor='3', patch='1', release='', build='')
logger.info("Info message")
# Prints:
# 2023-03-08 14:18:59 INFO      Info message
# INFO: Info message


ronanstokes-db commented 1 year ago

I'll take a look at this and add a fix - thanks for your feedback

ronanstokes-db commented 1 year ago

Overall, the current behavior is that when the package is loaded, it issues a logging message using the globally configured (root) logger.

The code sample above changes the format for the logger produced by invoking getLogger(__name__) inside a notebook. This does not change the format for the globally configured root logger.

The code inside the _version.py file does not change the settings for the logger (unless the module-level logging.info call implicitly changes the settings).
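For reference, the module-level logging functions do implicitly configure the root logger on first use. A small standalone sketch of this standard-library behavior, assuming a fresh Python interpreter rather than a notebook (where the root logger may already have handlers):

import logging

root = logging.getLogger()
print(root.handlers)        # [] - no handlers configured yet

logging.info("version message")  # module-level call: implicitly runs logging.basicConfig()

print(root.handlers)        # now contains a default StreamHandler, even though the
                            # message itself was not emitted (root level defaults to WARNING)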

MarvinSchenkel commented 1 year ago

I should have been clearer. It indeed does not change the settings for the logger, but it adds an extra handler, causing the second info message to be logged twice (albeit without the datetime format, since it uses the default logging settings).

The code sample above mimics the way we use dbldatagen. We basically set up a global logger, then import dbldatagen, which adds a second log handler to the root logger. This causes all our globally logged messages to be logged twice.

ronanstokes-db commented 1 year ago

I'll take a look at what PySpark does and follow the same approach. If you are looking for different behavior than that, feel free to clarify in the comments.

MarvinSchenkel commented 1 year ago

> I'll take a look at what PySpark does and follow the same approach. If you are looking for different behavior than that, feel free to clarify in the comments.

As far as I can see, you are trying to do the same thing as PySpark does, and this is what I would expect. However, the logging calls inside _version.py just 'forget' to call getLogger(__name__) first.
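A minimal sketch of the pattern being suggested here (an illustration of the suggestion, not the contents of the eventual fix):

import logging

# Log through a module-scoped logger instead of the root logger; this leaves
# the root logger's handlers untouched, so the caller's configuration is respected.
logger = logging.getLogger(__name__)
logger.info("Version : %s", ("0", "3", "1"))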

ronanstokes-db commented 1 year ago

Fixed in PR

MarvinSchenkel commented 1 year ago

Hey @ronanstokes-db, I tested the new release and found two more minor things. I created PR #205 for this; it fixes the problem posted in the OP of this issue. Hope this helps!

ronanstokes-db commented 1 year ago

Thank you very much for your contribution

ronanstokes-db commented 1 year ago

Will be released as "release/v0.3.4post1"