databrickslabs / dbldatagen

Generate relevant synthetic data quickly for your projects. The Databricks Labs synthetic data generator (aka `dbldatagen`) may be used to generate large simulated / synthetic data sets for test, POCs, and other uses in Databricks environments including in Delta Live Tables pipelines
https://databrickslabs.github.io/dbldatagen

Problem with column which is named "ID" #107

Closed SemyonSinchenko closed 1 year ago

SemyonSinchenko commented 2 years ago

Expected Behavior

Generation of column with name "ID".

Current Behavior

Exception: AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.

Steps to Reproduce (for bugs)

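import dbldatagen as dg
from pyspark.sql import SparkSession, types as T

# Bind the session to a name so it can be passed to the generator
spark = SparkSession.builder.getOrCreate()

# Reported to fail with AnalysisException: Reference 'ID' is ambiguous
dg.DataGenerator(spark, rows=100, partitions=2).withColumn("ID", T.StringType()).build()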

Context

I see that the problem is here. It can be worked around by renaming my column "ID" to "ID_" before generation and renaming it back afterwards, but that looks a little hacky for production. Why not use a less common name for the internal ID column, such as datagen__technical__inner__id?
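The rename workaround described above could be sketched roughly as follows. This is only an illustration: `alias_reserved`, its `reserved` default, and the `_` suffix are hypothetical names, not part of dbldatagen, and the sketch assumes the clash is case-insensitive (Spark resolves column names case-insensitively by default). Whether the internal column also has to be dropped before renaming back is not confirmed here.

```python
def alias_reserved(names, reserved=("id",), suffix="_"):
    """Replace reserved column names with temporary aliases.

    Returns the list of safe names plus a map alias -> original name,
    which can be used to rename the columns back after generation.
    """
    taken = {n.lower() for n in names}
    safe, restore = [], {}
    for name in names:
        if name.lower() in reserved:
            alias = name + suffix
            # Keep appending the suffix until the alias is unique
            while alias.lower() in taken:
                alias += suffix
            taken.add(alias.lower())
            restore[alias] = name
            safe.append(alias)
        else:
            safe.append(name)
    return safe, restore


safe, restore = alias_reserved(["ID", "name"])
# Generate the data with the names in `safe`, then rename back, e.g.:
#   for alias, original in restore.items():
#       df = df.withColumnRenamed(alias, original)
```

The helper itself is plain Python, so it can be checked without a Spark session; only the commented-out rename step touches a DataFrame.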

ronanstokes-db commented 2 years ago

ID and id are reserved column names. We can look at making this configurable if needed, but in the current release they are reserved for system use.

SemyonSinchenko commented 2 years ago

Best to make it configurable, or at least raise an exception at the point where the column is added. Getting AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID. is really not obvious, and ID is a very commonly used name. We have such a column in every GDWH table, for example. Thank you!
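The early check suggested above might look something like this minimal sketch. The names `RESERVED_COLUMN_NAMES`, `ReservedColumnNameError`, and `validate_column_name` are hypothetical and are not dbldatagen APIs; the reserved set is an assumption based on the names mentioned in this thread, compared case-insensitively.

```python
# Assumption: only "id" (in any letter case) is reserved
RESERVED_COLUMN_NAMES = frozenset({"id"})


class ReservedColumnNameError(ValueError):
    """Raised when a user-defined column uses a reserved name."""


def validate_column_name(name, reserved=RESERVED_COLUMN_NAMES):
    """Return the name unchanged, or raise if it clashes with a reserved name."""
    if name.lower() in reserved:
        raise ReservedColumnNameError(
            f"Column name {name!r} is reserved for internal use; "
            "choose a different name"
        )
    return name


validate_column_name("customer_code")  # passes silently
# validate_column_name("ID")  # would raise ReservedColumnNameError
```

Calling such a check before each column is defined would surface the problem at definition time with a clear message, instead of as an AnalysisException at build time.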

ronanstokes-db commented 2 years ago

Will add a fix in two phases