Problem with column which is named "ID"

SemyonSinchenko commented 2 years ago

Expected Behavior

Generation of column with name "ID".

Current Behavior

Exception: AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID.

Steps to Reproduce (for bugs)

import dbldatagen as dg
from pyspark.sql import SparkSession
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

dg.DataGenerator(spark, rows=100, partitions=2).withColumn("ID", T.StringType()).build()

Context

I see that the problem is here. It can be worked around by renaming my column "ID" to "ID_" before generation and renaming it back afterwards, but that looks a little creepy for production... Why not use a less common name for the internal ID column? Something like datagen__technical__inner__id, for example?
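The rename workaround described above could be sketched roughly as follows. The helpers `safe_name` and `rename_back` are hypothetical names made up for illustration and are not part of dbldatagen. The sketch assumes the clash is with an internal column called `id`, matched case-insensitively (Spark's default resolution), and that the built DataFrame no longer exposes that internal column.

```python
# Hypothetical helpers, not part of dbldatagen.
# Assumption: the internal column is named "id" and clashes case-insensitively.
RESERVED = {"id"}


def safe_name(name, suffix="_"):
    """Return a placeholder for a clashing column name, else the name unchanged."""
    return name + suffix if name.lower() in RESERVED else name


def rename_back(df, wanted_names, suffix="_"):
    """Rename placeholder columns on a built DataFrame back to the wanted names."""
    for name in wanted_names:
        placeholder = safe_name(name, suffix)
        if placeholder != name:
            # withColumnRenamed is the standard PySpark DataFrame method
            df = df.withColumnRenamed(placeholder, name)
    return df


# Intended usage with the reproduction above (not executed here):
# spec = dg.DataGenerator(spark, rows=100, partitions=2).withColumn(safe_name("ID"), T.StringType())
# df = rename_back(spec.build(), ["ID"])
```

This keeps the placeholder logic in one place, but it is still a workaround rather than a fix for the name clash itself.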

Your Environment

dbldatagen version used: 0.2.0rc1
Databricks Runtime version: 10.4 LTS
Cloud environment used: AWS

ronanstokes-db commented 2 years ago

ID and id are reserved column names. We can look at making this configurable if needed, but in the current release they are reserved for system use.

SemyonSinchenko commented 2 years ago

It would be best to make it configurable, or at least to raise an exception when the column is added, because getting AnalysisException: Reference 'ID' is ambiguous, could be: ID, ID. is really not obvious. ID is a very commonly used name; we have such a column in every GDWH table, for example. Thank you!
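The early check requested above might look something like this. It is only a sketch of the idea, not dbldatagen code: `validate_column_name` and `RESERVED_COLUMN_NAMES` are hypothetical names, and the reserved set is an assumption based on the maintainer's statement that ID and id are reserved.

```python
# Hypothetical validation helper, not part of dbldatagen.
# Assumption: "id" in any letter case is the only reserved name.
RESERVED_COLUMN_NAMES = frozenset({"id"})


def validate_column_name(name: str) -> str:
    """Fail fast with a clear message instead of a late AnalysisException."""
    if name.lower() in RESERVED_COLUMN_NAMES:
        raise ValueError(
            f"Column name {name!r} is reserved for internal use; choose another name."
        )
    return name
```

Calling such a check at the point where a column is added would surface the problem before any Spark query is planned.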

ronanstokes-db commented 2 years ago

Will add a fix in two phases:

Phase 1: warn when a column named id is added
Phase 2: allow renaming of the seed column
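The two phases could be pictured with a toy class like the one below. This is an illustrative sketch only and does not reflect how dbldatagen implements either phase: `GeneratorSpecSketch`, its `seed_column_name` parameter, and `with_column` are all hypothetical.

```python
import warnings


class GeneratorSpecSketch:
    """Toy stand-in for a generator spec; hypothetical, not dbldatagen's API."""

    def __init__(self, seed_column_name="id"):
        # Phase 2 idea: the internal seed column name is configurable
        self.seed_column_name = seed_column_name
        self.columns = []

    def with_column(self, name):
        # Phase 1 idea: warn when a user column collides with the seed column
        if name.lower() == self.seed_column_name.lower():
            warnings.warn(
                f"Column {name!r} clashes with the seed column "
                f"{self.seed_column_name!r}",
                UserWarning,
                stacklevel=2,
            )
        self.columns.append(name)
        return self
```

With the default seed name, adding "ID" would trigger the warning; constructing the sketch with a different seed name, for example `GeneratorSpecSketch(seed_column_name="_seed_id")`, would let "ID" be added without a clash.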

databrickslabs / dbldatagen