Azure / Azure-DataFactory

Data Copy Activity does not honor defined schemas #254

Open winggndm opened 3 years ago

winggndm commented 3 years ago

There seems to be a bug in the Data Factory Copy activity. When we use partition discovery, the columns that come from the partitions do not follow the types defined in the source or sink; they are forced to be UTF8 strings. Data flows, on the other hand, do honor the type definitions that we specify in the source and destination. As a result, Parquet files generated by the Copy activity and by data flows are incompatible because their schemas differ, and as far as I can tell there is no way to enforce the schema in the Copy activity.

For example, on the source and sink we define a column Column1 of type INT32. This column comes from partition discovery, so the folder structure contains Column1=#. When we hit import schema in the Copy activity for the conversion, it imports the column as String -> UTF8, which leads to the Parquet file having a string type for that column. We manually edited the conversion settings to try String -> INT32, and the Parquet file still had String as the column type. We then tried INT32 -> INT32, and it was still a string. There is no way for us to enforce the right schema in the destination Parquet file.
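To illustrate why a partition column arrives as text, here is a minimal Python sketch of Hive-style partition discovery. It is not the Copy activity's actual implementation; the function name `parse_partition_values` and the example path are hypothetical and chosen only for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict


def parse_partition_values(path: str) -> Dict[str, str]:
    """Collect key=value folder segments from a Hive-style partitioned path."""
    values: Dict[str, str] = {}
    # Only folder names are inspected; the file name itself is skipped.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            # The value is taken from a folder name, so it is always a str.
            values[key] = value
    return values


# Hypothetical path following the Column1=# layout described above.
partitions = parse_partition_values("sales/Column1=7/Region=EU/part-0000.parquet")
print(partitions)  # {'Column1': '7', 'Region': 'EU'}
```

Note that `Column1` comes back as the string `'7'`, not the integer `7`: a folder name carries no type information, so any INT32 typing has to be applied by a later step.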

fhljys commented 3 years ago

Do you have an activity id? Are you still hitting this?

refex commented 2 years ago

@fhljys I'm experiencing something similar: I have a copy activity with an MSSQL source dataset and an Azure Data Lake Storage Gen2 Avro dataset as sink. The mapping is explicit, and there are 5 fields that are mapped as datetime to DateTime inside the copy activity. Nevertheless, the Avro file does not contain any DateTime; these fields are defined as String | null in the Avro schema.

Activity run id: 52f5d893-0480-4ee4-b3b6-8103941a8aeb
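For readers less familiar with Avro, the difference being reported can be sketched as two field definitions. The field name `CreatedAt` is hypothetical, and the "expected" schema is only one plausible encoding of a nullable timestamp (the Avro specification models timestamps as a logical type annotating `long`); it is not taken from the actual output file.

```python
from typing import Any

# What the comment reports seeing: a nullable string ("String | null").
observed_field = {"name": "CreatedAt", "type": ["string", "null"]}

# One plausible encoding of what was expected: a nullable timestamp
# expressed as an Avro logical type on top of long.
expected_field = {
    "name": "CreatedAt",
    "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}],
}


def has_timestamp_logical_type(field_type: Any) -> bool:
    """Return True if any branch of the field type is a timestamp logical type."""
    # A union is a list of branches; a non-union type is treated as one branch.
    branches = field_type if isinstance(field_type, list) else [field_type]
    return any(
        isinstance(branch, dict) and "timestamp" in str(branch.get("logicalType", ""))
        for branch in branches
    )


print(has_timestamp_logical_type(observed_field["type"]))  # False
print(has_timestamp_logical_type(expected_field["type"]))  # True
```

A check like this can be run against the schema of a written file to confirm whether any mapped datetime field actually ended up with a timestamp type.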

Alessandro91-dev commented 2 years ago

Having the same "issue" with Parquet files and the ADF Copy Data task. For CSV it's working fine, but not for Parquet. https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-schema-and-type-mapping#data-type-mapping

With Parquet, the auto-created table is not generated with the specified data types.

davidzhaoyue commented 2 years ago

Here are some clarifications.

  1. The type in the Copy JSON payload is only for UI display purposes. It is not a way to enforce type conversion at runtime. Copy activity doesn't have such a feature right now.
  2. It's by design that the values of partitioning columns are always treated as strings, because the values come from folder names, which are strings. Specifying their data types could be a new feature in the future.
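Given the two points above, any typing of partition columns has to happen outside the Copy activity, for example in a downstream step. The following Python sketch shows only the casting logic such a step would need; it is not an ADF feature, and `TARGET_TYPES` and `cast_partition_values` are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical target schema: partition column name -> converter.
# Column1 is the INT32 column from the original report.
TARGET_TYPES: Dict[str, Callable[[str], Any]] = {"Column1": int}


def cast_partition_values(
    raw: Dict[str, str],
    target_types: Dict[str, Callable[[str], Any]],
) -> Dict[str, Any]:
    """Cast string partition values to their intended types."""
    typed: Dict[str, Any] = {}
    for name, value in raw.items():
        # Columns without a declared type stay as strings.
        caster = target_types.get(name, str)
        try:
            typed[name] = caster(value)
        except ValueError as exc:
            raise ValueError(
                f"partition column {name!r} has value {value!r} that cannot be cast"
            ) from exc
    return typed


typed = cast_partition_values({"Column1": "7", "Region": "EU"}, TARGET_TYPES)
print(typed)  # {'Column1': 7, 'Region': 'EU'}
```

Failing loudly on a value that cannot be cast is a deliberate choice here: a folder named `Column1=abc` would otherwise silently produce a file whose schema disagrees with the one written by data flows.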