apache / iceberg-python

Apache PyIceberg
https://py.iceberg.apache.org/
Apache License 2.0
315 stars 116 forks source link

ValueError: Mismatch in fields: ? #674

Open djouallah opened 2 months ago

djouallah commented 2 months ago

Question

I am just appending to the same table using the same data,

 try:
    catalog.create_table("aemo.scada",schema=df.schema)
  except:
    catalog.load_table("aemo.scada").append(df)

but I am getting this error

image
Fokko commented 2 months ago

@djouallah If you scroll below, can you see any mismatch fields? Would there be a way to reproduce this?

djouallah commented 2 months ago

will try to reproduce it, but you need an account in R2 or S3, it is just a POC, basically load data using the SQL catalog, then save the sqlite DB in R2, then when I reload the notebook I download the db again

https://colab.research.google.com/drive/10WMnPY1mE9GHm1Zp2wm9PsEGaoCUYjr3?usp=sharing

kevinjqliu commented 2 months ago

Ah, this is because table_schema.as_struct() and task_schema.as_struct() return StructType.fields with the same elements but in different order...

print(table_schema.as_struct().fields)
print(task_schema.as_struct().fields)

(NestedField(field_id=1, name='REGIONID', field_type=StringType(), required=False), NestedField(field_id=2, name='filename', field_type=StringType(), required=False), NestedField(field_id=3, name='UNIT', field_type=StringType(), required=False), NestedField(field_id=4, name='RAISEREGLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=5, name='RAISEREGAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=6, name='APCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=7, name='LOWER60SECROP', field_type=DoubleType(), required=False), NestedField(field_id=8, name='RAISE60SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=9, name='RAISE60SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=10, name='AVAILABLELOAD', field_type=DoubleType(), required=False), NestedField(field_id=11, name='RAISE6SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=12, name='RAISE60SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=13, name='VERSION', field_type=DoubleType(), required=False), NestedField(field_id=14, name='TOTALDEMAND', field_type=DoubleType(), required=False), NestedField(field_id=15, name='RAISE60SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=16, name='LOWER5MINIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=17, name='LOWER5MINRRP', field_type=DoubleType(), required=False), NestedField(field_id=18, name='LOWERREGRRP', field_type=DoubleType(), required=False), NestedField(field_id=19, name='RAISE6SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=20, name='RAISE6SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=21, name='LOWER5MINSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=22, name='RAISE5MINREQ', field_type=DoubleType(), required=False), NestedField(field_id=23, name='LOWER60SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=24, name='INITIALSUPPLY', field_type=DoubleType(), required=False), NestedField(field_id=25, name='RAISEREGVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=26, name='DISPATCHABLEGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=27, name='LOWER60SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=28, name='MARKETSUSPENDEDFLAG', field_type=DoubleType(), required=False), NestedField(field_id=29, name='RAISE6SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=30, name='RAISE60SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=31, name='LOWER60SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=32, name='RAISEREGRRP', field_type=DoubleType(), required=False), NestedField(field_id=33, name='RAISE5MINLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=34, name='RAISE5MINLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=35, name='RAISE6SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=36, name='RAISEREGLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=37, name='LOWERREGAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=38, name='LOWER5MINLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=39, name='LOWERREGLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=40, name='LOWER6SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=41, name='LOWER60SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=42, name='LOWERREGLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=43, name='RAISE60SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=44, name='RAISE5MINLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=45, name='RAISE6SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=46, name='RAISE5MINIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=47, name='RAISE60SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=48, name='LRCSURPLUS', field_type=DoubleType(), required=False), NestedField(field_id=49, name='LOWER5MINLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=50, name='CLEAREDSUPPLY', field_type=DoubleType(), required=False), NestedField(field_id=51, name='RAISEREGLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=52, name='LOWER60SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=53, name='LOWER5MINLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=54, name='LOWER6SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=55, name='LOWER6SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=56, name='LOWER5MINVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=57, name='RRP', field_type=DoubleType(), required=False), NestedField(field_id=58, name='RAISE5MINPRICE', field_type=DoubleType(), required=False), NestedField(field_id=59, name='RAISE6SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=60, name='LOWERREGROP', field_type=DoubleType(), required=False), NestedField(field_id=61, name='RAISE60SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=62, name='RAISEREGROP', field_type=DoubleType(), required=False), NestedField(field_id=63, name='RAISE5MINACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=64, name='RAISEREGREQ', field_type=DoubleType(), required=False), NestedField(field_id=65, name='RAISE60SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=66, name='INTERVENTION', field_type=DoubleType(), required=False), NestedField(field_id=67, name='RUNNO', field_type=DoubleType(), required=False), NestedField(field_id=68, name='LOWER6SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=69, name='LOWER60SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=70, name='RAISEREGACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=71, name='LOWER60SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=72, name='LOWER5MINREQ', field_type=DoubleType(), required=False), NestedField(field_id=73, name='LOWER6SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=74, name='RAISE5MINDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=75, name='RAISE60SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=76, name='EEP', field_type=DoubleType(), required=False), NestedField(field_id=77, name='LOWER6SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=78, name='LOWER6SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=79, name='DEMANDFORECAST', field_type=DoubleType(), required=False), NestedField(field_id=80, name='LOWER5MINDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=81, name='RAISE6SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=82, name='LOWER6SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=83, name='AGGREGATEDISPATCHERROR', field_type=DoubleType(), required=False), NestedField(field_id=84, name='LOWER6SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=85, name='RAISE6SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=86, name='LOWER5MINAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=87, name='LOWERREGACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=88, name='LOWER6SECROP', field_type=DoubleType(), required=False), NestedField(field_id=89, name='RAISE6SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=90, name='LOWERREGLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=91, name='LOWER5MINLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=92, name='EXCESSGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=93, name='RAISEREGIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=94, name='LOWER5MINROP', field_type=DoubleType(), required=False), NestedField(field_id=95, name='LOWERREGVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=96, name='LORSURPLUS', field_type=DoubleType(), required=False), NestedField(field_id=97, name='RAISE60SECROP', field_type=DoubleType(), required=False), NestedField(field_id=98, name='RAISE6SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=99, name='LOWER6SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=100, name='RAISE5MINROP', field_type=DoubleType(), required=False), NestedField(field_id=101, name='RAISE5MINAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=102, name='LOWER60SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=103, name='LOWER60SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=104, name='RAISE6SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=105, name='RAISE5MINSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=106, name='LOWER60SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=107, name='NETINTERCHANGE', field_type=DoubleType(), required=False), NestedField(field_id=108, name='LOWER60SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=109, name='DISPATCHABLELOAD', field_type=DoubleType(), required=False), NestedField(field_id=110, name='LOWER6SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=111, name='ROP', field_type=DoubleType(), required=False), NestedField(field_id=112, name='RAISE5MINRRP', field_type=DoubleType(), required=False), NestedField(field_id=113, name='LOWER6SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=114, name='RAISE60SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=115, name='AVAILABLEGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=116, name='RAISE60SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=117, name='LOWER60SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=118, name='RAISE60SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=119, name='LOWERREGREQ', field_type=DoubleType(), required=False), NestedField(field_id=120, name='RAISE6SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=121, name='RAISE5MINVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=122, name='LOWER60SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=123, name='LOWER6SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=124, name='LOWER5MINACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=125, name='LOWER5MINPRICE', field_type=DoubleType(), required=False), NestedField(field_id=126, name='LOWERREGIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=127, name='RAISE6SECROP', field_type=DoubleType(), required=False), NestedField(field_id=128, name='RAISE5MINLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=129, name='SETTLEMENTDATE', field_type=TimestampType(), required=False), NestedField(field_id=130, name='file', field_type=StringType(), required=False), NestedField(field_id=131, name='YEAR', field_type=LongType(), required=False))
(NestedField(field_id=1, name='REGIONID', field_type=StringType(), required=False), NestedField(field_id=2, name='filename', field_type=StringType(), required=False), NestedField(field_id=3, name='UNIT', field_type=StringType(), required=False), NestedField(field_id=79, name='DEMANDFORECAST', field_type=DoubleType(), required=False), NestedField(field_id=33, name='RAISE5MINLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=80, name='LOWER5MINDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=36, name='RAISEREGLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=82, name='LOWER6SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=25, name='RAISEREGVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=28, name='MARKETSUSPENDEDFLAG', field_type=DoubleType(), required=False), NestedField(field_id=119, name='LOWERREGREQ', field_type=DoubleType(), required=False), NestedField(field_id=115, name='AVAILABLEGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=75, name='RAISE60SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=5, name='RAISEREGAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=110, name='LOWER6SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=23, name='LOWER60SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=85, name='RAISE6SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=76, name='EEP', field_type=DoubleType(), required=False), NestedField(field_id=67, name='RUNNO', field_type=DoubleType(), required=False), NestedField(field_id=121, name='RAISE5MINVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=83, name='AGGREGATEDISPATCHERROR', field_type=DoubleType(), required=False), NestedField(field_id=72, name='LOWER5MINREQ', field_type=DoubleType(), required=False), NestedField(field_id=99, name='LOWER6SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=56, name='LOWER5MINVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=9, name='RAISE60SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=87, name='LOWERREGACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=52, name='LOWER60SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=111, name='ROP', field_type=DoubleType(), required=False), NestedField(field_id=19, name='RAISE6SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=107, name='NETINTERCHANGE', field_type=DoubleType(), required=False), NestedField(field_id=39, name='LOWERREGLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=102, name='LOWER60SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=95, name='LOWERREGVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=53, name='LOWER5MINLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=8, name='RAISE60SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=63, name='RAISE5MINACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=112, name='RAISE5MINRRP', field_type=DoubleType(), required=False), NestedField(field_id=120, name='RAISE6SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=84, name='LOWER6SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=92, name='EXCESSGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=61, name='RAISE60SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=58, name='RAISE5MINPRICE', field_type=DoubleType(), required=False), NestedField(field_id=54, name='LOWER6SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=22, name='RAISE5MINREQ', field_type=DoubleType(), required=False), NestedField(field_id=29, name='RAISE6SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=4, name='RAISEREGLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=128, name='RAISE5MINLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=101, name='RAISE5MINAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=106, name='LOWER60SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=13, name='VERSION', field_type=DoubleType(), required=False), NestedField(field_id=96, name='LORSURPLUS', field_type=DoubleType(), required=False), NestedField(field_id=86, name='LOWER5MINAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=30, name='RAISE60SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=70, name='RAISEREGACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=97, name='RAISE60SECROP', field_type=DoubleType(), required=False), NestedField(field_id=114, name='RAISE60SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=126, name='LOWERREGIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=108, name='LOWER60SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=42, name='LOWERREGLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=74, name='RAISE5MINDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=94, name='LOWER5MINROP', field_type=DoubleType(), required=False), NestedField(field_id=90, name='LOWERREGLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=46, name='RAISE5MINIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=15, name='RAISE60SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=122, name='LOWER60SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=118, name='RAISE60SECLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=88, name='LOWER6SECROP', field_type=DoubleType(), required=False), NestedField(field_id=48, name='LRCSURPLUS', field_type=DoubleType(), required=False), NestedField(field_id=6, name='APCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=38, name='LOWER5MINLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=89, name='RAISE6SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=50, name='CLEAREDSUPPLY', field_type=DoubleType(), required=False), NestedField(field_id=11, name='RAISE6SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=21, name='LOWER5MINSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=124, name='LOWER5MINACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=62, name='RAISEREGROP', field_type=DoubleType(), required=False), NestedField(field_id=127, name='RAISE6SECROP', field_type=DoubleType(), required=False), NestedField(field_id=32, name='RAISEREGRRP', field_type=DoubleType(), required=False), NestedField(field_id=24, name='INITIALSUPPLY', field_type=DoubleType(), required=False), NestedField(field_id=123, name='LOWER6SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=81, name='RAISE6SECAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=34, name='RAISE5MINLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=60, name='LOWERREGROP', field_type=DoubleType(), required=False), NestedField(field_id=116, name='RAISE60SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=45, name='RAISE6SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=16, name='LOWER5MINIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=20, name='RAISE6SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=66, name='INTERVENTION', field_type=DoubleType(), required=False), NestedField(field_id=117, name='LOWER60SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=12, name='RAISE60SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=73, name='LOWER6SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=35, name='RAISE6SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=26, name='DISPATCHABLEGENERATION', field_type=DoubleType(), required=False), NestedField(field_id=27, name='LOWER60SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=55, name='LOWER6SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=57, name='RRP', field_type=DoubleType(), required=False), NestedField(field_id=69, name='LOWER60SECSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=68, name='LOWER6SECREQ', field_type=DoubleType(), required=False), NestedField(field_id=105, name='RAISE5MINSUPPLYPRICE', field_type=DoubleType(), required=False), NestedField(field_id=103, name='LOWER60SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=109, name='DISPATCHABLELOAD', field_type=DoubleType(), required=False), NestedField(field_id=104, name='RAISE6SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=41, name='LOWER60SECLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=78, name='LOWER6SECACTUALAVAILABILITY', field_type=DoubleType(), required=False), NestedField(field_id=91, name='LOWER5MINLOCALVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=31, name='LOWER60SECPRICE', field_type=DoubleType(), required=False), NestedField(field_id=77, name='LOWER6SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=100, name='RAISE5MINROP', field_type=DoubleType(), required=False), NestedField(field_id=7, name='LOWER60SECROP', field_type=DoubleType(), required=False), NestedField(field_id=10, name='AVAILABLELOAD', field_type=DoubleType(), required=False), NestedField(field_id=14, name='TOTALDEMAND', field_type=DoubleType(), required=False), NestedField(field_id=18, name='LOWERREGRRP', field_type=DoubleType(), required=False), NestedField(field_id=71, name='LOWER60SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=43, name='RAISE60SECDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=125, name='LOWER5MINPRICE', field_type=DoubleType(), required=False), NestedField(field_id=44, name='RAISE5MINLOCALPRICE', field_type=DoubleType(), required=False), NestedField(field_id=113, name='LOWER6SECIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=17, name='LOWER5MINRRP', field_type=DoubleType(), required=False), NestedField(field_id=47, name='RAISE60SECVIOLATION', field_type=DoubleType(), required=False), NestedField(field_id=51, name='RAISEREGLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=93, name='RAISEREGIMPORT', field_type=DoubleType(), required=False), NestedField(field_id=40, name='LOWER6SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=37, name='LOWERREGAPCFLAG', field_type=DoubleType(), required=False), NestedField(field_id=64, name='RAISEREGREQ', field_type=DoubleType(), required=False), NestedField(field_id=65, name='RAISE60SECLOCALREQ', field_type=DoubleType(), required=False), NestedField(field_id=59, name='RAISE6SECRRP', field_type=DoubleType(), required=False), NestedField(field_id=98, name='RAISE6SECLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=49, name='LOWER5MINLOCALDISPATCH', field_type=DoubleType(), required=False), NestedField(field_id=129, name='SETTLEMENTDATE', field_type=TimestampType(), required=False), NestedField(field_id=130, name='file', field_type=StringType(), required=False), NestedField(field_id=131, name='YEAR', field_type=LongType(), required=False))

Notice the NestedField(field_id=4 versus NestedField(field_id=79.

And since fields is defined with Tuple, the order matters! https://github.com/apache/iceberg-python/blob/7bd5d9e6c32bcc5b46993d6bfaeed50471e972ae/pyiceberg/schema.py#L88

kevinjqliu commented 2 months ago

In the "Price" cell, df.schema contains columns with different ordering than the Iceberg table schema.

This should be a PyIceberg bug, the ordering of fields in the schema should not matter.

kevinjqliu commented 2 months ago

Oh btw, a workaround before the fix is merged is to reorder the schemas to match manually :) Can sort by field_id, similar to the fix

djouallah commented 2 months ago

Thanks there is another case ? what if a user want to append only a subsets of columns, in Delta table, the system will just add nulls to the missing columns, is this something that iceberg support ?

kevinjqliu commented 2 months ago

Looks like we're able to write (append/overwrite) a subset of schema, here's a PR with a test https://github.com/apache/iceberg-python/pull/704

This feature is a side effect of #523

Fokko commented 1 month ago

Hey @kevinjqliu Thanks for following up here, however, I don't think that https://github.com/apache/iceberg-python/pull/700 is the appropriate fix. Instead, we should re-order the fields based on the table schema. I think we're quite close to that.

The to_requested_schema will do this for us: https://github.com/apache/iceberg-python/blob/31c6c23d428a3237589ebada2b4cd64bf37b1aef/pyiceberg/io/pyarrow.py#L1796

I think we should do two things:

WDYT?

kevinjqliu commented 2 weeks ago

@Fokko, coming back to this. I think your first comment is already addressed in #807 (thanks @syun64). Your second comment is implemented in #829. Please take a look!