aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.93k stars 701 forks source link

pyarrow.lib.ArrowInvalid: Schema at index 0 was different reading data via s3.read_parquet using the paramters chunked and columns #2433

Closed KaroPy closed 10 months ago

KaroPy commented 1 year ago

Describe the bug

Hello community,

I'm currently attempting to load Parquet files using awswrangler version 3.3.0 and the s3.read_parquet function, with the parameters chunked=True and specifying the specific columns I wish to load. While the initial reading of the data seems to work correctly, I encounter an error when attempting to concatenate the loaded chunks together. Interestingly, when I omit the columns parameter, the concatenation process functions without any issues.

To illustrate this problem and make it easier for the community to assist, I've provided a reproducible example using a public dataset. The second part results in the following error message:

Traceback (most recent call last):
  File "/home/karoline/Innkeepr/Git/prefect-flows-etl/error_reading_parquet.py", line 17, in <module>
    for df in df_selective:
  File "/home/karoline/.local/share/virtualenvs/prefect-flows-etl-94aQ17Of/lib/python3.10/site-packages/awswrangler/s3/_read_parquet.py", line 243, in _read_parquet_chunked
    table=pa.Table.from_batches(chunks, schema=pq_file.schema.to_arrow_schema()),
  File "pyarrow/table.pxi", line 3979, in pyarrow.lib.Table.from_batches
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Schema at index 0 was different: 
quadkey: string
tile: string
avg_d_kbps: int64
avg_u_kbps: int64
avg_lat_ms: int64
avg_lat_down_ms: int32
avg_lat_up_ms: int32
tests: int64
devices: int64
vs
avg_d_kbps: int64
avg_u_kbps: int64
avg_lat_ms: int64

How to Reproduce

import awswrangler as wr
import pandas as pd
print(f"awswrangler version: {wr.__version__}")

path = "s3://ookla-open-data/parquet/performance/type=fixed/year=2023/quarter=1/2023-01-01_performance_fixed_tiles.parquet"

# read parquet file in chunks
df_chunked = wr.s3.read_parquet(path,chunked=True)
df_concat = pd.DataFrame()
for df in df_chunked:
    df_concat = pd.concat([df_concat, df])
print(f"read all data = {df_concat.shape}")

# read parquet file with specific columns resulting in an error
df_selective = wr.s3.read_parquet(path,chunked=True,columns=["avg_d_kbps","avg_u_kbps","avg_lat_ms"])
df_combine = pd.DataFrame()
for df in df_selective:
    df_combine = pd.concat([df_combine, df])
print(f"read data using certain columns  = {df_combine.shape}")

Expected behavior

In Version 3.1 I can concatenate the chunks without any errors using the parameter columns.

Your project

No response

Screenshots

No response

OS

Linux

Python version

Python 3.10.12

AWS SDK for pandas version

2.0.3

Additional context

No response

kukushking commented 1 year ago

Hi @KaroPy what is the version of pyarrow in both cases? You can check the version via:

import pyarrow as pa

print(f"pyarrow version: {pa.__version__}")
KaroPy commented 1 year ago

Hey @kukushking , the pyarrow version is 12.0.1

SantiagoBrenes12 commented 1 year ago

Have been this addressed? I'm having the same issue.

MorganLu commented 1 year ago

Have the same issue with awswrangler-3.4.0 + pyarrow-13.0.0. But it works properly with awswrangler-3.2.1 + pyarrow-13.0.0. Is it possible that the root cause was because the PR here adding schema checks? https://github.com/aws/aws-sdk-pandas/pull/2400/files

mmilosav commented 1 year ago

Same issue with awswrangler-3.4.1 + pyarrow-12.0.1.

irowberryFS commented 11 months ago

Also getting the same issue with awswrangler-3.4.2 and pyarrow-14.0.1

rchromik commented 10 months ago

I have same issue with awswrangler-3.4.2 and pyarrow-14.0.2

Executing this line of code:

data_frames = wr.s3.read_parquet(path=transaction_parquet_s3_url, chunked=True, columns=row_builder.columns_to_load_from_avro)

causes this error:

C:\Users\rchromik\AppData\Local\Programs\Python\Python310\lib\site-packages\awswrangler\s3\_read_parquet.py:243: in _read_parquet_chunked
    table=pa.Table.from_batches(chunks, schema=pq_file.schema.to_arrow_schema()),
pyarrow\table.pxi:4057: in pyarrow.lib.Table.from_batches
    ???
pyarrow\error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   pyarrow.lib.ArrowInvalid: Schema at index 0 was different: 
E   id: int64
E   txn_0__month: int32
E   txn_1__quarter: int32
E   txn_2__year: int32
E   txn_3__soldtocustomerid: string
E   txn_4__vuser: string
E   txn_5__shiptocustomerid: string
E   txn_6__businessunitid: string
E   txn_7__productlineid: string
E   txn_8__productfamilyid: string
E   txn_9__productsubfamilyid: string
E   txn_10__vuomlabel: string
E   txn_11__stateprovinceid: string
E   txn_12__industry: string
E   txn_13__shipto: string
E   txn_14__salesorganizationid: string
E   txn_15__competitor: string
E   txn_16__customergroup: string
E   txn_17__customermaterialnumber: string
E   txn_18__customerpricegroup: string
E   txn_19__customerpricelisttype: string
E   txn_20__departurecountry: string
E   txn_21__destinationcountry: string
E   txn_22__distributorid: string
E   txn_23__distributionchannel: string
E   txn_24__division: string
E   txn_25__exchangeratetype: string
E   txn_26__incoterms: string
E   txn_27__materialpricinggroup: string
E   txn_28__paymenttermscode: string
E   txn_29__plant: string
E   txn_30__vpricingdate: date32[day]
E   txn_31__shippingconditions: string
E   txn_32__shippingmethod: string
E   txn_33__termsofpayment: string
E   txn_34__transactiontype: string
E   txn_35__transactiontypelabel: string
E   txn_36__week: int32
E   txn_37__vstartdate: date32[day]
E   txn_38__venddate: date32[day]
E   txn_39__vpricingcurrency: string
E   txn_40__marketsegmenttype: string
E   txn_41__productattr1: string
E   txn_42__productattr2: string
E   txn_43__productbaseuom: string
E   txn_44__productgrade: string
E   txn_45__producttype: string
E   txn_46__package: string
E   txn_47__segment: string
E   txn_48__productid: string
E   txn_49__salesgroupid: string
E   txn_50__custprofitabilityclassvalue: string
E   txn_51__custrevenueclassvalue: string
E   txn_52__custvaluequadrantclassvalue: string
E   txn_53__custbighatnocattleclassvalue: string
E   txn_54__custcherrypickerclassvalue: string
E   txn_55__prodlifecycleclassvalue: string
E   txn_56__prodvolumeclassvalue: string
E   txn_57__prodprofitabilityclassvalue: string
E   txn_58__prodrevenueclassvalue: string
E   txn_59__prodcompclassvalue: string
E   txn_60__prodcostclassvalue: string
E   txn_61__dealitemrecordvid: string
E   txn_62__dealid: string
E   txn_63__basedealid: string
E   txn_64__deallineid: string
E   txn_65__distributor: string
E   txn_66__truesupportedprice: double
E   txn_67__avgpocketmargin: double
E   txn_68__covarcustpriceperunita: double
E   txn_69__covarinvoicepriceperunita: double
E   txn_70__covarmarginperunita: double
E   txn_71__competitorprice: double
E   txn_72__customerdiscoffmarket: double
E   txn_73__listpriceyield: double
E   txn_74__diffofrevenueandprofit: double
E   txn_75__freightopportunityamount: double
E   txn_76__freightopportunityabsorption: double
E   txn_77__freightrecoveryratio: double
E   txn_78__grossmargin: double
E   txn_79__invoicepriceyield: double
E   txn_80__marketpriceyield: double
E   txn_81__pocketmarginyield: double
E   txn_82__pocketmarginpercentofpriceprotection: double
E   txn_83__productcount: double
E   txn_84__quantitypricecorrelation: double
E   txn_85__regionaldiscoffrefprice: double
E   txn_86__regionalpricerealization: double
E   txn_87__servicectsopportunityamount: double
E   txn_88__serviceabsorptionratio: double
E   txn_89__servicectsrecoveryratio: double
E   txn_90__standarddevpriceperunita: double
E   txn_91__stddevtruesupportedprice: double
E   txn_92__stddevpocketmarginunit: double
E   txn_93__stddevpocketprice: double
E   txn_94__targetprice: double
E   txn_95__targetpriceyield: double
E   txn_96__pocketmarginpercentofrevenue: double
E   txn_97__pocketmarginpercentofinvoiceprice: double
E   txn_98__pocketmarginpercentofnetprice: double
E   txn_99__totalcost: double
E   txn_100__totalcostandrebates: double
E   txn_101__causalityvariablecost: double
E   txn_102__cogs: double
E   txn_103__totalcts: double
E   txn_104__totalctsopportunity: double
E   txn_105__totalctsrecovered: double
E   txn_106__totalctsrecovery: double
E   txn_107__totaldiscountoffmarket: double
E   txn_108__totalrebates: double
E   txn_109__transactioncount: double
E   txn_110__covartruesupportedpriceperunita: double
E   txn_111__productweight: int64
E   txn_112__originalquantity: double
E   txn_113__dealtype: string
E   txn_114__dealtypelabel: string
E   txn_115__trafficlightthrfound: string
E   txn_116__trafficlighthighthrspecific: double
E   txn_117__trafficlightlowthrspecific: double
E   txn_118__trafficlighthighthrdefault: double
E   txn_119__trafficlightlowthrdefault: double
E   txn_120__trafficlighthighthr: double
E   txn_121__trafficlightlowthr: double
E   txn_122__wtrafficlighthighthr: double
E   txn_123__wtrafficlightlowthr: double
E   txn_124__trafficlighttransformedpocketmarginpct: double
E   txn_125__pptype: string
E   txn_126__ppdelayedrollingmonthsperiod: string
E   txn_127__ppfirmthroughperiod: string
E   txn_128__ppdelayedperiod: double
E   txn_129__pphas: string
E   txn_130__pphaslabel: string
E   txn_131__ppapplicablefor: string
E   txn_132__pchas: string
E   txn_133__pchaslabel: string
E   txn_134__pcperiod: string
E   txn_135__pcoverallperiod: string
E   txn_136__pppcbypass: string
E   txn_137__pppcbypasslabel: string
E   txn_138__dqa: string
E   txn_139__power: double
E   txn_140__segmentpower: double
E   txn_141__risk: double
E   txn_142__segmentrisk: double
E   txn_143__segmentvaliditytype: string
E   txn_144__marketdemand: double
E   txn_145__avgmarketdemand: double
E   txn_146__industryrisk: double
E   txn_147__avgindustryrisk: double
E   txn_148__segmentstrategy: string
E   txn_149__reversecost: double
E   txn_150__segmentdesc: string
E   txn_151__reverserisk: double
E   txn_152__segmentheat: double
E   txn_153__vcorrelated_deal_transactions: string
E   txn_154__vbasepricepercent: double
E   txn_155__vbasepriceperunit: double
E   txn_156__productvalueadjpercent: double
E   txn_157__productvalueadjperunit: double
E   txn_158__strategyadjpercent: double
E   txn_159__strategyadjperunit: double
E   txn_160__referencemanualoverridepercent: double
E   txn_161__referencemanualoverrideperunit: double
E   txn_162__vlistpricepercent: double
E   txn_163__vlistpriceperunit: double
E   txn_164__regionaladjpercent: double
E   txn_165__regionaladjperunit: double
E   txn_166__regionalmanualoverridepercent: double
E   txn_167__regionalmanualoverrideperunit: double
E   txn_168__regionalpricepercent: double
E   txn_169__regionalpriceperunit: double
E   txn_170__countryadjpercent: double
E   txn_171__countryadjperunit: double
E   txn_172__pricecontroladjpercent: double
E   txn_173__pricecontroladjperunit: double
E   txn_174__countrymanualoverridepercent: double
E   txn_175__countrymanualoverrideperunit: double
E   txn_176__countrypricepercent: double
E   txn_177__countrypriceperunit: double
E   txn_178__channeladjpercent: double
E   txn_179__channeladjperunit: double
E   txn_180__marketsegmentadjpercent: double
E   txn_181__marketsegmentadjperunit: double
E   txn_182__marketpricepercent: double
E   txn_183__marketpriceperunit: double
E   txn_184__negotiateddiscountpercent: double
E   txn_185__negotiateddiscountperunit: double
E   txn_186__customerpricepercent: double
E   txn_187__customerpriceperunit: double
E   txn_188__priceprotectioncapcostpercent: double
E   txn_189__priceprotectioncapcostperunit: double
E   txn_190__customernegotiatedpricepercent: double
E   txn_191__customernegotiatedpriceperunit: double
E   txn_192__volumediscountpercent: double
E   txn_193__volumediscountperunit: double
E   txn_194__promotionaldiscountpercent: double
E   txn_195__promotionaldiscountperunit: double
E   txn_196__freightchargepercent: double
E   txn_197__freightchargeperunit: double
E   txn_198__surchargespercent: double
E   txn_199__surchargesperunit: double
E   txn_200__servicechargespercent: double
E   txn_201__servicechargesperunit: double
E   txn_202__manualadjustmentspercent: double
E   txn_203__manualadjustmentsperunit: double
E   txn_204__vinvoicepricepercent: double
E   txn_205__vinvoicepriceperunit: double
E   txn_206__shipanddebitadjpercent: double
E   txn_207__shipanddebitadjperunit: double
E   txn_208__supportedpricepercent: double
E   txn_209__supportedpriceperunit: double
E   txn_210__distirebatepercent: double
E   txn_211__distirebateperunit: double
E   txn_212__endcustrebatespercent: double
E   txn_213__endcustrebatesperunit: double
E   txn_214__mdfcooppercent: double
E   txn_215__mdfcoopperunit: double
E   txn_216__netpricepercent: double
E   txn_217__netpriceperunit: double
E   txn_218__paymentcostpercent: double
E   txn_219__paymentcostperunit: double
E   txn_220__freightcostpercent: double
E   txn_221__freightcostperunit: double
E   txn_222__servicecostpercent: double
E   txn_223__servicecostperunit: double
E   txn_224__vpocketpricepercent: double
E   txn_225__vpocketpriceperunit: double
E   txn_226__variablecogspercent: double
E   txn_227__variablecogsperunit: double
E   txn_228__variablepocketmarginpercent: double
E   txn_229__variablepocketmarginperunit: double
E   txn_230__fixedmfgcostpercent: double
E   txn_231__fixedmfgcostperunit: double
E   txn_232__vpocketmarginpercent: double
E   txn_233__vpocketmarginperunit: double
E   txn_234__truesupportedpricepercent: double
E   txn_235__truesupportedpriceperunit: double
E   txn_236__competitorpricepercent: double
E   txn_237__competitorpriceperunit: double
E   txn_238__targetpricepercent: double
E   txn_239__targetpriceperunit: double
E   txn_240__totalcostandrebatespercent: double
E   txn_241__totalcostandrebatesperunit: double
E   txn_242__causalityvariablecostpercent: double
E   txn_243__causalityvariablecostperunit: double
E   txn_244__reversecostpercent: double
E   txn_245__reversecostperunit: double
E   txn_246__vbaseprice: double
E   txn_247__productvalueadj: double
E   txn_248__strategyadj: double
E   txn_249__referencemanualoverride: double
E   txn_250__vlistprice: double
E   txn_251__regionaladj: double
E   txn_252__regionalmanualoverride: double
E   txn_253__regionalprice: double
E   txn_254__countryadj: double
E   txn_255__pricecontroladj: double
E   txn_256__countrymanualoverride: double
E   txn_257__countryprice: double
E   txn_258__channeladj: double
E   txn_259__marketsegmentadj: double
E   txn_260__marketprice: double
E   txn_261__negotiateddiscount: double
E   txn_262__customerprice: double
E   txn_263__priceprotectioncapcost: double
E   txn_264__customernegotiatedprice: double
E   txn_265__volumediscount: double
E   txn_266__promotionaldiscount: double
E   txn_267__freightcharge: double
E   txn_268__surcharges: double
E   txn_269__servicecharges: double
E   txn_270__manualadjustments: double
E   txn_271__vinvoiceprice: double
E   txn_272__shipanddebitadj: double
E   txn_273__supportedprice: double
E   txn_274__distirebate: double
E   txn_275__endcustrebates: double
E   txn_276__mdfcoop: double
E   txn_277__netprice: double
E   txn_278__paymentcost: double
E   txn_279__freightcost: double
E   txn_280__servicecost: double
E   txn_281__vpocketprice: double
E   txn_282__variablecogs: double
E   txn_283__variablepocketmargin: double
E   txn_284__fixedmfgcost: double
E   txn_285__vpocketmargin: double
E   txn_286__productlevel1: string
E   txn_287__productlevel2: string
E   txn_288__productlevel3: string
E   txn_289__productlevel4: string
E   txn_290__product: string
E   txn_291__customerlevel1: string
E   txn_292__customerlevel2: string
E   txn_293__soldto: string
E   txn_294__saleslevel1: string
E   txn_295__saleslevel2: string
E   txn_296__salesorg: string
E   txn_297__saleslevel4: string
E   txn_298__saleslevel5: string
E   txn_299__geographylevel1: string
E   txn_300__geographylevel2: string
E   txn_301__geographylevel3: string
E   txn_302__geographylevel4: string
E   txn_303__geographylevel5: string
E   txn_304__transactiondate: date32[day]
E   txn_305__uom: string
E   txn_306__transactionid: string
E   txn_307__quantity: double
E   vs
E   txn_299__geographylevel1: string
E   txn_305__uom: string
E   txn_302__geographylevel4: string
E   txn_275__endcustrebates: double
E   txn_303__geographylevel5: string
E   txn_290__product: string
E   txn_107__totaldiscountoffmarket: double
E   txn_282__variablecogs: double
E   txn_278__paymentcost: double
E   txn_271__vinvoiceprice: double
E   txn_296__salesorg: string
E   txn_276__mdfcoop: double
E   txn_279__freightcost: double
E   txn_301__geographylevel3: string
E   txn_300__geographylevel2: string
E   txn_39__vpricingcurrency: string
E   txn_294__saleslevel1: string
E   txn_272__shipanddebitadj: double
E   txn_284__fixedmfgcost: double
E   txn_304__transactiondate: date32[day]
E   id: int64
E   txn_286__productlevel1: string
E   txn_274__distirebate: double
E   txn_307__quantity: double
E   txn_280__servicecost: double
E   txn_287__productlevel2: string
E   txn_295__saleslevel2: string