Closed KaroPy closed 10 months ago
Hi @KaroPy what is the version of pyarrow in both cases? You can check the version via:
import pyarrow as pa
print(f"pyarrow version: {pa.__version__}")
Hey @kukushking , the pyarrow version is 12.0.1
Have been this addressed? I'm having the same issue.
Have the same issue with awswrangler-3.4.0 + pyarrow-13.0.0. But it works properly with awswrangler-3.2.1 + pyarrow-13.0.0. Is it possible that the root cause was because the PR here adding schema checks? https://github.com/aws/aws-sdk-pandas/pull/2400/files
Same issue with awswrangler-3.4.1 + pyarrow-12.0.1.
Also getting the same issue with awswrangler-3.4.2 and pyarrow-14.0.1
I have same issue with awswrangler-3.4.2 and pyarrow-14.0.2
Executing this line of code:
data_frames = wr.s3.read_parquet(path=transaction_parquet_s3_url, chunked=True, columns=row_builder.columns_to_load_from_avro)
causes this error:
C:\Users\rchromik\AppData\Local\Programs\Python\Python310\lib\site-packages\awswrangler\s3\_read_parquet.py:243: in _read_parquet_chunked
table=pa.Table.from_batches(chunks, schema=pq_file.schema.to_arrow_schema()),
pyarrow\table.pxi:4057: in pyarrow.lib.Table.from_batches
???
pyarrow\error.pxi:154: in pyarrow.lib.pyarrow_internal_check_status
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> ???
E pyarrow.lib.ArrowInvalid: Schema at index 0 was different:
E id: int64
E txn_0__month: int32
E txn_1__quarter: int32
E txn_2__year: int32
E txn_3__soldtocustomerid: string
E txn_4__vuser: string
E txn_5__shiptocustomerid: string
E txn_6__businessunitid: string
E txn_7__productlineid: string
E txn_8__productfamilyid: string
E txn_9__productsubfamilyid: string
E txn_10__vuomlabel: string
E txn_11__stateprovinceid: string
E txn_12__industry: string
E txn_13__shipto: string
E txn_14__salesorganizationid: string
E txn_15__competitor: string
E txn_16__customergroup: string
E txn_17__customermaterialnumber: string
E txn_18__customerpricegroup: string
E txn_19__customerpricelisttype: string
E txn_20__departurecountry: string
E txn_21__destinationcountry: string
E txn_22__distributorid: string
E txn_23__distributionchannel: string
E txn_24__division: string
E txn_25__exchangeratetype: string
E txn_26__incoterms: string
E txn_27__materialpricinggroup: string
E txn_28__paymenttermscode: string
E txn_29__plant: string
E txn_30__vpricingdate: date32[day]
E txn_31__shippingconditions: string
E txn_32__shippingmethod: string
E txn_33__termsofpayment: string
E txn_34__transactiontype: string
E txn_35__transactiontypelabel: string
E txn_36__week: int32
E txn_37__vstartdate: date32[day]
E txn_38__venddate: date32[day]
E txn_39__vpricingcurrency: string
E txn_40__marketsegmenttype: string
E txn_41__productattr1: string
E txn_42__productattr2: string
E txn_43__productbaseuom: string
E txn_44__productgrade: string
E txn_45__producttype: string
E txn_46__package: string
E txn_47__segment: string
E txn_48__productid: string
E txn_49__salesgroupid: string
E txn_50__custprofitabilityclassvalue: string
E txn_51__custrevenueclassvalue: string
E txn_52__custvaluequadrantclassvalue: string
E txn_53__custbighatnocattleclassvalue: string
E txn_54__custcherrypickerclassvalue: string
E txn_55__prodlifecycleclassvalue: string
E txn_56__prodvolumeclassvalue: string
E txn_57__prodprofitabilityclassvalue: string
E txn_58__prodrevenueclassvalue: string
E txn_59__prodcompclassvalue: string
E txn_60__prodcostclassvalue: string
E txn_61__dealitemrecordvid: string
E txn_62__dealid: string
E txn_63__basedealid: string
E txn_64__deallineid: string
E txn_65__distributor: string
E txn_66__truesupportedprice: double
E txn_67__avgpocketmargin: double
E txn_68__covarcustpriceperunita: double
E txn_69__covarinvoicepriceperunita: double
E txn_70__covarmarginperunita: double
E txn_71__competitorprice: double
E txn_72__customerdiscoffmarket: double
E txn_73__listpriceyield: double
E txn_74__diffofrevenueandprofit: double
E txn_75__freightopportunityamount: double
E txn_76__freightopportunityabsorption: double
E txn_77__freightrecoveryratio: double
E txn_78__grossmargin: double
E txn_79__invoicepriceyield: double
E txn_80__marketpriceyield: double
E txn_81__pocketmarginyield: double
E txn_82__pocketmarginpercentofpriceprotection: double
E txn_83__productcount: double
E txn_84__quantitypricecorrelation: double
E txn_85__regionaldiscoffrefprice: double
E txn_86__regionalpricerealization: double
E txn_87__servicectsopportunityamount: double
E txn_88__serviceabsorptionratio: double
E txn_89__servicectsrecoveryratio: double
E txn_90__standarddevpriceperunita: double
E txn_91__stddevtruesupportedprice: double
E txn_92__stddevpocketmarginunit: double
E txn_93__stddevpocketprice: double
E txn_94__targetprice: double
E txn_95__targetpriceyield: double
E txn_96__pocketmarginpercentofrevenue: double
E txn_97__pocketmarginpercentofinvoiceprice: double
E txn_98__pocketmarginpercentofnetprice: double
E txn_99__totalcost: double
E txn_100__totalcostandrebates: double
E txn_101__causalityvariablecost: double
E txn_102__cogs: double
E txn_103__totalcts: double
E txn_104__totalctsopportunity: double
E txn_105__totalctsrecovered: double
E txn_106__totalctsrecovery: double
E txn_107__totaldiscountoffmarket: double
E txn_108__totalrebates: double
E txn_109__transactioncount: double
E txn_110__covartruesupportedpriceperunita: double
E txn_111__productweight: int64
E txn_112__originalquantity: double
E txn_113__dealtype: string
E txn_114__dealtypelabel: string
E txn_115__trafficlightthrfound: string
E txn_116__trafficlighthighthrspecific: double
E txn_117__trafficlightlowthrspecific: double
E txn_118__trafficlighthighthrdefault: double
E txn_119__trafficlightlowthrdefault: double
E txn_120__trafficlighthighthr: double
E txn_121__trafficlightlowthr: double
E txn_122__wtrafficlighthighthr: double
E txn_123__wtrafficlightlowthr: double
E txn_124__trafficlighttransformedpocketmarginpct: double
E txn_125__pptype: string
E txn_126__ppdelayedrollingmonthsperiod: string
E txn_127__ppfirmthroughperiod: string
E txn_128__ppdelayedperiod: double
E txn_129__pphas: string
E txn_130__pphaslabel: string
E txn_131__ppapplicablefor: string
E txn_132__pchas: string
E txn_133__pchaslabel: string
E txn_134__pcperiod: string
E txn_135__pcoverallperiod: string
E txn_136__pppcbypass: string
E txn_137__pppcbypasslabel: string
E txn_138__dqa: string
E txn_139__power: double
E txn_140__segmentpower: double
E txn_141__risk: double
E txn_142__segmentrisk: double
E txn_143__segmentvaliditytype: string
E txn_144__marketdemand: double
E txn_145__avgmarketdemand: double
E txn_146__industryrisk: double
E txn_147__avgindustryrisk: double
E txn_148__segmentstrategy: string
E txn_149__reversecost: double
E txn_150__segmentdesc: string
E txn_151__reverserisk: double
E txn_152__segmentheat: double
E txn_153__vcorrelated_deal_transactions: string
E txn_154__vbasepricepercent: double
E txn_155__vbasepriceperunit: double
E txn_156__productvalueadjpercent: double
E txn_157__productvalueadjperunit: double
E txn_158__strategyadjpercent: double
E txn_159__strategyadjperunit: double
E txn_160__referencemanualoverridepercent: double
E txn_161__referencemanualoverrideperunit: double
E txn_162__vlistpricepercent: double
E txn_163__vlistpriceperunit: double
E txn_164__regionaladjpercent: double
E txn_165__regionaladjperunit: double
E txn_166__regionalmanualoverridepercent: double
E txn_167__regionalmanualoverrideperunit: double
E txn_168__regionalpricepercent: double
E txn_169__regionalpriceperunit: double
E txn_170__countryadjpercent: double
E txn_171__countryadjperunit: double
E txn_172__pricecontroladjpercent: double
E txn_173__pricecontroladjperunit: double
E txn_174__countrymanualoverridepercent: double
E txn_175__countrymanualoverrideperunit: double
E txn_176__countrypricepercent: double
E txn_177__countrypriceperunit: double
E txn_178__channeladjpercent: double
E txn_179__channeladjperunit: double
E txn_180__marketsegmentadjpercent: double
E txn_181__marketsegmentadjperunit: double
E txn_182__marketpricepercent: double
E txn_183__marketpriceperunit: double
E txn_184__negotiateddiscountpercent: double
E txn_185__negotiateddiscountperunit: double
E txn_186__customerpricepercent: double
E txn_187__customerpriceperunit: double
E txn_188__priceprotectioncapcostpercent: double
E txn_189__priceprotectioncapcostperunit: double
E txn_190__customernegotiatedpricepercent: double
E txn_191__customernegotiatedpriceperunit: double
E txn_192__volumediscountpercent: double
E txn_193__volumediscountperunit: double
E txn_194__promotionaldiscountpercent: double
E txn_195__promotionaldiscountperunit: double
E txn_196__freightchargepercent: double
E txn_197__freightchargeperunit: double
E txn_198__surchargespercent: double
E txn_199__surchargesperunit: double
E txn_200__servicechargespercent: double
E txn_201__servicechargesperunit: double
E txn_202__manualadjustmentspercent: double
E txn_203__manualadjustmentsperunit: double
E txn_204__vinvoicepricepercent: double
E txn_205__vinvoicepriceperunit: double
E txn_206__shipanddebitadjpercent: double
E txn_207__shipanddebitadjperunit: double
E txn_208__supportedpricepercent: double
E txn_209__supportedpriceperunit: double
E txn_210__distirebatepercent: double
E txn_211__distirebateperunit: double
E txn_212__endcustrebatespercent: double
E txn_213__endcustrebatesperunit: double
E txn_214__mdfcooppercent: double
E txn_215__mdfcoopperunit: double
E txn_216__netpricepercent: double
E txn_217__netpriceperunit: double
E txn_218__paymentcostpercent: double
E txn_219__paymentcostperunit: double
E txn_220__freightcostpercent: double
E txn_221__freightcostperunit: double
E txn_222__servicecostpercent: double
E txn_223__servicecostperunit: double
E txn_224__vpocketpricepercent: double
E txn_225__vpocketpriceperunit: double
E txn_226__variablecogspercent: double
E txn_227__variablecogsperunit: double
E txn_228__variablepocketmarginpercent: double
E txn_229__variablepocketmarginperunit: double
E txn_230__fixedmfgcostpercent: double
E txn_231__fixedmfgcostperunit: double
E txn_232__vpocketmarginpercent: double
E txn_233__vpocketmarginperunit: double
E txn_234__truesupportedpricepercent: double
E txn_235__truesupportedpriceperunit: double
E txn_236__competitorpricepercent: double
E txn_237__competitorpriceperunit: double
E txn_238__targetpricepercent: double
E txn_239__targetpriceperunit: double
E txn_240__totalcostandrebatespercent: double
E txn_241__totalcostandrebatesperunit: double
E txn_242__causalityvariablecostpercent: double
E txn_243__causalityvariablecostperunit: double
E txn_244__reversecostpercent: double
E txn_245__reversecostperunit: double
E txn_246__vbaseprice: double
E txn_247__productvalueadj: double
E txn_248__strategyadj: double
E txn_249__referencemanualoverride: double
E txn_250__vlistprice: double
E txn_251__regionaladj: double
E txn_252__regionalmanualoverride: double
E txn_253__regionalprice: double
E txn_254__countryadj: double
E txn_255__pricecontroladj: double
E txn_256__countrymanualoverride: double
E txn_257__countryprice: double
E txn_258__channeladj: double
E txn_259__marketsegmentadj: double
E txn_260__marketprice: double
E txn_261__negotiateddiscount: double
E txn_262__customerprice: double
E txn_263__priceprotectioncapcost: double
E txn_264__customernegotiatedprice: double
E txn_265__volumediscount: double
E txn_266__promotionaldiscount: double
E txn_267__freightcharge: double
E txn_268__surcharges: double
E txn_269__servicecharges: double
E txn_270__manualadjustments: double
E txn_271__vinvoiceprice: double
E txn_272__shipanddebitadj: double
E txn_273__supportedprice: double
E txn_274__distirebate: double
E txn_275__endcustrebates: double
E txn_276__mdfcoop: double
E txn_277__netprice: double
E txn_278__paymentcost: double
E txn_279__freightcost: double
E txn_280__servicecost: double
E txn_281__vpocketprice: double
E txn_282__variablecogs: double
E txn_283__variablepocketmargin: double
E txn_284__fixedmfgcost: double
E txn_285__vpocketmargin: double
E txn_286__productlevel1: string
E txn_287__productlevel2: string
E txn_288__productlevel3: string
E txn_289__productlevel4: string
E txn_290__product: string
E txn_291__customerlevel1: string
E txn_292__customerlevel2: string
E txn_293__soldto: string
E txn_294__saleslevel1: string
E txn_295__saleslevel2: string
E txn_296__salesorg: string
E txn_297__saleslevel4: string
E txn_298__saleslevel5: string
E txn_299__geographylevel1: string
E txn_300__geographylevel2: string
E txn_301__geographylevel3: string
E txn_302__geographylevel4: string
E txn_303__geographylevel5: string
E txn_304__transactiondate: date32[day]
E txn_305__uom: string
E txn_306__transactionid: string
E txn_307__quantity: double
E vs
E txn_299__geographylevel1: string
E txn_305__uom: string
E txn_302__geographylevel4: string
E txn_275__endcustrebates: double
E txn_303__geographylevel5: string
E txn_290__product: string
E txn_107__totaldiscountoffmarket: double
E txn_282__variablecogs: double
E txn_278__paymentcost: double
E txn_271__vinvoiceprice: double
E txn_296__salesorg: string
E txn_276__mdfcoop: double
E txn_279__freightcost: double
E txn_301__geographylevel3: string
E txn_300__geographylevel2: string
E txn_39__vpricingcurrency: string
E txn_294__saleslevel1: string
E txn_272__shipanddebitadj: double
E txn_284__fixedmfgcost: double
E txn_304__transactiondate: date32[day]
E id: int64
E txn_286__productlevel1: string
E txn_274__distirebate: double
E txn_307__quantity: double
E txn_280__servicecost: double
E txn_287__productlevel2: string
E txn_295__saleslevel2: string
Describe the bug
Hello community,
I'm currently attempting to load Parquet files using awswrangler version 3.3.0 and the s3.read_parquet function, with the parameters chunked=True and specifying the specific columns I wish to load. While the initial reading of the data seems to work correctly, I encounter an error when attempting to concatenate the loaded chunks together. Interestingly, when I omit the columns parameter, the concatenation process functions without any issues.
To illustrate this problem and make it easier for the community to assist, I've provided a reproducible example using a public dataset. The second part results in the following error message:
How to Reproduce
Expected behavior
In Version 3.1 I can concatenate the chunks without any errors using the parameter columns.
Your project
No response
Screenshots
No response
OS
Linux
Python version
Python 3.10.12
AWS SDK for pandas version
2.0.3
Additional context
No response