databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

PYSPARK Dataframe Not Saved into Hive Properly #262

Closed BramhaAelem closed 7 years ago

BramhaAelem commented 7 years ago

Hi, I am using the Databricks spark-xml utility to ingest XML data into Hive with PySpark, using spark-xml_2.10-0.2.0.jar.

Python version: 2.7.5
Spark version: 1.6.2

The row tag is the root tag of the XML:

    df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='SPARK_TEST_XML').load(hdfsPath)
    df.write.format("orc").mode("overwrite").saveAsTable("default.test_spark_xml")

Please find the XML attached: Spark_XML.txt

My job completes successfully, but when I try to describe the table in Hive, it fails with:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. > expected at the position 4007 of 'struct<AdditionalInsureds:struct<AdditionalInsured:array<struct<Age:bigint,ClientId:string,FirstPI:boolean,Gender:string,InsuranceStatement:struct<PrevPolicy:struct<Amount:bigint,AmtAtRisk:bigint,BenefitAmt:bigint,BenefitType:string,CCRInd:boolean,Class:string,Description:string,ExtraPremAmt:bigint,ExtraPremExp:string,HOAssignedAppNumber:string,InsurabilityInd:boolean,InsuredRole:string,PlanCode:string,PlanVersion:string,PolStatus:string,ProposalDate:string,RecordData:string,Records:string,ReinsInd:boolean,RiskClass:string,TotalRiskAmt:bigint,WPADBInd:boolean>,ReviewedPriorInsInd:boolean,SmokingInd:boolean,TotalAmtInforce:bigint,TotalInforceAndAppliedInsRisk:bigint>,MIB:struct<MIBResponses:struct<MIBResponse:array<struct<HitTryCode:string,ResponseData:string,SameInd:boolean>>>,ReqStatus:string>,MVR:struct<ApplicationDate:string,Height:bigint,LicenseDetail:struct<License:array<struct<ExpirationDate:string,IssuedDate:string,LicenseClass:string,Restriction:string,State:string,Status:string>>>,ReportDate:string,TotalPoints:bigint,ViolationDetail:struct<Violation:array<struct<Code:string,ConvictionReinsDte:string,Points:bigint,ViolationDate:string,ViolationDescription:string,ViolationType:string>>>>,OtherCoverages:struct<OtherCoverage:array<struct<AdditionalDetails:string,AmountInForce:bigint,AmountPending:bigint,CompanyName:string,PolicyPurpose:string,PolicyStatus:string,ReplaceInd:boolean>>>,PreviousPolicies:struct<PreviousPolicy:array<struct<AdditionalCoverage:struct<Coverages:struct<Coverage:array<struct<AddCovUW:struct<AppliedForAmount:bigint,ApprovalDecision:string,ApprovedAmount:bigint,ExtraPremiums:struct<ExtraPremium:array<struct<ExtraPremAmt:bigint,ExtraPremReason:string,ExtraPremYears:bigint,OtherExtraPremAmt:bigint>>>,RiskClass:string>,DividendOption:string,RiderName:string,RiderType:string>>>>,CaseEUAgent:string,HOAssignedAppNumber:string,MainBenefit:struct<APL:boolean,DPPO:string,DividendOption:string,MainBenUW:struct<AppliedForAmount:bigint,ApprovalDecision:string,ApprovedAmount:bigint,BasicRating:double,BestCreditProgram:string,DateUWComplete:string,DeclineReasons:struct<DeclineType:string,FCRAReason:string,MultipleFCRA:string,OtherDeclineReason:string,Reconsideration:string>,ExtraPremiums:struct<ExtraPremium:array<struct<ExtraPremAmt:bigint,ExtraPremReason:string,ExtraPremYears:bigint,OtherExtraPremAmt:bigint>>>,ReinsuranceDetails:struct<AmountRetained:bigint,PercentageRetained:bigint,ReinsAccept:boolean,ReinsCompany:string,ReinsFace:bigint,ReinsOfferRet:boolean,ReinsSmokeStatus:string,ReinsuranceType:string>,RiskClass:string,TotalInforceAndAppliedInsRisk:bigint,TotalRate:double,TotalRating:bigint,UWNotices:struct<UILReason:string,UWNoticeReason:string>,WorkingRating:double>,PlanName:string,ProductCode:string,ProductVersionCode:string>,PolStatus:string,PolicyNumber:string,ProposalDate:string,SendMiscIndicators:struct<FastTrackInd:boolean,PendingCaseEBDRecvd:boolean,PendingCaseOutScope:boolean,PriorCaseAOP:boolean,PriorCaseAOPStatus:string,PriorCaseBOCP:boolean,PriorCaseCCRDeclined:boolean,PriorCaseDeclined:boolean,PriorCaseEBDRecvd:boolean,PriorCaseFacReins:boolean,PriorCaseRNF:boolean,PriorCaseRated:boolean,PriorCaseWPADBRated:boolean>>>>,RXDB:struct<AlternateOrderRef:string,AppResultStatus:string,Comments:string,CreationDate:string,DateSpan:string,FillDate:string,Gender:string,InsuredState:string,InsuredType:string,InsuredZipCode:string,OrderResult:string,OrderResultStatus:string,OrderResultURL:string,Prescriptions:struct<Prescription:array<struct<BrandName:string,Dosage:string,GenericName:string,Indications:struct<Indication:array>,NDC:string,NoDaysFill:bigint,PharmacyCode:string,Physician:struct<Address:struct<AddressCountry:string,AddressState:string,Zip:string>,Speciality:string>,PriorityOfDrug:string,Quantity:bigint>>>,ReviewResult:string,TotalNoFills:bigint,TrackingID:string>,TeleApp:struct<AdditionalDetails:string,CitizQSeePart1:boolean,CountryOfCitizenship:string,EDeliveryInd:boolean,EmploymentDet:string:string' but ':' is found.
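
The type string in the error stops partway through a nested struct, which suggests it was cut off when it was stored rather than generated incorrectly. A rough way to check a type string for this kind of truncation is to count angle brackets that are never closed. This is only a sketch: the function name `unclosed_brackets` and the sample strings are made up for illustration and are not part of spark-xml or Hive.

```python
def unclosed_brackets(type_string):
    """Return how many '<' in a Hive-style type string are never closed."""
    depth = 0
    for ch in type_string:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
    return depth


# Hypothetical example strings, not taken from the issue.
complete = "struct<a:array<struct<b:string>>>"
truncated = complete[:20]  # simulate a type string cut off mid-definition

print(unclosed_brackets(complete))
print(unclosed_brackets(truncated))
```

A complete type string should come back with zero unclosed brackets, while a cut-off one leaves some open.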

skm235 commented 7 years ago

It's a Hive issue rather than a Spark one. Please increase the column width as described in the JIRA below: https://www.mail-archive.com/issues@hive.apache.org/msg52188.html
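
Before saving, it may help to check which columns have a type string longer than what the metastore can hold. The sketch below assumes a limit of 4000 characters, inferred from the position reported in the error (4007); the real limit depends on the metastore schema, and `ASSUMED_TYPE_LIMIT` and `oversized_columns` are hypothetical names used only for illustration.

```python
# Assumed limit on a stored column type string; check your metastore schema.
ASSUMED_TYPE_LIMIT = 4000


def oversized_columns(columns, limit=ASSUMED_TYPE_LIMIT):
    """Return (name, length) for each column whose type string exceeds limit.

    `columns` is a list of (column_name, type_string) pairs.
    """
    return [(name, len(type_string))
            for name, type_string in columns
            if len(type_string) > limit]


# Hypothetical columns: one small, one wide struct with 500 string fields.
wide_struct = "struct<" + ",".join("f%d:string" % i for i in range(500)) + ">"
columns = [("id", "bigint"), ("payload", wide_struct)]

for name, length in oversized_columns(columns):
    print("%s: type string is %d characters" % (name, length))
```

In PySpark, the pairs could be built from a DataFrame with `[(f.name, f.dataType.simpleString()) for f in df.schema.fields]`, on the assumption that `simpleString()` yields the same type string that ends up in the metastore.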

HyukjinKwon commented 7 years ago

@skm235 Thanks for your investigation. I am closing this.