UBOdin / mimir

Data-ish exploration through SQL+Uncertainty
http://mimirdb.info
Apache License 2.0

Simplify error-aware CSV parser #335

Open okennedy opened 5 years ago

okennedy commented 5 years ago

Currently, the error-aware CSV parser is a modified copy of the existing Spark CSV parser.

  1. There's a lot of overhead in the CSV parser for dealing with things that Mimir already handles (e.g., type inference and header detection).
  2. The Spark CSV parser already has some error detection capabilities (see org.apache.spark.sql.execution.datasources.FailureSafeParser). We might be able to leverage some of these as well.
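
To make the first point concrete, here is a rough sketch of how small a parser could be if it only splits fields and flags bad lines, leaving type inference and header detection to later stages. It is plain Scala with no Spark dependency. Every name in it (ErrorAwareCsv, Row, Good, Bad, splitLine, parse, expectedWidth) is hypothetical and not taken from Mimir or Spark, and the error rules (unterminated quote, wrong field count) are assumptions about what "error-aware" should cover.

```scala
// Hypothetical sketch: none of these names come from Mimir or Spark.
object ErrorAwareCsv {
  // One input line becomes either its raw string fields or the raw line plus a reason.
  sealed trait Row
  final case class Good(fields: Vector[String]) extends Row
  final case class Bad(raw: String, reason: String) extends Row

  // Split one line on `sep`, honoring double-quoted fields ("" is an escaped quote).
  // Returns Left(reason) if a quoted field is never closed.
  def splitLine(line: String, sep: Char = ','): Either[String, Vector[String]] = {
    val fields = Vector.newBuilder[String]
    val cur = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == '"') {
          if (i + 1 < line.length && line.charAt(i + 1) == '"') {
            cur += '"' // escaped quote inside a quoted field
            i += 1
          } else {
            inQuotes = false
          }
        } else {
          cur += c
        }
      } else if (c == '"') {
        inQuotes = true
      } else if (c == sep) {
        fields += cur.toString
        cur.clear()
      } else {
        cur += c
      }
      i += 1
    }
    if (inQuotes) Left("unterminated quote")
    else {
      fields += cur.toString
      Right(fields.result())
    }
  }

  // No type inference and no header detection: every field stays a String,
  // and malformed lines are kept as Bad rows instead of aborting the load.
  def parse(lines: Iterator[String], expectedWidth: Int): Iterator[Row] =
    lines.map { line =>
      splitLine(line) match {
        case Left(reason) => Bad(line, reason)
        case Right(fs) if fs.length != expectedWidth =>
          Bad(line, s"expected $expectedWidth fields, found ${fs.length}")
        case Right(fs) => Good(fs)
      }
    }
}
```

Calling `ErrorAwareCsv.parse(lines, expectedWidth = 3)` yields one `Row` per input line, so downstream code can decide how to repair or annotate the `Bad` rows. The sketch ignores quoted fields that span multiple lines, which a real replacement would need to handle.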