arangodb / arangodb-spark-datasource

ArangoDB Connector for Apache Spark, using the Spark DataSource API
Apache License 2.0

[DE-85] Feature/bad records #1

Closed · rashtao closed 2 years ago

rashtao commented 2 years ago

Allow setting the bad-record handling policy via a config parameter, e.g.:

spark.read
    .option("mode", "PERMISSIVE|DROPMALFORMED|FAILFAST")

Review JacksonParser behavior and the cases in which it throws BadRecordException. Review the official documentation of DataFrameReader#json (https://spark.apache.org/docs/3.1.2/api/java/org/apache/spark/sql/DataFrameReader.html#json-java.lang.String...-), in particular the following options (a runnable sketch of the modes follows the quoted docs):

mode (default PERMISSIVE): allows a mode for dealing with corrupt records during parsing.

    PERMISSIVE : when it meets a corrupted record, puts the malformed string into a field 
      configured by columnNameOfCorruptRecord, and sets malformed fields to null. To keep corrupt 
      records, a user can set a string type field named columnNameOfCorruptRecord in a 
      user-defined schema. If the schema does not have the field, it drops corrupt records during 
      parsing. When inferring a schema, it implicitly adds a columnNameOfCorruptRecord field to 
      the output schema.
    DROPMALFORMED : ignores corrupted records entirely.
    FAILFAST : throws an exception when it meets corrupted records.

columnNameOfCorruptRecord (default is the value specified in spark.sql.columnNameOfCorruptRecord): 
  allows renaming the new field that holds the malformed string created by PERMISSIVE mode. This 
  overrides spark.sql.columnNameOfCorruptRecord.
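
To make the quoted behavior concrete, here is a minimal, self-contained sketch against Spark's built-in JSON reader (not this connector), showing PERMISSIVE with a corrupt-record column; DROPMALFORMED and FAILFAST are noted in the comments:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // One well-formed and one malformed JSON line.
    val lines = Seq("""{"a": 1}""", """{"a": oops}""").toDS()

    // PERMISSIVE: the malformed line is kept as a raw string in the
    // corrupt-record column, which must be declared in the schema.
    val schema = StructType(Seq(
      StructField("a", IntegerType),
      StructField("_corrupt_record", StringType)
    ))
    spark.read
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .json(lines)
      .show(false)
    // a = 1    | _corrupt_record = null          (good record)
    // a = null | _corrupt_record = {"a": oops}   (bad record kept)

    // DROPMALFORMED: the malformed line is silently dropped.
    val plain = StructType(Seq(StructField("a", IntegerType)))
    spark.read.schema(plain).option("mode", "DROPMALFORMED").json(lines).show()

    // FAILFAST: the read throws org.apache.spark.SparkException as soon as
    // the malformed line is met.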
sonarcloud[bot] commented 2 years ago

SonarCloud Quality Gate failed.

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)
Coverage: 0.0%
Duplication: 5.6%