FRosner / drunken-data-quality

Spark package for checking data quality
Apache License 2.0

Verify a descriptor passed into PyDDQ FileOutputStream constructor #107

Closed Gerrrr closed 8 years ago

Gerrrr commented 8 years ago

With this PR, passing an incorrect argument into FileOutputStream produces a more user-friendly error message instead of:

AttributeError: 'OutStream' object has no attribute 'mode'

codecov-io commented 8 years ago

Current coverage is 100% (diff: 100%)

Merging #107 into master will not change coverage

@@           master   #107   diff @@
====================================
  Files          24     24          
  Lines         437    437          
  Methods       421    421          
  Messages        0      0          
  Branches       16     16          
====================================
  Hits          437    437          
  Misses          0      0          
  Partials        0      0          

Powered by Codecov. Last update c89d664...a9cbdfd

FRosner commented 8 years ago

LGTM @Gerrrr. How can I reproduce the initial problem?

Gerrrr commented 8 years ago

Before:

>>> from pyddq.streams import FileOutputStream
>>> fos = FileOutputStream("i am string, not file")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyddq/streams.py", line 39, in __init__
    mode = descriptor.mode
AttributeError: 'str' object has no attribute 'mode'

Or in Jupyter or Zeppelin, where sys.stdout is replaced by a non-file object:

from pyddq.core import Check

df = sqlContext.createDataFrame([(1, "a"), (1, None), (3, "c")])
check = Check(df).hasUniqueKey("_1", "_2").isNeverNull("_1")
check.run()
AttributeError                            Traceback (most recent call last)
<ipython-input-1-49d6990a80e2> in <module>()
      3 df = sqlContext.createDataFrame([(1, "a"), (1, None), (3, "c")])
      4 check = Check(df).hasUniqueKey("_1", "_2").isNeverNull("_1")
----> 5 check.run()

/Users/gerrrr/Work/drunken-data-quality/ve/lib/python2.7/site-packages/pyddq/core.pyc in run(self, reporters)
    337         """
    338         if not reporters:
--> 339             reporters = [ConsoleReporter()]
    340 
    341         jvm_reporters = jc.iterable_to_scala_list(

/Users/gerrrr/Work/drunken-data-quality/ve/lib/python2.7/site-packages/pyddq/reporters.pyc in __init__(self, output_stream)
     10     def __init__(self, output_stream=None):
     11         if not output_stream:
---> 12             output_stream = FileOutputStream(sys.stdout)
     13 
     14         if not isinstance(output_stream, OutputStream):

/Users/gerrrr/Work/drunken-data-quality/ve/lib/python2.7/site-packages/pyddq/streams.pyc in __init__(self, descriptor)
     37     def __init__(self, descriptor):
     38         self.descriptor = descriptor
---> 39         mode = descriptor.mode
     40         if mode == "r":
     41             raise ValueError("Descriptor is opened for reading")

AttributeError: 'OutStream' object has no attribute 'mode'

With this PR, the error in both cases is:

>>> from pyddq.streams import FileOutputStream
>>> fos = FileOutputStream("i am string, not file")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyddq/streams.py", line 39, in __init__
    raise ValueError("Descriptor is not a file")
ValueError: Descriptor is not a file
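
For illustration, the kind of check described above could look roughly like the sketch below. It is not the actual pyddq code: the class name FileOutputStreamSketch is hypothetical, and it assumes Python 3, where "is a file" is approximated as an io.IOBase instance that exposes a mode attribute (the tracebacks above are from Python 2.7, where the check would differ). Only the two error messages are taken from the tracebacks in this thread.

```python
import io


class FileOutputStreamSketch(object):
    """Hypothetical stand-in for pyddq.streams.FileOutputStream."""

    def __init__(self, descriptor):
        # Assumption: a usable descriptor is an io.IOBase instance that
        # carries a `mode` attribute, as objects returned by open() do.
        mode = getattr(descriptor, "mode", None)
        if not isinstance(descriptor, io.IOBase) or mode is None:
            raise ValueError("Descriptor is not a file")
        # Same read-only check that appears in the traceback above.
        if mode == "r":
            raise ValueError("Descriptor is opened for reading")
        self.descriptor = descriptor
```

With a check of this shape, a plain string and an in-memory stream such as io.StringIO() (which has no mode attribute) are both rejected with "Descriptor is not a file", while a file opened for writing is accepted.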