TresAmigosSD / SMV

Spark Modularized View
Apache License 2.0

Delimiter ";" does not work #1585

Closed: fei-cheng closed this issue 5 years ago

fei-cheng commented 5 years ago

When I changed the delimiter of CB1200CZ11.csv from | to ; and changed the schema attribute to @delimiter = ; accordingly, I got the error below while running the module in the demo.

Run command: smv-run -m stage1.employment.EmploymentByState

Error log:

----------------------
stage1.employment.EmploymentByState
----------------------
Traceback (most recent call last):
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/tools/../src/main/python/scripts/runapp.py", line 17, in <module>
    SmvDriver().run()
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvdriver.py", line 60, in run
    self.main(app, driver_args)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvdriver.py", line 41, in main
    app.run()
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvapp.py", line 710, in run
    self._generate_output_modules(mods)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvapp.py", line 680, in _generate_output_modules
    SmvModuleRunner(mods, self).run()
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvmodulerunner.py", line 49, in run
    self._create_df(known, mods_to_run_post_action, collector, forceRun)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvmodulerunner.py", line 125, in _create_df
    self.visitor.dfs_visit(runner, (known, need_post, collector), need_to_run_only=True)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/modulesvisitor.py", line 87, in dfs_visit
    action(m, state)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvmodulerunner.py", line 124, in runner
    m._do_it(fqn2df, run_set, collector, forceRun, is_quick_run)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvgenericmodule.py", line 271, in _do_it
    self._populate_data(fqn2df, run_set, collector, forceRun, is_quick_run)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvgenericmodule.py", line 287, in _populate_data
    res = self._computeData(fqn2df, run_set, collector, is_quick_run)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvgenericmodule.py", line 312, in _computeData
    raw_df = self.doRun(fqn2df)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smvinput.py", line 46, in doRun
    df = super(SmvCsvFile, self).doRun(known)
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/iomod/inputs.py", line 321, in doRun
    self.smvSchema(),
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/iomod/inputs.py", line 291, in smvSchema
    schema = SmvSchemaOnHdfsIoStrategy(self.smvApp, abs_file_path).read()
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/smv/smviostrategy.py", line 353, in read
    self._file_path
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/Users/fei.cheng/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.fromFile.
: java.lang.ArrayIndexOutOfBoundsException: 1
    at org.tresamigos.smv.SmvSchema$$anonfun$14.apply(SmvSchema.scala:522)
    at org.tresamigos.smv.SmvSchema$$anonfun$14.apply(SmvSchema.scala:522)
    at scala.collection.immutable.List.map(List.scala:288)
    at org.tresamigos.smv.SmvSchema$.schemaFromEntryStrings(SmvSchema.scala:522)
    at org.tresamigos.smv.SmvSchema$.fromFile(SmvSchema.scala:545)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

AliTajeldin commented 5 years ago

@fei-cheng which versions of SMV and Spark are you using? I suspect this is caused by the split on ';' that we do on the input schema string (whether it is read from a file or passed directly as a string). We could introduce a sentinel value to represent a semicolon (e.g. @delimiter = semi-colon) and then translate that sentinel to a real ';' on read/write.
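
To illustrate the suspicion and the sentinel idea, here is a minimal Python sketch. It is not SMV's actual code: the function names, the SENTINELS table, and the exact split behavior are assumptions made for illustration, and the real parsing lives in SmvSchema.scala.

```python
# Hypothetical sentinel table. "semi-colon" is the spelling suggested above;
# nothing here is taken from the real SMV implementation.
SENTINELS = {"semi-colon": ";"}


def split_entries(schema_text):
    """Split a schema string into entries on ';' (the suspected behavior)."""
    return [e.strip() for e in schema_text.split(";") if e.strip()]


def parse_delimiter(entry):
    """Parse an '@delimiter = X' entry, translating a sentinel to the real character."""
    _key, _sep, value = entry.partition("=")
    value = value.strip()
    return SENTINELS.get(value, value)


def encode_delimiter(delim):
    """Build the attribute line for writing, using a sentinel where one exists."""
    for name, char in SENTINELS.items():
        if char == delim:
            return "@delimiter = " + name
    return "@delimiter = " + delim


# The literal ';' is consumed by the entry split, so the value comes back empty.
lost = parse_delimiter(split_entries("@delimiter = ;")[0])

# The sentinel survives the split and is translated back on read.
kept = parse_delimiter(split_entries("@delimiter = semi-colon")[0])
```

In this sketch the literal ';' never reaches the attribute parser because the entry split removes it first, while the sentinel passes through untouched. Other delimiters such as | are unaffected because they have no sentinel entry.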

fei-cheng commented 5 years ago

I tested on both SMV 2.1.1.1 and the latest version installed by 'pip install'; both produce the same error.

ninjapapa commented 5 years ago

Pretty sure you can fix it here: https://github.com/TresAmigosSD/SMV/blob/master/src/main/scala/org/tresamigos/smv/SmvSchema.scala#L490

@fei-cheng do you want to give it a try?

AliTajeldin commented 5 years ago

And here for the parsing part: https://github.com/TresAmigosSD/SMV/blob/master/src/main/scala/org/tresamigos/smv/SmvSchema.scala#L466

fei-cheng commented 5 years ago

@ninjapapa I will try to fix it