MI-DPLA / combine

Combine /kämˌbīn/ - Metadata Aggregator Platform
MIT License

transformation error after harvest of records NOT within OAI set #387

ghukill closed 5 years ago

ghukill commented 5 years ago

It appears that when records are harvested outside of an OAI set, the oai_set column is not handled properly. This is presumably what causes the following error during a subsequent transformation:

utils.LineBufferedStream: stdout: 2019-03-21 13:51:34 INFO  DAGScheduler:54 - ResultStage 12 (runJob at PythonRDD.scala:152) failed in 1.210 s due to Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 814, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1556, in __getattr__
utils.LineBufferedStream: stdout:     idx = self.__fields__.index(item)
utils.LineBufferedStream: stdout: ValueError: 'oai_set' is not in list
utils.LineBufferedStream: stdout: 
utils.LineBufferedStream: stdout: During handling of the above exception, another exception occurred:
utils.LineBufferedStream: stdout: 
utils.LineBufferedStream: stdout: Traceback (most recent call last):
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
utils.LineBufferedStream: stdout:     process()
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
utils.LineBufferedStream: stdout:     serializer.dump_stream(func(split_index, iterator), outfile)
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
utils.LineBufferedStream: stdout:     vs = list(itertools.islice(iterator, batch))
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
utils.LineBufferedStream: stdout:   File "/tmp/spark-efbe4281-2455-45f2-bdbf-fb8cba3a923b/userFiles-81fbc434-604c-4a9b-a119-fb41ee82e28d/jobs.py", line 1328, in transform_xslt_pt_udf
utils.LineBufferedStream: stdout:     oai_set = row.oai_set,
utils.LineBufferedStream: stdout:   File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1561, in __getattr__
utils.LineBufferedStream: stdout:     raise AttributeError(item)
utils.LineBufferedStream: stdout: AttributeError: oai_set
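
The traceback shows row.oai_set raising AttributeError inside transform_xslt_pt_udf because the row has no oai_set field. One possible guard, sketched below, is to read the field with a default instead of plain attribute access. This is only an illustration and not necessarily how the issue was resolved. StandInRow, get_oai_set, and the record_id field are hypothetical names; StandInRow merely imitates the lookup behavior visible in the traceback so the example runs without Spark.

```python
class StandInRow:
    """Hypothetical stand-in imitating the field lookup seen in the traceback."""

    def __init__(self, **fields):
        self.__fields__ = list(fields)
        self._values = list(fields.values())

    def __getattr__(self, item):
        # Only called when normal attribute lookup fails.
        if item.startswith("_"):
            raise AttributeError(item)
        try:
            idx = self.__fields__.index(item)
            return self._values[idx]
        except ValueError:
            # Mirrors the traceback: a missing field surfaces as AttributeError.
            raise AttributeError(item)


def get_oai_set(row, default=None):
    """Return row.oai_set, or `default` when the row has no such field."""
    return getattr(row, "oai_set", default)


# A record harvested within a set, and one harvested outside any set.
with_set = StandInRow(record_id="abc", oai_set="maps")
without_set = StandInRow(record_id="def")

print(get_oai_set(with_set))
print(get_oai_set(without_set))
```

Because the traceback shows pyspark's Row raising AttributeError for a missing field, getattr with a default should behave the same way on a real Row. An alternative would be to make sure the oai_set column always exists at harvest time, so that downstream transformations never see a row without it.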

ghukill commented 5 years ago

fixed