It would appear that when harvesting records outside of an OAI set, the `oai_set` column is not handled properly. This seems to be the cause of the following error during a subsequent transformation:
```
utils.LineBufferedStream: stdout: 2019-03-21 13:51:34 INFO DAGScheduler:54 - ResultStage 12 (runJob at PythonRDD.scala:152) failed in 1.210 s due to Job aborted due to stage failure: Task 0 in stage 12.0 failed 1 times, most recent failure: Lost task 0.0 in stage 12.0 (TID 814, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1556, in __getattr__
utils.LineBufferedStream: stdout: idx = self.__fields__.index(item)
utils.LineBufferedStream: stdout: ValueError: 'oai_set' is not in list
utils.LineBufferedStream: stdout:
utils.LineBufferedStream: stdout: During handling of the above exception, another exception occurred:
utils.LineBufferedStream: stdout:
utils.LineBufferedStream: stdout: Traceback (most recent call last):
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 253, in main
utils.LineBufferedStream: stdout: process()
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/worker.py", line 248, in process
utils.LineBufferedStream: stdout: serializer.dump_stream(func(split_index, iterator), outfile)
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 379, in dump_stream
utils.LineBufferedStream: stdout: vs = list(itertools.islice(iterator, batch))
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1352, in takeUpToNumLeft
utils.LineBufferedStream: stdout: File "/tmp/spark-efbe4281-2455-45f2-bdbf-fb8cba3a923b/userFiles-81fbc434-604c-4a9b-a119-fb41ee82e28d/jobs.py", line 1328, in transform_xslt_pt_udf
utils.LineBufferedStream: stdout: oai_set = row.oai_set,
utils.LineBufferedStream: stdout: File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1561, in __getattr__
utils.LineBufferedStream: stdout: raise AttributeError(item)
utils.LineBufferedStream: stdout: AttributeError: oai_set
```
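The traceback shows the transform reading `row.oai_set` directly, which raises `AttributeError` when the harvested schema has no `oai_set` column. A minimal sketch of a defensive workaround, assuming the transform accesses the attribute that way (the `RecordRow` fields below are hypothetical stand-ins for the real schema):

```python
from collections import namedtuple

# Stand-in for a pyspark Row harvested outside of any OAI set:
# the schema simply lacks the oai_set column (field names are hypothetical).
RecordRow = namedtuple("RecordRow", ["record_id", "document"])
row = RecordRow(record_id="oai:example:1", document="<record/>")

# Defensive access: fall back to None when the column is absent,
# instead of letting the AttributeError abort the Spark stage.
oai_set = getattr(row, "oai_set", None)
print(oai_set)  # None
```

This works because `Row.__getattr__` raises `AttributeError` for missing fields (as in the traceback), which `getattr` with a default swallows; the more complete fix would be to ensure the harvest step always writes an `oai_set` column, even when no set was requested.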