Supposing the problem was with the input file, I uploaded a second version of the CSV file. This time the file has two columns: (a) index: integer; (b) skills: text, quoted, with the skills separated by commas.
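For illustration, a couple of rows in this layout would look like the following (the skill values here are just placeholders, not my actual data):

index,skills
1,"java,python,spark"
2,"sql,excel"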
Then I changed the “Data reader” block to point to the new skills data source, and changed the “Frequent itemsets mining” block so that "Attribute with transactions" = skills.
When I tried to execute the new version of the workflow I got the following error message:
Traceback (most recent call last):
  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 287, in _perform_execute
    loader.workflow, loader.graph, {}, out, job_id)
  File "/usr/local/juicer/juicer/spark/transpiler.py", line 222, in transpile
    class_name = self.operations[task['operation']['slug']]
KeyError: u'association-rules'
I am really lost here. Any help is welcome.
Hi @quigonjeff, could you send me a screenshot of your data reader output?
1) Create a new workflow;
2) Add only a Data Reader operator and select the box to display a data sample;
3) Execute;
4) Select the Data Reader field in the job's log, so it will print a sample of your database.
I would, but right now no data source is appearing in the “Data source” combobox. I checked, and my data sources are listed on the “Datasources” page. Even the data source of the workflow I had created before is gone: the field is blank, and when I open the combobox nothing appears (the dropdown list is empty).
Hi, @quigonjeff, Could you send me the browser console error when you try to open the "data source" combobox?
After logging in just now, the data source combobox is showing the data sources again. I executed the steps you asked. The output is as follows:
This is the result when I select the file with 2 columns (techjobs-skills.csv)
When I select the file with many columns (techjobs-skills-unquoted.csv) -- with number of items varying for each transaction -- and try to execute I get an error:
Traceback (most recent call last):
  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 306, in _perform_execute
    self._emit_event(room=job_id, namespace='/stand'))
  File "/tmp/juicer_app_10_10_123.py", line 146, in main
    task_futures['d2a3fd11-7d2f-40e2-81b5-4bdf8c24d1fd'].result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/_base.py", line 405, in result
    return self.get_result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/juicer_app_10_10_123.py", line 145, in <lambda>
    lambda: data_reader_0(spark_session, cached_state, emit_event))
  File "/tmp/juicer_app_10_10_123.py", line 116, in data_reader_0
    mode='PERMISSIVE')
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 407, in csv
    columnNameOfCorruptRecord=columnNameOfCorruptRecord, multiLine=multiLine)
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 54, in _set_opts
    self.schema(schema)
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 105, in schema
    jschema = spark._jsparkSession.parseDataType(schema.json())
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'Failed to convert the JSON string \'{"metadata":{},"name":"value","nullable":1,"type":"string"}\' to a field.'
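For context, the IllegalArgumentException is Spark rejecting the serialized schema: in the JSON fragment above, "nullable" is the integer 1, while Spark's schema parser expects a JSON boolean. A minimal sketch of a well-formed schema for the same "value" attribute, built directly in PySpark outside the platform, looks like this:

from pyspark.sql.types import StructType, StructField, StringType

# "nullable" must serialize to true/false, not 1/0, for parseDataType to accept it.
schema = StructType([StructField("value", StringType(), nullable=True)])
print(schema.json())
# {"fields":[{"metadata":{},"name":"value","nullable":true,"type":"string"}],"type":"struct"}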
Each transaction must have its set of items in a single column, which is why the system could read the first file but not the second. When you create a workflow for Frequent Itemset Mining, try to add a "Transformation" box before the Frequent Itemset box, because it needs a vector of items as a parameter. With the Transformation Expression Editor you can use a split function to break each line into an array of items (try splitting each line by comma), e.g. split(skills, ','). Give the new transformed attribute a name, since you will use it in the Frequent Itemset box; insert a value for minimum support; then save and execute.
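If it helps to see what those boxes amount to, here is a rough PySpark equivalent of the split + frequent itemset steps (the file name is taken from your message above; the new column name and the minimum support value are only illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.getOrCreate()

# Read the two-column file; each quoted "skills" cell holds a comma-separated list.
df = spark.read.csv("techjobs-skills.csv", header=True, quote='"')

# Transformation box: split(skills, ',') turns the text into an array of items.
df = df.withColumn("skill_items", split(df["skills"], ","))

# Frequent itemset mining on the new attribute, with an illustrative minimum support.
fp = FPGrowth(itemsCol="skill_items", minSupport=0.1)
model = fp.fit(df)
model.freqItemsets.show()

Note that Spark's FPGrowth expects the items in each transaction to be distinct, so duplicated skills within a row would need to be removed first.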
I am not sure whether this is a bug or a misuse problem. I tried to build a simple workflow.
After a while waiting, I received the message: "Invalid or missing parameters: Missing parameter attribute"
I don't understand what the problem is. Should the CSV file have a different format?