eubr-bigsea / citron

Lemonade's front end in Emberjs

Frequent itemsets mining - Invalid or missing parameters: Missing parameter attribute #427

Closed · quigonjeff closed this 7 years ago

quigonjeff commented 7 years ago

I am not sure whether this is a bug or a misuse problem. I tried to build a simple workflow:

  1. Uploaded a CSV file (skills.csv) with one transaction per line but a varying number of items per transaction, as instructed in the lecture of 2017-09-18;
  2. Created a new workflow;
  3. Added a “Data reader” block; Data source: skills.csv; other parameters with default values;
  4. Added a “Frequent itemsets mining” block; Minimum support: 10000; other parameters with default values;
  5. Connected the [Data reader/output data] with [Frequent itemsets mining/input data];
  6. Added a “Table visualization”; all parameters with default values;
  7. Connected [Frequent itemsets mining/output data] with [Table visualization/input data];
  8. Added a “Association rules” block; all parameters with default values;
  9. Connected [Frequent itemsets mining/rules output] with [Association rules/input data];
  10. Saved;
  11. Executed;

After waiting a while, I received the message: «Invalid or missing parameters: Missing parameter attribute»

I don't understand what the problem is. Should the CSV file have a different format?
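
For context on what that message probably means: Juicer appears to validate each operation's required parameters before generating Spark code, and the Frequent itemsets mining operation needs an `attribute` parameter naming the column that holds the transactions. A minimal sketch of that kind of check, with a hypothetical class rather than Juicer's actual code:

```python
# Hypothetical sketch (not Juicer's actual code) of the check behind
# "Invalid or missing parameters: Missing parameter attribute".
class FrequentItemSetOperation:
    def __init__(self, parameters):
        # 'attribute' names the transactions column; leaving
        # "Attribute with transactions" blank in the UI leaves this
        # key absent, so validation fails before anything runs.
        if 'attribute' not in parameters:
            raise ValueError(
                'Invalid or missing parameters: Missing parameter attribute')
        self.attribute = parameters['attribute']
```

If that reading is right, the first error is about the blank "Attribute with transactions" field rather than the CSV format itself.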

quigonjeff commented 7 years ago

Supposing the problem was with the input file, I uploaded a second version of the CSV file. This time the file has two columns: (a) index: integer; (b) skills: text, quoted, with the skills separated by commas.

Then I changed the “Data reader” block to point to the new skills data source, and changed the “Frequent itemsets mining” block so that now "Attribute with transactions" = skills.

When I tried to execute the new version of the workflow I got the following error message:

Traceback (most recent call last):
  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 287, in _perform_execute
    loader.workflow, loader.graph, {}, out, job_id)
  File "/usr/local/juicer/juicer/spark/transpiler.py", line 222, in transpile
    class_name = self.operations[task['operation']['slug']]
KeyError: u'association-rules'

I am really lost here. Any help is welcome.
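
As for the KeyError itself: the last frame of the traceback shows the transpiler resolving each task through a dictionary keyed by the operation's slug (`class_name = self.operations[task['operation']['slug']]`), so 'association-rules' is apparently not registered in that mapping in this Juicer version. An illustrative reconstruction of the failing lookup (the dictionary entries here are made up):

```python
# Illustrative only: the transpiler maps operation slugs to the classes
# that emit Spark code for them. Entry names here are hypothetical.
operations = {
    'data-reader': 'DataReaderOperation',
    'frequent-item-set': 'FrequentItemSetOperation',
    # no 'association-rules' entry registered...
}

task = {'operation': {'slug': 'association-rules'}}

# ...so the same expression as in transpiler.py raises
# KeyError: 'association-rules'
class_name = operations[task['operation']['slug']]
```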

ojvictor commented 7 years ago

Hi @quigonjeff, could you send me a screenshot of your data reader output?

  1. Create a new workflow;
  2. Put only a Data Reader operator and select the box to display a data sample; [screenshot: data_reader_sample]
  3. Execute;
  4. Select the Data Reader field in the job's log, so it will print a sample of your database. [screenshot: jobs_show_database]

quigonjeff commented 7 years ago

I would, but right now no datasource is appearing in the “Data source” combobox. I checked and my datasources are appearing in the “Datasources” page. Even the datasource of the workflow I had created before is gone: the field is blank, and when I open the combobox nothing appears (the dropdown list is empty).

ojvictor commented 7 years ago

Hi, @quigonjeff, Could you send me the browser console error when you try to open the "data source" combobox?

quigonjeff commented 7 years ago

After logging in just now, the data source combobox is showing the datasources again. I executed the steps you asked. The output is as follows:

This is the result when I select the file with 2 columns (techjobs-skills.csv): [screenshot: screenshot_2017-09-21_18-08-15]

When I select the file with many columns (techjobs-skills-unquoted.csv) -- with the number of items varying for each transaction -- and try to execute, I get an error:

Traceback (most recent call last):
  File "/usr/local/juicer/juicer/spark/spark_minion.py", line 306, in _perform_execute
    self._emit_event(room=job_id, namespace='/stand'))
  File "/tmp/juicer_app_10_10_123.py", line 146, in main
    task_futures['d2a3fd11-7d2f-40e2-81b5-4bdf8c24d1fd'].result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/_base.py", line 405, in result
    return self.get_result()
  File "/usr/local/lib/python2.7/dist-packages/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/juicer_app_10_10_123.py", line 145, in <lambda>
    lambda: data_reader_0(spark_session, cached_state, emit_event))
  File "/tmp/juicer_app_10_10_123.py", line 116, in data_reader_0
    mode='PERMISSIVE')
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 407, in csv
    columnNameOfCorruptRecord=columnNameOfCorruptRecord, multiLine=multiLine)
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 54, in _set_opts
    self.schema(schema)
  File "/usr/local/spark/python/pyspark/sql/readwriter.py", line 105, in schema
    jschema = spark._jsparkSession.parseDataType(schema.json())
  File "/usr/local/lib/python2.7/dist-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark/python/pyspark/sql/utils.py", line 79, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
IllegalArgumentException: u'Failed to convert the JSON string \'{"metadata":{},"name":"value","nullable":1,"type":"string"}\' to a field.'
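
The last line of this traceback pinpoints the problem: the schema stored for that data source describes the field as {"metadata":{},"name":"value","nullable":1,"type":"string"}, and Spark's JVM-side parseDataType requires nullable to be a JSON boolean, not the integer 1. A quick check of the well-formed version in plain PySpark (fixing the stored metadata itself would have to happen on the Lemonade side):

```python
import json
from pyspark.sql.types import StructType

# Same field as in the traceback, but with "nullable" as a proper JSON
# boolean; this is the shape Spark's schema parser accepts.
good = json.loads(
    '{"type":"struct","fields":'
    '[{"metadata":{},"name":"value","nullable":true,"type":"string"}]}'
)
schema = StructType.fromJson(good)
print(schema)  # a StructType with one nullable string field "value"
```

So the failure is likely in how the attribute metadata for the unquoted file was stored when the data source was created, not in the workflow itself, which matches the two-column file working while this one does not.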

ojvictor commented 7 years ago

Each transaction must have its set of items in a single column, so the system could read the first file but not the second. When you create a workflow to get the Frequent Itemset Mining, try to add a "Transformation" box before the Frequent Itemset box, because it needs a vector of items as a parameter. [screenshot: screen shot 2017-09-22 at 00 12 02]

With the Transformation expression editor, you can use a split function to break each line into an array of items (try to split each line by comma). Ex.: split(skills, ',') [screenshot: screen shot 2017-09-22 at 00 18 01]

Create a name for the new transformed attribute; you will use it in the Frequent Itemset box. Insert a value for the minimum support. [screenshot: screen shot 2017-09-22 at 00 21 05]

Save and execute.
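
For anyone reproducing this outside the Lemonade UI, the pipeline described above (read the two-column file, split the quoted skills string into an array, then mine frequent itemsets and association rules) maps directly onto PySpark's FP-Growth. A minimal sketch, assuming a skills.csv with header columns index and skills as described earlier in the thread; note that Spark ML's FPGrowth takes minSupport as a fraction of the transactions, not an absolute count such as 10000:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import split
from pyspark.ml.fpm import FPGrowth  # available in Spark 2.2+

spark = SparkSession.builder.appName('fim-sketch').getOrCreate()

# Two-column layout from the second upload: index, skills (skills is a
# quoted string with items separated by commas).
df = spark.read.csv('skills.csv', header=True, quote='"')

# Equivalent of the "Transformation" box: split(skills, ',') turns each
# quoted string into an array of items, one array per transaction.
df = df.withColumn('skills_list', split(df['skills'], ','))

# Equivalent of the "Frequent itemsets mining" box. minSupport is a
# fraction (e.g. 0.1 = 10% of transactions), and items within a single
# transaction must be unique for FP-Growth.
fp = FPGrowth(itemsCol='skills_list', minSupport=0.1, minConfidence=0.5)
model = fp.fit(df)

model.freqItemsets.show()      # the "output data" port
model.associationRules.show()  # the "rules output" port
```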