VowpalWabbit / vowpal_wabbit

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.
https://vowpalwabbit.org

Running Dependency Parsing Demo with different datasets failing #2413

Closed Casperfrc closed 4 years ago

Casperfrc commented 4 years ago

Describe the bug

The demo is in the following directory: vowpal_wabbit/demo/dependencyparsing/

I am looking into running the demo on datasets other than the standard wsj_train_subset and wsj_test_subset. I created test files from other dependency parsing data I found, but even after formatting the data to match what the demo seems to expect, the demo won't fully process it.

I created the following files based on actual data: custom_test.txt custom_training.txt

I am aware the spacing is not exactly the same as in the demo's data, but I gathered that it makes no difference. I did, however, also create a very small dataset that triggers the same error: small_test.txt

Following the steps of the Makefile, I have isolated the issue. I managed to parse the training data and the test data with parse_data.py, but when I reach the dep.model part of the Makefile, it halts and prints the following:

final_regressor = model.vw
Enabling FTRL based optimization
Algorithm used: Proximal-FTRL
ftrl_alpha = 0.005
ftrl_beta = 0.1
Num weight bits = 30
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
creating cache_file = /home/casperfrc/projects/bachelor_contextual_bandit/data/dependency_parsing/custom_training.cache
Reading datafile = /home/casperfrc/projects/bachelor_contextual_bandit/data/dependency_parsing/custom_training
num sources = 1
vw example #5(cost_sensitive.cc:179): invalid cost: specification -- no names on: :

To Reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/VowpalWabbit/vowpal_wabbit.git
  2. cd vowpal_wabbit/demo/dependencyparsing
  3. download the training and test data from the intro
  4. python3 parse_data.py custom_training model.vw
  5. python3 parse_data.py custom_test model_tested.vw
  6. vw --passes 3 -d custom_training -k -c --search_rollin mix_per_roll --search_task dep_parser --search 12 --search_alpha 1e-5 --search_rollout oracle --holdout_off -f tested_model.vw --search_history_length 3 --search_no_caching -b 30 --root_label 8 --num_label 12 --nn 5 --ftrl

Expected behavior

I was expecting a model to be created.

Observed Behavior

As mentioned earlier, the error message is:

vw example #5(cost_sensitive.cc:179): invalid cost: specification -- no names on: :

When just running make dep.perf, the error is instead:

Makefile:38: *** missing separator. Stop.

This really made me question my data files, but I just can't find the issue.

Environment

What version of VW did you use? 8.5.0

What OS or language did you use? I'm on Ubuntu 18.04

jackgerrits commented 4 years ago

I think you got a little mixed up with the files you're passing VW. parse_data.py converts the original format to VW text cost sensitive examples.

python3 parse_data.py <input_file> <output_file>

So you probably want to run:

python3 parse_data.py custom_training.txt vw_custom_training.txt
python3 parse_data.py custom_test.txt vw_custom_test.txt
vw --passes 3 -d vw_custom_training.txt -k -c --search_rollin mix_per_roll --search_task dep_parser --search 12 --search_alpha 1e-5 --search_rollout oracle --holdout_off -f model.vw --search_history_length 3 --search_no_caching -b 30 --root_label 8 --num_label 12 --nn 5 --ftrl

When I run this I get the following error:

terminate called after throwing an instance of 'VW::vw_exception'
  what():  invalid label 13 which is > num actions=12
fish: 'vw --passes 3 -d train_data.txt…' terminated by signal SIGABRT (Abort)

I am not familiar with the dependency parsing scenario, but the following is the multi example that is causing the error. There is likely an issue with the way the original dataset was formed, but I do not know enough about the dependency parsing scenario to be able to say without researching deeper.

4 9 4:aux|w do |p vbp
3 7 3:compound|w museum |p nn
4 10 4:nsubj|w labels |p nns
0 8 0:root|w have |p vb
6 12 6:det|w an |p dt
4 7 4:obj|w impact |p nn
8 4 8:case|w on |p in
6 2 6:nmod|w how |p wrb
10 10 10:nsubj|w people |p nns
8 13 8:acl:relcl|w look |p vbp         <--- This is the troublesome example
12 4 12:case|w at |p in
10 11 10:obl|w artworks |p nns
4 3 4:punct|w ? |p .
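
A quick way to locate lines like this one is to scan the converted file before training. The following is only a minimal sketch: it assumes, based on the example above, that the second whitespace-separated field of each line is the label id that VW compares against --num_label. The function name find_out_of_range_labels is hypothetical and is not part of VW or parse_data.py.

```python
def find_out_of_range_labels(lines, num_label):
    """Return (line_number, label, text) for each line whose label id exceeds num_label.

    Assumption: the label id is the second whitespace-separated field,
    as in "8 13 8:acl:relcl|w look |p vbp".
    """
    offenders = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        # Skip blank lines (sentence separators) and lines without a numeric second field.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        label = int(fields[1])
        if label > num_label:
            offenders.append((lineno, label, line.strip()))
    return offenders


# Three lines taken from the example above.
sample = [
    "10 10 10:nsubj|w people |p nns",
    "8 13 8:acl:relcl|w look |p vbp",
    "12 4 12:case|w at |p in",
]

for lineno, label, text in find_out_of_range_labels(sample, num_label=12):
    print(f"line {lineno}: label {label} exceeds 12: {text}")
```

To check a whole file, pass an open file handle instead of the list, for example find_out_of_range_labels(open("vw_custom_training.txt"), 12).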

Hope this helps! Let me know if you have more questions

Casperfrc commented 4 years ago

Hey Jack,

This really helped. I realised what the issue was after running it exactly as you told me. (I also realised I had broken the Makefile in some way, so I re-downloaded it, whoops.)

The issue was simply that my data has more labels than the command in the Makefile defines: it sets --num_label 12, and I have 34.
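
For anyone hitting the same mismatch, the label count can be read off the converted file instead of guessed. This is a rough sketch under two assumptions taken from the example earlier in the thread: the second field of each line is the numeric label id, and the third field has the form head:relation. It also assumes, from the "invalid label 13 which is > num actions=12" message, that --num_label has to be at least the largest label id. The helper names are hypothetical.

```python
def relation_inventory(lines):
    """Map each relation name to the set of label ids it appears with."""
    inventory = {}
    for line in lines:
        # Everything before the first "|" holds "<head> <label> <head>:<relation>".
        prefix = line.split("|", 1)[0].split()
        if len(prefix) < 3 or not prefix[1].isdigit() or ":" not in prefix[2]:
            continue  # blank line or a line that is not in the assumed format
        label = int(prefix[1])
        # Split on the first colon only, so "8:acl:relcl" yields "acl:relcl".
        relation = prefix[2].split(":", 1)[1]
        inventory.setdefault(relation, set()).add(label)
    return inventory


def required_num_label(inventory):
    """Largest label id seen, i.e. the assumed minimum for --num_label."""
    return max((max(ids) for ids in inventory.values()), default=0)


# Lines taken from the example earlier in the thread.
sample = [
    "3 7 3:compound|w museum |p nn",
    "4 7 4:obj|w impact |p nn",
    "8 13 8:acl:relcl|w look |p vbp",
]

inventory = relation_inventory(sample)
for relation, ids in sorted(inventory.items()):
    print(relation, sorted(ids))
print("largest label id:", required_num_label(inventory))
```

The inventory also shows when several relation names share one label id, which is worth knowing before settling on a value for --num_label.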

Thanks a lot for the help! Really appreciate what you guys are working on here.

- Casper

jackgerrits commented 4 years ago

Glad I could help you out @Casperfrc! Don't hesitate to reach out if you face issues.