Riccorl / transformer-srl

Reimplementation of a BERT-based model (Shi et al., 2019), currently the state of the art for English SRL. This model also performs predicate disambiguation.

How do you train the model on VerbAtlas? #11

Closed. egumasa closed this issue 3 years ago.

egumasa commented 3 years ago

Hi, thank you very much for this wonderful package and your encouraging responses in the discussions on this repo! I have downloaded the package and the pretrained model, and it runs perfectly.

I was not sure whether this is an appropriate place to ask, but I wonder how you trained the model on VerbAtlas. It looks like the pretrained model is trained on PropBank, but I found that you have also tried VerbAtlas (if I am not mistaken). I have downloaded the VerbAtlas resources from here, but I did not find sentence-level annotations. Could you give me some insight into how you would go about this? In the meantime (in case there are no publicly available VerbAtlas annotations), I am thinking of looking up VerbAtlas frames using the predicate IDs returned by this package...

Thank you very much, and I apologize if this is off-topic!

Riccorl commented 3 years ago

Hi!

The pretrained model is trained only on PropBank. If you want to train using labels from VerbAtlas, you can produce a dataset using the original CoNLL-2012 dataset and the mapping files inside the VerbAtlas resources you linked. There should be a file called pb2va.tsv that maps the labels from PropBank to VerbAtlas.
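Roughly, loading that mapping could look like the sketch below. This assumes a simple tab-separated layout with the PropBank sense, its VerbAtlas frame, and then `ARGx>Role` pairs per row; check the actual columns of `pb2va.tsv` before relying on it.

```python
from collections import defaultdict
import csv


def load_pb2va(path="pb2va.tsv"):
    """Load the PropBank -> VerbAtlas mapping from pb2va.tsv.

    Assumed layout (verify against the real file): one tab-separated row per
    PropBank sense, holding the sense (e.g. 'abandon.01'), the VerbAtlas frame
    it maps to, and then 'ARGx>Role' pairs for its arguments.
    """
    frame_map = {}                    # 'abandon.01' -> 'LEAVE-BEHIND'
    role_map = defaultdict(dict)      # 'abandon.01' -> {'ARG0': 'Agent', ...}
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) < 2 or row[0].startswith("#"):
                continue
            pb_sense, va_frame, *arg_pairs = row
            frame_map[pb_sense] = va_frame
            for pair in arg_pairs:
                pb_arg, _, va_role = pair.partition(">")
                if va_role:
                    role_map[pb_sense][pb_arg] = va_role
    return frame_map, role_map
```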

egumasa commented 3 years ago

Thank you very much for your prompt response and your suggestion! So if I understand correctly, I need OntoNotes 5, which I luckily have thanks to an institutional subscription to the LDC, plus the CoNLL-2012 shared task data? Then I need to prepare the dataset following the description on the CoNLL-2012 website?

It looks like the subsequent step is swapping the CoNLL frame ID with the VerbAtlas one and doing the same for the semantic roles (e.g. column 7 or 8 in the CoNLL format)? I also wonder where the semantic role information is stored (where I would have to switch ARG0 to Agent, for example).

For the actual training step, can I follow the steps below, from https://github.com/Riccorl/transformer-srl/issues/2#issuecomment-691740383?

Hi!

Unfortunately, I don't have a pretrained model with PropBank inventory. You have to train it :(

However, training should be easy. It can run on a CPU, yes, but it's really slow. You can use Colab to train it with a GPU for free. To run it, you can clone this repo and run:

export SRL_TRAIN_DATA_PATH="path/to/train"
export SRL_VALIDATION_DATA_PATH="path/to/development"
allennlp train training_config/bert_base_span.jsonnet -s path/to/model --include-package transformer_srl

where training_config/bert_base_span.jsonnet is the config file that I usually use.

So, setting the train and development paths will allow that last command to train the model through AllenNLP and transformer-srl?

Thank you so much for your time. I really appreciate the package and your generosity in answering my questions!

Riccorl commented 3 years ago

So if I understand correctly, I need OntoNotes 5, which I luckily have thanks to an institutional subscription to the LDC, plus the CoNLL-2012 shared task data? Then I need to prepare the dataset following the description on the CoNLL-2012 website?

Yes, that's right. Keep in mind that there are two references for the CoNLL-2012 preparation. One of them is broken and produces spans that overlap. The reader in this repo cannot read that version and throws an error. Unfortunately I don't remember which one is correct, but it should pop up on Google if you search for it.
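If it helps, a rough sanity check like the one below can be run over the `*_gold_conll` files to spot overlapping argument spans before training. It assumes the usual CoNLL-2012 column layout (SRL argument columns between the named-entity column and the final coreference column), so adjust the indices if your files differ.

```python
import sys


def sentences(path):
    """Yield sentences (lists of token rows) from a *_gold_conll file."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#"):
                continue
            if not line.strip():
                if sent:
                    yield sent
                    sent = []
            else:
                sent.append(line.split())
    if sent:
        yield sent


def has_overlapping_spans(labels):
    """True if a span opens before the previous one closes in one SRL column."""
    depth = 0
    for label in labels:
        depth += label.count("(")
        if depth > 1:          # a new span started inside a still-open one
            return True
        depth -= label.count(")")
    return False


for path in sys.argv[1:]:
    for i, sent in enumerate(sentences(path)):
        # Assumed layout: columns 11 .. n-2 are the per-predicate SRL columns.
        n_cols = len(sent[0])
        for col in range(11, n_cols - 1):
            if has_overlapping_spans([row[col] for row in sent]):
                print(f"{path}: sentence {i}, column {col} has overlapping spans")
```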

It looks like the subsequent step is swapping the CoNLL frame ID with the VerbAtlas one and doing the same for the semantic roles (e.g. column 7 or 8 in the CoNLL format)? I also wonder where the semantic role information is stored (where I would have to switch ARG0 to Agent, for example).

I don't understand your question here :( Are you asking what to do next? You should replace the verb senses with the correct VerbAtlas frames and their roles, following the mapping. I suggest discarding anything that is not in the mapping, e.g. a verb sense that is not present in it, or a role that is not among its candidates.
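As a rough sketch (assuming the standard `*_gold_conll` column layout, with column 6 = lemma, column 7 = frameset ID, columns 11 onwards = one argument column per predicate, and the `frame_map`/`role_map` loaded above), the swap could look something like this:

```python
import re

ARG_RE = re.compile(r"\((?P<prefix>[RC]-)?(?P<arg>ARG[^*()]+)")


def convert_sentence(rows, frame_map, role_map):
    """Rewrite one CoNLL-2012 sentence (a list of token rows) in place,
    swapping PropBank senses/roles for VerbAtlas ones.

    Predicates or roles with no mapping are discarded, as suggested.
    """
    pred_rows = [i for i, r in enumerate(rows) if r[7] not in ("-", "")]
    for k, i in enumerate(pred_rows):
        sense = f"{rows[i][6]}.{rows[i][7]}"   # e.g. 'abandon.01'
        col = 11 + k                            # k-th predicate's argument column
        va_frame = frame_map.get(sense)
        if va_frame is None:
            # unmapped predicate: wipe its sense and its whole argument column
            rows[i][7] = "-"
            for r in rows:
                r[col] = "*"
            continue
        rows[i][7] = va_frame                   # store the VerbAtlas frame in the sense slot
        dropping = False                        # True while inside a span being discarded
        for r in rows:
            label = r[col]
            m = ARG_RE.match(label)
            if m:
                # R-/C- continuations are looked up by their base argument here
                va_role = role_map[sense].get(m.group("arg"))
                if va_role:
                    r[col] = label.replace(m.group("arg"), va_role, 1)
                else:
                    dropping = True
            if dropping:
                r[col] = "*"
                if label.endswith(")"):
                    dropping = False
    return rows
```

Where exactly the VerbAtlas frame should be stored, and how R-/C- arguments are handled, is worth double-checking against the dataset reader in this repo before converting the whole corpus.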

So, setting the train and development paths will allow that last command to train the model through AllenNLP and transformer-srl?

Yep, that should be correct. If you find errors, reach out to me; I don't remember the current state of the code. It should be stable, but I'm not sure :D
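If it is more convenient (e.g. in a Colab notebook), the same command can also be launched from Python. This is just the CLI call above wrapped in `subprocess`, with example paths; adjust them to your own splits and output directory.

```python
import os
import subprocess

# Example paths; point these at your own CoNLL-2012 train/development splits.
os.environ["SRL_TRAIN_DATA_PATH"] = "conll-2012/v4/data/train"
os.environ["SRL_VALIDATION_DATA_PATH"] = "conll-2012/v4/data/development"

# Same CLI call as in the command above, just launched from Python.
subprocess.run(
    [
        "allennlp", "train",
        "training_config/bert_base_span.jsonnet",
        "-s", "path/to/model",
        "--include-package", "transformer_srl",
    ],
    check=True,
)
```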

egumasa commented 3 years ago

Thank you for answering the question. Sorry about the grammar errors that hindered understanding. Yes, I was asking what to do when changing from the PropBank annotation to VerbAtlas. I think the CoNLL-2012 script ran and I have the data, so I will try replicating the model to see if the scores are similar. Then I will move on to swapping the annotation and try training on the new data.

Thank you again, and I think for now you can close the thread (I may come back with follow-up questions after trying to replicate the PropBank model)!

Riccorl commented 3 years ago

Thank you for answering the question. Sorry about the grammar errors that hindered understanding. Yes, I was asking what to do when changing from the PropBank annotation to VerbAtlas. I think the CoNLL-2012 script ran and I have the data, so I will try replicating the model to see if the scores are similar. Then I will move on to swapping the annotation and try training on the new data.

Thank you again, and I think for now you can close the thread (I may come back with follow-up questions after trying to replicate the PropBank model)!

Great! Let me know if you have other questions.

egumasa commented 3 years ago

Hi again, I am still trying to do step 1, which is replicating the pretrained model with OntoNotes 5 and CoNLL-2012. It took me a while to solve the dependency issues among the packages (I use an M1 Mac and installing TensorFlow was a challenge, but I also saw your other repo on this! Thanks for your documentation!).

So I think I got my CoNLL-2012 data cleaned, but when I ran `allennlp train training_config/bert_base_span.jsonnet -s models --include-package transformer_srl` I got the following error. It might be the issue you mentioned about the format of the CoNLL files.

```
(nlp) xxxxxxx transformer-srl % allennlp train training_config/bert_base_span.jsonnet -s models --include-package transformer_srl
2021-07-09 00:09:58,226 - INFO - allennlp.common.plugins - Plugin allennlp_models available
2021-07-09 00:09:58,230 - INFO - allennlp.common.plugins - Plugin transformer_srl available
2021-07-09 00:09:58,270 - INFO - allennlp.common.params - include_in_archive = None
2021-07-09 00:09:58,270 - INFO - allennlp.common.params - random_seed = 13370
2021-07-09 00:09:58,271 - INFO - allennlp.common.params - numpy_seed = 1337
2021-07-09 00:09:58,271 - INFO - allennlp.common.params - pytorch_seed = 133
2021-07-09 00:09:58,273 - INFO - allennlp.common.checks - Pytorch version: 1.7.1
2021-07-09 00:09:58,273 - INFO - allennlp.common.params - type = default
2021-07-09 00:09:58,273 - INFO - allennlp.common.params - dataset_reader.type = transformer_srl_span
2021-07-09 00:09:58,273 - INFO - allennlp.common.params - dataset_reader.lazy = False
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.cache_directory = None
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.max_instances = None
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.manual_distributed_sharding = False
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.manual_multi_process_sharding = False
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.token_indexers = None
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.domain_identifier = None
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.bert_model_name = None
2021-07-09 00:09:58,274 - INFO - allennlp.common.params - dataset_reader.model_name = bert-base-cased
2021-07-09 00:10:01,949 - INFO - allennlp.common.params - train_data_path = conll-2012/v4/data/train
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - vocabulary = <allennlp.common.lazy.Lazy object at 0x7ff6d924bbb0>
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - datasets_for_vocab_creation = None
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - validation_dataset_reader = None
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - validation_data_path = conll-2012/v4/data/development
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - validation_data_loader = None
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - test_data_path = None
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - evaluate_on_test = False
2021-07-09 00:10:01,950 - INFO - allennlp.common.params - batch_weight_key =
2021-07-09 00:10:01,950 - INFO - allennlp.training.util - Reading training data from conll-2012/v4/data/train
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks... To disable this warning, you can either:
```

Do you have any idea why this might be happening? Here is my environment. I appreciate any insights!

Thank you so much!

```
Name Version Build Channel
absl-py 0.13.0 pypi_0 pypi
alabaster 0.7.12 pyhd3eb1b0_0
allennlp 1.2.2 pypi_0 pypi allennlp-models 1.2.2 pypi_0 pypi appdirs 1.4.4 py_0
applaunchservices 0.2.1 py_0
appnope 0.1.2 py38hecd8cb5_1001
argh 0.26.2 py38_0
argon2-cffi 20.1.0 pypi_0 pypi arrow 0.13.1 py38_0
astroid 2.6.2 py38hecd8cb5_0
astunparse 1.6.3 pypi_0 pypi async_generator 1.10 pyhd3eb1b0_0
atomicwrites 1.4.0 py_0
attrs 21.2.0 pyhd3eb1b0_0
autopep8 1.5.6 pyhd3eb1b0_0
babel 2.9.1 pyhd3eb1b0_0
backcall 0.2.0 pyhd3eb1b0_0
beautifulsoup4 4.9.3 pypi_0 pypi benepar 0.1.3 pypi_0 pypi binaryornot 0.4.4 pyhd3eb1b0_1
black 19.10b0 py_0
bleach 3.3.0 pyhd3eb1b0_0
blis 0.7.4 pypi_0 pypi boto3 1.17.78 pypi_0 pypi botocore 1.20.78 pypi_0 pypi brotlipy 0.7.0 py38h9ed2024_1003
bs4 0.0.1 pypi_0 pypi ca-certificates 2021.5.25 hecd8cb5_1
cachetools 4.2.2 pypi_0 pypi catalogue 1.0.0 pypi_0 pypi certifi 2021.5.30 pypi_0 pypi cffi 1.14.5 py38h2125817_0
chardet 4.0.0 py38hecd8cb5_1003
click 7.1.2 pypi_0 pypi cloudpickle 1.6.0 py_0
colorama 0.4.4 pyhd3eb1b0_0
configparser 5.0.2 pypi_0 pypi conllu 4.2.1 pypi_0 pypi cookiecutter 1.7.2 pyhd3eb1b0_0
cryptography 3.4.7 py38h2fd3fbb_0
cymem 2.0.5 pypi_0 pypi cython 0.29.23 pypi_0 pypi dbus 1.13.18 h18a8e69_0
decorator 5.0.9 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
diff-match-patch 20200713 py_0
docker-pycreds 0.4.0 pypi_0 pypi docutils 0.17.1 py38hecd8cb5_1
dynet 2.1.2 pypi_0 pypi en-core-web-lg 2.3.1 pypi_0 pypi en-core-web-md 2.3.1 pypi_0 pypi en-core-web-sm 2.3.1 pypi_0 pypi entrypoints 0.3 py38_0
expat 2.4.1 h23ab428_2
filelock 3.0.12 pypi_0 pypi flake8 3.9.0 pyhd3eb1b0_0
flatbuffers 1.12 pypi_0 pypi ftfy 5.9 pypi_0 pypi future 0.18.2 py38_1
gast 0.3.3 pypi_0 pypi gettext 0.21.0 h7535e17_0
gitdb 4.0.7 pypi_0 pypi gitpython 3.1.17 pypi_0 pypi glib 2.68.2 hdf23fa2_0
google-auth 1.32.0 pypi_0 pypi google-auth-oauthlib 0.4.4 pypi_0 pypi google-pasta 0.2.0 pypi_0 pypi grpcio 1.32.0 pypi_0 pypi h5py 2.10.0 pypi_0 pypi huggingface-hub 0.0.9 pypi_0 pypi icu 58.2 h0a44026_3
idna 2.10 pyhd3eb1b0_0
imagesize 1.2.0 pyhd3eb1b0_0
importlib-metadata 3.10.0 py38hecd8cb5_0
importlib_metadata 3.10.0 hd3eb1b0_0
inflection 0.5.1 py38hecd8cb5_0
iniconfig 1.1.1 pypi_0 pypi intervaltree 3.1.0 py_0
iprogress 0.4 pypi_0 pypi ipykernel 5.3.4 py38h5ca1d4c_0
ipython 7.22.0 py38h01d92e1_0
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.6.3 pypi_0 pypi isort 5.9.1 pyhd3eb1b0_0
jedi 0.17.2 py38hecd8cb5_1
jinja2 2.11.3 pyhd3eb1b0_0
jinja2-time 0.2.0 pyhd3eb1b0_2
jmespath 0.10.0 pypi_0 pypi joblib 1.0.1 pypi_0 pypi jpeg 9b he5867d9_2
jsonnet 0.17.0 pypi_0 pypi jsonpickle 2.0.0 pypi_0 pypi jsonschema 3.2.0 py_2
jupyter_client 6.1.12 pyhd3eb1b0_0
jupyter_core 4.7.1 py38hecd8cb5_0
jupyterlab-widgets 1.0.0 pypi_0 pypi jupyterlab_pygments 0.1.2 py_0
keras 2.4.3 pypi_0 pypi keras-preprocessing 1.1.2 pypi_0 pypi keyring 23.0.1 py38hecd8cb5_0
lazy-object-proxy 1.6.0 py38h9ed2024_0
libcxx 10.0.0 1
libffi 3.3 hb1e8313_2
libiconv 1.16 h1de35cc_0
libpng 1.6.37 ha441bb4_0
libsodium 1.0.18 h1de35cc_0
libspatialindex 1.9.3 h23ab428_0
libxml2 2.9.12 hcdb78fc_0
llvm-openmp 10.0.0 h28b9765_0
lmdb 1.2.1 pypi_0 pypi lxml 4.6.3 pypi_0 pypi markdown 3.3.4 pypi_0 pypi markupsafe 1.1.1 py38h1de35cc_1
mccabe 0.6.1 py38_1
mistune 0.8.4 py38h1de35cc_1001
mmh3 3.0.0 pypi_0 pypi more-itertools 8.8.0 pypi_0 pypi murmurhash 1.0.5 pypi_0 pypi mypy_extensions 0.4.3 py38_0
nbclient 0.5.3 pyhd3eb1b0_0
nbconvert 6.1.0 py38hecd8cb5_0
nbformat 5.1.3 pyhd3eb1b0_0
ncurses 6.2 h0a44026_1
nest-asyncio 1.5.1 pyhd3eb1b0_0
nltk 3.6.2 pypi_0 pypi notebook 6.4.0 pypi_0 pypi numpy 1.21.0 pypi_0 pypi numpydoc 1.1.0 pyhd3eb1b0_1
oauthlib 3.1.1 pypi_0 pypi openssl 1.1.1k h9ed2024_0
opt-einsum 3.3.0 pypi_0 pypi overrides 3.1.0 pypi_0 pypi packaging 20.9 pyhd3eb1b0_0
pandoc 2.12 hecd8cb5_0
pandocfilters 1.4.3 py38hecd8cb5_1
parso 0.7.0 py_0
pathspec 0.7.0 py_0
pathtools 0.1.2 pypi_0 pypi pathy 0.5.2 pypi_0 pypi pcre 8.45 h23ab428_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 8.2.0 pypi_0 pypi pip 21.1.3 py38hecd8cb5_0
plac 1.1.3 pypi_0 pypi pluggy 0.13.1 py38hecd8cb5_0
poyo 0.5.0 pyhd3eb1b0_0
preshed 3.0.5 pypi_0 pypi prometheus-client 0.10.1 pypi_0 pypi promise 2.3 pypi_0 pypi prompt-toolkit 3.0.17 pyh06a4308_0
protobuf 3.17.3 pypi_0 pypi psutil 5.8.0 py38h9ed2024_1
ptyprocess 0.7.0 pyhd3eb1b0_2
py 1.10.0 pypi_0 pypi py-rouge 1.1 pypi_0 pypi pyasn1 0.4.8 pypi_0 pypi pyasn1-modules 0.2.8 pypi_0 pypi pycodestyle 2.6.0 pyhd3eb1b0_0
pycparser 2.20 py_2
pydantic 1.7.4 pypi_0 pypi pydocstyle 6.1.1 pyhd3eb1b0_0
pyflakes 2.2.0 pyhd3eb1b0_0
pyfn 1.3.13 pypi_0 pypi pygments 2.9.0 pyhd3eb1b0_0
pylint 2.9.1 py38hecd8cb5_1
pyls-black 0.4.6 hd3eb1b0_0
pyls-spyder 0.3.2 pyhd3eb1b0_0
pyopenssl 20.0.1 pyhd3eb1b0_1
pyparsing 2.4.7 pyhd3eb1b0_0
pyqt 5.9.2 py38h655552a_2
pyrsistent 0.17.3 py38haf1e3a3_0
pysocks 1.7.1 py38_1
pytest 6.2.4 pypi_0 pypi python 3.8.8 h88f2d9e_5
python-dateutil 2.8.1 pyhd3eb1b0_0
python-jsonrpc-server 0.4.0 py_0
python-language-server 0.36.2 pyhd3eb1b0_0
python-slugify 5.0.2 pyhd3eb1b0_0
python.app 3 py38h9ed2024_0
pytokenizations 0.8.3 pypi_0 pypi pytz 2021.1 pyhd3eb1b0_0
pyyaml 5.4.1 py38h9ed2024_1
pyzmq 20.0.0 py38h23ab428_1
qdarkstyle 3.0.2 pyhd3eb1b0_0
qstylizer 0.1.10 pyhd3eb1b0_0
qt 5.9.7 h468cd18_1
qtawesome 1.0.2 pyhd3eb1b0_0
qtconsole 5.1.0 pyhd3eb1b0_0
qtpy 1.9.0 py_0
readline 8.1 h9ed2024_0
regex 2021.4.4 py38h9ed2024_0
requests 2.25.1 pyhd3eb1b0_0
requests-oauthlib 1.3.0 pypi_0 pypi rope 0.19.0 pyhd3eb1b0_0
rsa 4.7.2 pypi_0 pypi rtree 0.9.7 py38hecd8cb5_1
s3transfer 0.4.2 pypi_0 pypi sacremoses 0.0.45 pypi_0 pypi scikit-learn 0.24.2 pypi_0 pypi scipy 1.6.3 pypi_0 pypi send2trash 1.5.0 pypi_0 pypi sentencepiece 0.1.91 pypi_0 pypi sentry-sdk 1.1.0 pypi_0 pypi setuptools 57.0.0 pypi_0 pypi shortuuid 1.0.1 pypi_0 pypi sip 4.19.8 py38h0a44026_0
six 1.16.0 pyhd3eb1b0_0
smart-open 3.0.0 pypi_0 pypi smmap 4.0.0 pypi_0 pypi snowballstemmer 2.1.0 pyhd3eb1b0_0
sortedcontainers 2.4.0 pyhd3eb1b0_0
soupsieve 2.2.1 pypi_0 pypi spacy 2.3.7 pypi_0 pypi spacy-alignments 0.8.3 pypi_0 pypi spacy-legacy 3.0.5 pypi_0 pypi spacy-wordnet 0.0.5 pypi_0 pypi sphinx 4.0.2 pyhd3eb1b0_0
sphinxcontrib-applehelp 1.0.2 pyhd3eb1b0_0
sphinxcontrib-devhelp 1.0.2 pyhd3eb1b0_0
sphinxcontrib-htmlhelp 2.0.0 pyhd3eb1b0_0
sphinxcontrib-jsmath 1.0.1 pyhd3eb1b0_0
sphinxcontrib-qthelp 1.0.3 pyhd3eb1b0_0
sphinxcontrib-serializinghtml 1.1.5 pyhd3eb1b0_0
spyder 5.0.0 py38hecd8cb5_1
spyder-kernels 2.0.1 py38hecd8cb5_0
sqlite 3.36.0 hce871da_0
srsly 1.0.5 pypi_0 pypi subprocess32 3.5.4 pypi_0 pypi tensorboard 2.5.0 pypi_0 pypi tensorboard-data-server 0.6.1 pypi_0 pypi tensorboard-plugin-wit 1.8.0 pypi_0 pypi tensorboardx 2.2 pypi_0 pypi tensorflow 2.4.1 pypi_0 pypi tensorflow-estimator 2.4.0 pypi_0 pypi termcolor 1.1.0 pypi_0 pypi terminado 0.10.0 pypi_0 pypi testpath 0.5.0 pyhd3eb1b0_0
text-unidecode 1.3 py_0
textdistance 4.2.1 pyhd3eb1b0_0
thinc 7.4.5 pypi_0 pypi threadpoolctl 2.1.0 pypi_0 pypi three-merge 0.1.1 pyhd3eb1b0_0
tinycss 0.4 pyhd3eb1b0_1002
tk 8.6.10 hb0a8c7a_0
tokenizers 0.9.3 pypi_0 pypi toml 0.10.2 pyhd3eb1b0_0
torch 1.7.1 pypi_0 pypi torch-struct 0.5 pypi_0 pypi torchcontrib 0.0.2 pypi_0 pypi torchvision 0.8.2 pypi_0 pypi tornado 6.1 py38h9ed2024_0
tqdm 4.60.0 pypi_0 pypi traitlets 5.0.5 pyhd3eb1b0_0
transformer-srl 2.4.6 pypi_0 pypi transformers 3.5.1 pypi_0 pypi typed-ast 1.4.3 py38h9ed2024_1
typer 0.3.2 pypi_0 pypi typing_extensions 3.10.0.0 pyh06a4308_0
ujson 4.0.2 py38h23ab428_0
unidecode 1.2.0 pyhd3eb1b0_0
urllib3 1.26.6 pyhd3eb1b0_1
wandb 0.10.30 pypi_0 pypi wasabi 0.8.2 pypi_0 pypi watchdog 1.0.2 py38h9ed2024_1
wcwidth 0.2.5 py_0
webencodings 0.5.1 py38_1
werkzeug 2.0.1 pypi_0 pypi wheel 0.36.2 pypi_0 pypi whichcraft 0.6.1 pyhd3eb1b0_0
widgetsnbextension 3.5.1 pypi_0 pypi word2number 1.1 pypi_0 pypi wrapt 1.12.1 py38haf1e3a3_1
wurlitzer 2.1.0 py38hecd8cb5_0
xz 5.2.5 h1de35cc_0
yaml 0.2.5 haf1e3a3_0
yapf 0.31.0 pyhd3eb1b0_0
zeromq 4.3.4 h23ab428_0
zipp 3.4.1 pyhd3eb1b0_0
zlib 1.2.11 h1de35cc_3
```

Riccorl commented 3 years ago

I am still trying to do step 1, which is replicating the pretrained model with OntoNotes 5 and CoNLL-2012. It took me a while to solve the dependency issues among the packages (I use an M1 Mac and installing TensorFlow was a challenge, but I also saw your other repo on this! Thanks for your documentation!).

There is an update on this! TensorFlow 2.5 now supports M1 natively, through plugins https://developer.apple.com/metal/tensorflow-plugin/

As for the error: I remember I faced the same issue, but I don't remember how I solved it. You should try to process the OntoNotes files again, using the scripts here and the support files here. If I recall correctly, that solved my problem.

egumasa commented 3 years ago

Thank you for this! I wanted to upgrade with Miniconda, but I use Spyder as my environment, which is not supported by Miniconda, to my knowledge. If I train the model with TensorFlow 2.5 (in a native M1 environment), does that become a requirement for the environment in which I use the model for prediction? (Assuming I use the same version of transformer-srl, 2.4.6.)

Thanks to your guidance, I was able to fix the issue! I also looked at their dataset, but I guess my problem was that I did not remove the Arabic data from the directory (it is all created at the same time when constructing the CoNLL data). Training has now started, so I assume the CoNLL issue is fixed. Hoping that I can reproduce similar results!
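For anyone hitting the same thing, a minimal sketch of that cleanup (assuming the usual CoNLL-2012 v4 directory layout; adjust the paths to your own tree) could be:

```python
import pathlib

# Drop the Arabic (and Chinese) files that the CoNLL-2012 scripts generate
# alongside the English ones, so the reader only sees the English data.
root = pathlib.Path("conll-2012/v4/data")
for split in ("train", "development"):
    for lang in ("arabic", "chinese"):
        lang_dir = root / split / "data" / lang
        if not lang_dir.exists():
            continue
        for f in sorted(lang_dir.rglob("*_conll")):
            print("removing", f)
            f.unlink()
```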

Riccorl commented 3 years ago

Thank you for this! I wanted to upgrade with Miniconda, but I use Spyder as my environment, which is not supported by Miniconda, to my knowledge. If I train the model with TensorFlow 2.5 (in a native M1 environment), does that become a requirement for the environment in which I use the model for prediction? (Assuming I use the same version of transformer-srl, 2.4.6.)

Unfortunately transformer-srl does not support TensorFlow. If you want to use TF you have to implement the model from scratch.

Thanks to your guidance, I was able to fix the issue! I also looked at their dataset, but I guess my issue was that I did not remove Arabic data from the directory (they were all created at the same time when constructing the CoNLL). It now started the training, so I assume the CoNLL issue is fixed. Hoping that I can reproduce similar results!

Great! Let me know how it goes :)

egumasa commented 3 years ago

Hi, this is just an update on my progress (because someone might find it helpful).

As for the dataset, the CoNLL-2012 data from the original source ended up failing (during the first validation), so I followed the advice below and reran the preparation code.

As for the error: I remember I faced the same issue, but I don't remember how I solved it. You should try to process the OntoNotes files again, using the scripts here and the support files here. If I recall correctly, that solved my problem.

I am currently running the training with the newly created dataset, and it is now at epoch 4/14 (so I assume it will be fine). Because I am training on the CPU and did not change any settings for multicore processing etc., it is now estimating 15 more days to complete the training. Thank you for the useful information about the CoNLL dataset! I will keep this thread updated (for future reference)!

egumasa commented 3 years ago

Hi, this is another update. The data preparation code above worked! The training finished on CPU (an M1 Mac mini with 16 GB of memory). So basically I was able to replicate your results, and now I will be working on swapping the PropBank labels for VerbAtlas. I wonder, though, whether it is possible to use other label sets, such as VerbNet or FrameNet (perhaps via SemLink or equivalents)? Maybe an issue would arise from the size of the data, which would require rethinking the architecture and hyperparameters...

Here is the final info for those who might also want to replicate the pretrained model (a screenshot of the final training metrics was attached).

Riccorl commented 3 years ago

I guess you can use the same architecture and parameters as a baseline when changing the dataset. I didn't experiment too much with hyper-parameters, but I guess you could squeeze out another 0.X