Converting CoNLL-U to JSON

Most corpora are currently distributed in CoNLL-U format (nerus, syntagrus, etc.).

Converting with the spaCy CLI tools, for example:

spacy convert nerus_lenta.conllu ./nerus_json -l ru

or

spacy convert nerus_lenta.conllu ./nerus_json -l ru -c ner

leads to errors during model training:

spacy train ru /home/sergey/Py_Spacy_RU/test /home/sergey/Py_Spacy_RU/data/nerus/nerus_json/try.json /home/sergey/Py_Spacy_RU/data/nerus/nerus_json/try.json --base-model /home/sergey/Py_Spacy_RU/ru2 --n-iter 20 --n-early-stopping 5 --pipeline 'ner'

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 248, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/util.py", line 535, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 217, in train_docs
  File "gold.pyx", line 233, in iter_gold_docs
  File "gold.pyx", line 253, in spacy.gold.GoldCorpus._make_golds
  File "gold.pyx", line 443, in spacy.gold.GoldParse.from_annot_tuples
  File "gold.pyx", line 593, in spacy.gold.GoldParse.__init__
ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {3, 5}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 431, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

I'd like colleagues to share their experience: how does everyone convert conllu to json? Are there examples of a successful conversion followed by training with the spaCy CLI?

By the way, the Nerus annotation contains cycles, and a sentence may have more than one root. Ideally such sentences should have been removed, but that's how it is for now.
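A minimal sketch of how such sentences could be detected before training, assuming the HEAD column (CoNLL-U column 7) has already been read into a list of integers. The function name find_tree_problems is hypothetical and this is not how spaCy itself performs the check.

```python
from typing import Dict, List, Set


def find_tree_problems(heads: List[int]) -> Dict[str, List[int]]:
    """Inspect one sentence's HEAD column.

    heads[i] is the head of token i + 1 (1-based ids, 0 means root).
    Returns the ids of all roots and the ids of tokens that sit on a cycle.
    """
    roots = [i + 1 for i, head in enumerate(heads) if head == 0]
    in_cycle: Set[int] = set()
    for start in range(1, len(heads) + 1):
        path: List[int] = []
        node = start
        # Walk up towards the root; meeting a node twice means a cycle.
        while node != 0 and node not in path:
            path.append(node)
            node = heads[node - 1]
        if node != 0:
            in_cycle.update(path[path.index(node):])
    return {"roots": roots, "cycle": sorted(in_cycle)}


# Tokens 3 and 5 point at each other, as in the E069 message above.
report = find_tree_problems([2, 0, 5, 2, 3])
print(report)
```

A sentence would be kept only if it has exactly one root and an empty cycle list.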

@kuk Do I understand correctly that if spacy train ... crashes with an error about cycles, this is "normal", i.e. such sentences have to be removed by hand? What about other errors, for example:

KeyError: "[E022] Could not find a transition with the name 'U-' in the NER model."

for syntagrus?

So the question remains: how does everyone convert conllu to json (code-wise...)?

I can't help with spaCy. Judging by "Invalid gold-standard parse tree. Found cycle between word IDs", such sentences need to be removed.

This problem is specific to nerus; I'll look into it next week. syntagrus and the GramEval2020 datasets are fine.

Regarding syntagrus. Raw data (conllu format).

Attempt 1.

spacy convert ru_syntagrus-ud-train.conllu ./ -c ner
✔ Generated output file (1 documents)
ru_syntagrus-ud-train.json

spacy convert ru_syntagrus-ud-test.conllu ./ -c ner
✔ Generated output file (1 documents)
ru_syntagrus-ud-test.json

spacy train ru /home/sergey/Py_Spacy_RU/test  /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-train.json /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-test.json --base-model /home/sergey/Py_Spacy_RU/ru2 --n-iter 20 --n-early-stopping 5 --pipeline 'ner'
Training pipeline: ['ner']
Starting with base model '/home/sergey/Py_Spacy_RU/ru2'
Counting training words (limit=0)

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
✔ Saved model to output directory                                                                                                      
/home/sergey/Py_Spacy_RU/test/model-final

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 414, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 517, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 106, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'U-03Anketa.xml_1' in the NER model."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 425, in _collate_best_model
    bests[component] = _find_best(output_path, component)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 444, in _find_best
    accs = srsly.read_json(epoch_model / "accuracy.json")
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/srsly/_json_api.py", line 50, in read_json
    file_path = force_path(location)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/srsly/util.py", line 21, in force_path
    raise ValueError("Can't read file: {}".format(location))
ValueError: Can't read file: /home/sergey/Py_Spacy_RU/test/model-best/accuracy.json

The error: KeyError: "[E022] Could not find a transition with the name 'U-03Anketa.xml_1' in the NER model." As far as I understand, the spaCy converter misinterprets the data after the hash sign (the # comment lines). It also bothers me that only 1 document is generated in the JSON.

Attempt 2.

If the value after the hash sign (the source name?) is replaced with an integer, training fails on the next hash comment:

awk -v c=1 'sub(/^# sent_id = .*xml.*/, "# sent_id = " c) {c++}; {print}' ru_syntagrus-ud-train.conllu > ru_syntagrus-ud-train_w_xml.conllu
awk -v c=1 'sub(/^# sent_id = .*xml.*/, "# sent_id = " c) {c++}; {print}' ru_syntagrus-ud-test.conllu > ru_syntagrus-ud-test_w_xml.conllu

spacy train ru /home/sergey/Py_Spacy_RU/test  /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-train_w_xml.json /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-test_w_xml.json --base-model /home/sergey/Py_Spacy_RU/ru2 --n-iter 20 --n-early-stopping 5 --pipeline 'ner'

Training pipeline: ['ner']
Starting with base model '/home/sergey/Py_Spacy_RU/ru2'
Counting training words (limit=0)

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
✔ Saved model to output directory                                                                                                      
/home/sergey/Py_Spacy_RU/test/model-final

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 414, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 517, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 106, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'U-' in the NER model."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 425, in _collate_best_model
    bests[component] = _find_best(output_path, component)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 444, in _find_best
    accs = srsly.read_json(epoch_model / "accuracy.json")
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/srsly/_json_api.py", line 50, in read_json
    file_path = force_path(location)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/srsly/util.py", line 21, in force_path
    raise ValueError("Can't read file: {}".format(location))
ValueError: Can't read file: /home/sergey/Py_Spacy_RU/test/model-best/accuracy.json

So the question is: how do you convert conllu to json?

@sbushmanov But syntagrus has no NER data.

OK, the syntagrus question is withdrawn...
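A quick way to see whether a CoNLL-U file carries NER tags at all is to tally the values of the 10th column before converting. This is only a sketch: the helper name misc_counter and the sample rows (including the Tag=B-LOC value) are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def misc_counter(lines: Iterable[str]) -> Counter:
    """Count the distinct values of the 10th (last) column of token lines."""
    counts: Counter = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) == 10:
            counts[cols[9]] += 1
    return counts


# Hypothetical sample: two token rows with a tag, one with an empty MISC field.
sample = [
    "# sent_id = 1",
    "\t".join(["1", "Москва", "_", "PROPN", "_", "_", "0", "root", "_", "Tag=B-LOC"]),
    "\t".join(["2", "спит", "_", "VERB", "_", "_", "1", "dep", "_", "Tag=O"]),
    "\t".join(["3", ".", "_", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
]
counts = misc_counter(sample)
print(counts.most_common())
```

If the tally shows no entity tags, training with --pipeline ner on that file has nothing to learn from.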

The spaCy converter doesn't like the Tag=... combination in the extra field of the CoNLL-U format, so we first strip Tag= and then convert with the default settings:

sed 's/Tag=//g' nerus_lenta.conllu > nerus_lenta_wo_Tag.conllu
spacy convert -n 5 -t json nerus_lenta_wo_Tag.conllu ./nerus_json

On an Ubuntu server, 128 GB of RAM plus 128 GB of swap was not enough, so I had to add more swap. The resulting file is 35 GB.

@sbushmanov thanks. For now I've stopped at the point where 64 GB is not enough. I'll look into it later this week, after #5 and #12. It could have been split into parts and/or saved as jsonl afterwards. Also, for the tags take https://github.com/buriy/spacy-ru/blob/v2.3/convert.sh ; you'll need that conversion, otherwise morphology won't train. In general, see the training and quality-evaluation scripts in https://github.com/buriy/spacy-ru/blob/v2.3/Makefile .

Script for splitting into parts:

awk 'BEGIN {nParMax = 3; npar = 0; nFile = 0}
     /^$/ {npar++; if (npar == nParMax) {close(out); nFile++; npar = 0; next}}
     {out = "foo." nFile; print $0 > out}' original.conllu

where nParMax sets the number of "paragraphs" per output file. A "paragraph" here is a document/paragraph/sentence delimited by an empty line. About 1,000,000 in our case (± 50%).

Regarding converting the data into a format suitable for training an NER model.

Input: a file in CoNLL-U format:

Format specification.
- 9 core fields plus a 10th field (MISC) for additional information (interpreted by the spaCy converter as NER, in our case)
- example of a data source

Output:

train.json/test.json, suitable for training an NER model with the spaCy CLI.

Step 1

Download and unpack the data (the spaCy converter does not read compressed data)

wget https://storage.yandexcloud.net/natasha-nerus/data/nerus_lenta.conllu.gz
gunzip -c nerus_lenta.conllu.gz > nerus_lenta.conllu
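If gunzip is not available, the same unpacking can be sketched with the Python standard library. The helper name gunzip_file is hypothetical; the data is copied in blocks, so the archive is never held in memory as a whole.

```python
import gzip
import shutil


def gunzip_file(src: str, dst: str) -> None:
    """Decompress src (.gz) into dst, streaming block by block."""
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


# Usage (file names as in the step above):
# gunzip_file("nerus_lenta.conllu.gz", "nerus_lenta.conllu")
```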

Step 2

The spaCy converter needs 15-20 times more memory (RAM + swap) than the size of the source file, so we start by splitting it into parts:

awk 'BEGIN {nParMax = 10000; npar = 0; nFile = 0}
     /^$/ {npar++; if (npar == nParMax) {close(out); nFile++; npar = 0; next}}
     {out = "nerus_lenta_split_" nFile ".conllu"; print $0 > out}' nerus_lenta.conllu

After splitting, I recommend checking that the file was cut into "whole" parts (look at the beginning/end of each file).
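That check can also be automated. The sketch below (the function name chunk_is_whole is hypothetical) verifies that within every sentence of a chunk the integer token ids start at 1 and increase by one, which catches a chunk that begins in the middle of a sentence. It cannot detect a sentence truncated at the very end of a file, so a look at the tail is still worthwhile.

```python
from typing import Iterable


def chunk_is_whole(lines: Iterable[str]) -> bool:
    """Return True if every sentence in the chunk numbers its tokens 1, 2, 3, ..."""
    expected = 1        # next integer token id expected in the current sentence
    seen_token = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            expected = 1          # blank line: a new sentence starts
            continue
        if line.startswith("#"):
            continue              # comment lines carry no token id
        token_id = line.split("\t", 1)[0]
        if not token_id.isdigit():
            continue              # multiword ranges (1-2) and empty nodes (1.1)
        if int(token_id) != expected:
            return False
        expected += 1
        seen_token = True
    return seen_token


# Usage (file name as produced by the awk script above):
# with open("nerus_lenta_split_0.conllu", encoding="utf-8") as f:
#     print(chunk_is_whole(f))
```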

Step 3

The spaCy converter does not understand the value "Tag=O" in the NER field, so Tag= has to be stripped

sed -i 's/Tag=//g' nerus_lenta_split*
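The sed command above removes Tag= everywhere in the file, including inside comment lines and word forms. A more conservative variant, sketched here in Python with a hypothetical helper strip_tag_prefix, touches only the 10th column of token lines and can be applied line by line.

```python
def strip_tag_prefix(line: str) -> str:
    """Remove a leading 'Tag=' from the 10th column; leave other lines alone."""
    if not line.strip() or line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) == 10 and cols[9].startswith("Tag="):
        cols[9] = cols[9][len("Tag="):]
    return "\t".join(cols) + newline


# Usage sketch (output file name is an assumption):
# with open("nerus_lenta_split_0.conllu", encoding="utf-8") as src, \
#         open("nerus_lenta_split_0_wo_Tag.conllu", "w", encoding="utf-8") as dst:
#     for line in src:
#         dst.write(strip_tag_prefix(line))
```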

Step 4

Convert to .json

spacy convert -n 5 nerus_lenta_split_0.conllu > ./nerus_json/nerus_lenta_sample_train.json
spacy convert -n 5 nerus_lenta_split_1.conllu > ./nerus_json/nerus_lenta_sample_test.json

From here on we work only with train/test

Step 5

Quoting a colleague from Explosion:

Hmm, I think it tries to read in all the available annotation in case you might be training a parser later. I would just remove the "dep" and "head" values for all the tokens from the corpus in the JSON files to get around this if you're only training an NER model (link).

remove the lines with "dep" and "head"

sed  '/"head"\|"dep"/d' nerus_lenta_sample_train.json > train_.json
sed  '/"head"\|"dep"/d' nerus_lenta_sample_test.json > test_.json
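The sed approach relies on every key sitting on its own line in the pretty-printed JSON. An alternative sketch that works on the parsed structure instead: the helper drop_keys is hypothetical, it removes the two keys wherever they occur and makes no assumption about the exact nesting of the converter output. The sample document below only imitates that output.

```python
import json


def drop_keys(node, keys=("head", "dep")):
    """Return a copy of a parsed JSON structure without the given dict keys."""
    if isinstance(node, dict):
        return {k: drop_keys(v, keys) for k, v in node.items() if k not in keys}
    if isinstance(node, list):
        return [drop_keys(item, keys) for item in node]
    return node


# Hypothetical miniature of a converted corpus, for illustration only.
sample = [{"id": 0, "paragraphs": [{"sentences": [{"tokens": [
    {"id": 0, "orth": "Москва", "tag": "PROPN", "head": 0, "dep": "ROOT", "ner": "U-LOC"}
]}]}]}]
cleaned = drop_keys(sample)
print(json.dumps(cleaned, ensure_ascii=False, indent=2))
```

Note that this loads the whole file into memory, so it is only practical on the already split parts.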

Step 6

Since the spaCy converter writes Cyrillic as escaped Unicode (\uXXXX sequences), and it is unknown (to me) how those escapes are handled downstream, we convert to plain readable Cyrillic

python -c 'import json; file = open("./train_.json", "r", encoding="utf-8"); data = json.load(file); file.close(); file = open("./train.json", "w", encoding="utf-8"); json.dump(data, file, indent=2, ensure_ascii=False); file.close()'
python -c 'import json; file = open("./test_.json", "r", encoding="utf-8"); data = json.load(file); file.close(); file = open("./test.json", "w", encoding="utf-8"); json.dump(data, file, indent=2, ensure_ascii=False); file.close()'
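For what it's worth, a standard JSON parser decodes both spellings to the same string, so this step mainly helps when reading the files by eye. A small check with Python's json module (the sample word is arbitrary):

```python
import json

# The same JSON string value, once with \uXXXX escapes and once written directly.
escaped = '"\\u041c\\u043e\\u0441\\u043a\\u0432\\u0430"'
readable = '"Москва"'

decoded_escaped = json.loads(escaped)
decoded_readable = json.loads(readable)
print(decoded_escaped == decoded_readable)

# ensure_ascii controls only how the text is written, not what it means.
print(json.dumps("Москва"))
print(json.dumps("Москва", ensure_ascii=False))
```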

We now have train.json/test.json

Step 7

Train NER (vectors can be added with -v)

spacy train -b /home/sergey/Py_Spacy_RU/ru2 -v /home/sergey/Py_Rasa_Rus/fasttext/cc.ru.300 ru -n 20 -ne 5 -p ner /home/sergey/Py_Spacy_RU/out train.json test.json

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
  0       0.000    2204.211    0.000   94.438   94.092   94.265    0.000  100.000    37517        0                                    
  1       0.000    1511.409    0.000   94.381   94.200   94.291    0.000  100.000    44223        0                                    
  2       0.000    1225.054    0.000   94.488   94.275   94.382    0.000  100.000    43401        0                                    
  3       0.000     959.977    0.000   94.621   94.242   94.431    0.000  100.000    43938        0                                    
  4       0.000     808.909    0.000   94.548   94.350   94.449    0.000  100.000    37187        0                                    
  5       0.000     719.143    0.000   94.546   94.175   94.360    0.000  100.000    37097        0                                    
  6       0.000     646.187    0.000   94.356   94.442   94.399    0.000  100.000    37760        0                                    
  7       0.000     578.597    0.000   94.194   94.367   94.280    0.000  100.000    36287        0                                    
  8       0.000     564.655    0.000   94.225   94.367   94.296    0.000  100.000    42891        0                                    
  9       0.000     529.465    0.000   94.295   94.200   94.247    0.000  100.000    36090        0                                    
Early stopping, best iteration is: 4
Best score = 94.44892209096467; Final iteration score = 94.24732310734218
✔ Saved model to output directory
/home/sergey/Py_Spacy_RU/out/model-final
✔ Created best model
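The log above is consistent with a simple patience rule: training stops once the NER F score has failed to improve for 5 iterations in a row (-ne 5). The sketch below is a generic illustration of that rule, not spaCy's actual implementation; the function name early_stopping is hypothetical and the scores are copied from the NER F column.

```python
from typing import List, Tuple


def early_stopping(scores: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_iteration, last_iteration_run) under a patience rule."""
    best_score = float("-inf")
    best_iter = -1
    since_best = 0
    for i, score in enumerate(scores):
        if score > best_score:
            best_score, best_iter, since_best = score, i, 0
        else:
            since_best += 1
        if since_best >= patience:
            return best_iter, i   # no improvement for `patience` iterations
    return best_iter, len(scores) - 1


ner_f = [94.265, 94.291, 94.382, 94.431, 94.449,
         94.360, 94.399, 94.280, 94.296, 94.247]
best, last = early_stopping(ner_f, patience=5)
print(best, last)
```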

Things to watch out for:

- the size of the pieces you split the file into. Python loads the data from disk into memory, so it is very demanding on available memory. Command-line tools process files more efficiently (line by line)
- handling of emoticons
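On the memory point: Python can also read a CoNLL-U file line by line when the code is written as a generator, so only one sentence is held in memory at a time. A minimal sketch (the name iter_sentences is hypothetical):

```python
from typing import Iterable, Iterator, List


def iter_sentences(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one sentence (a list of its lines) at a time from a CoNLL-U stream."""
    sentence: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line:
            sentence.append(line)
        elif sentence:
            yield sentence        # a blank line closes the current sentence
            sentence = []
    if sentence:
        yield sentence            # file did not end with a blank line


# Usage: a file object is itself a lazy iterable of lines.
# with open("nerus_lenta.conllu", encoding="utf-8") as f:
#     for sentence in iter_sentences(f):
#         ...
```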

buriy / spacy-ru

Converting CoNLL-U to JSON #24