Sprakbanken / grew_ndt2ud


Convert NDT to UD with Grew

This repo contains scripts and rule files to convert syntactic and morphological annotations from the Norwegian Dependency Treebank (NDT) to Universal Dependencies (UD).

The rules are written with Grew, which must be installed before running the conversion script.

Setup

Convert the treebank

./convert_ndt2ud.sh -v

The script can take three optional arguments:

flag  valid arguments          description
-l    nb, nn                   Two-letter language code. Default is nb.
-p    dev, test, train, gold   Dataset split (partition). Default is gold, i.e. the gold corpus selection of 200 manually corrected sentences.
-v    (none)                   Visualize the differences between the last official UD version and the newly converted conllu file with MaltEval.

Development process

The rules were developed with the following step-by-step approach.

  1. Run Grew with the main strategy file:

    LANG=nb
    PARTITION=dev #train
    NDT_FILE=data/ndt_nb_${PARTITION}_udmorph.conllu
    CONVERTED=data/grew_output_${PARTITION}.conllu
    
    grew transform \
      -i  $NDT_FILE \
      -o  $CONVERTED \
      -grs  rules/NDT_to_UD.grs \
      -strat "main_$LANG" \
      -safe_commands
  2. Fix punctuation:

    We use udapi together with our own post-processing rules to fix head attachment and the direction of relations to sentence-internal punctuation.

    cat $CONVERTED | udapy -s ud.FixPunct > tmp.conllu
    
    grew transform \
    -i tmp.conllu \
    -o $CONVERTED \
    -grs rules/NDT_to_UD.grs \
    -strat "postfix" \
    -safe_commands
    
    # Remove comment line with column names
    sed -i 1d $CONVERTED
  3. Validate the output with UD's validation script:

    python ../tools/validate.py --max-err 0 --lang no $CONVERTED 2>&1 | tee validation-report_ndt2ud.txt
    python utils/extract_errorlines.py -f validation-report_ndt2ud.txt
  4. Compare the result with a previous version of UD:

    Remove comment lines from the file before running it through MaltEval.

    python utils/parse_conllu.py -rc -f $CONVERTED -o tmp.conllu

    a. Relation statistics

    Swap the commented METRIC line to score the relation accuracy with or without dependency labels.

      # UAS / Unlabelled Attachment Score: whether a directed relation R(x,y) exists between the same nodes x, y in the other treebank
      # LAS / Labelled Attachment Score: whether the directed relation R(x,y) between nodes x, y also carries the same label
      METRIC=LAS
      #METRIC=UAS
      UD_OFFICIAL=data/${LANG}-ud-${PARTITION}_uten_hash.conllu
    
      java -jar dist-20141005/lib/MaltEval.jar \
        -s tmp.conllu \
        -g $UD_OFFICIAL \
        --GroupBy Deprel \
        --Metric $METRIC \
      > conversion_stats_${LANG}_${PARTITION}_${METRIC}.txt

    b. Visualize and compare sentence graphs in MaltEval

      java -jar dist-20141005/lib/MaltEval.jar -s tmp.conllu -g $UD_OFFICIAL -v 1
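
To make the two metrics in step 4a concrete, here is a minimal Python sketch of how UAS and LAS can be computed from two CoNLL-U strings. It is not MaltEval's implementation and its numbers may differ from MaltEval's (for example in how punctuation is treated). The helper names `row`, `read_arcs` and `attachment_scores` are hypothetical, and the sketch assumes both inputs contain the same tokens in the same order.

```python
from typing import List, Tuple

Arc = Tuple[str, str]  # (HEAD, DEPREL) of one token


def row(idx: int, form: str, head: int, deprel: str) -> str:
    """Build one tab-separated CoNLL-U token line (unused columns are '_')."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def read_arcs(conllu_text: str) -> List[Arc]:
    """Collect (HEAD, DEPREL) for every basic token line."""
    arcs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank separator lines and comment lines
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges and empty nodes
        arcs.append((cols[6], cols[7]))
    return arcs


def attachment_scores(system: str, gold: str) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions of all gold tokens."""
    sys_arcs, gold_arcs = read_arcs(system), read_arcs(gold)
    if not gold_arcs or len(sys_arcs) != len(gold_arcs):
        raise ValueError("inputs must contain the same, non-zero number of tokens")
    pairs = list(zip(sys_arcs, gold_arcs))
    uas_hits = sum(s[0] == g[0] for s, g in pairs)  # same head
    las_hits = sum(s == g for s, g in pairs)        # same head and same label
    return uas_hits / len(pairs), las_hits / len(pairs)


# Toy example: the converted tree has the right heads but one wrong label.
GOLD = "\n".join([row(1, "Hun", 2, "nsubj"), row(2, "sover", 0, "root"), row(3, ".", 2, "punct"), ""])
SYSTEM = "\n".join([row(1, "Hun", 2, "obj"), row(2, "sover", 0, "root"), row(3, ".", 2, "punct"), ""])

uas, las = attachment_scores(SYSTEM, GOLD)
print(f"UAS={uas:.3f} LAS={las:.3f}")
```

In the toy example all three heads agree, so UAS is 3/3, while only two of the three labels agree, so LAS is 2/3.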

Grew rules

The rules folder contains .grs files with rules and strategies, which are applied in the order defined by the main_nb and main_nn strategies in NDT_to_UD.grs.

See also the Grew documentation on commands for more information.

Match sentences with Grew patterns

We also used grew grep to match sentences and develop request patterns for the rules, to ensure we targeted the correct structures.

grew grep -request rules/testpattern.req -i $NDT_FILE > pattern_matches.json

Data selection

We gathered a few example sentences in the data/sentences folder to try out the effects of different patterns, rules and strategies. Example code for extracting the sentences matched in pattern_matches.json from NDT into a separate conllu file can be found in the Jupyter notebook process_NDT.ipynb.
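
As a rough illustration of that extraction step, the following Python sketch filters a CoNLL-U string down to the sentences listed in a grew grep result. It is not the code from process_NDT.ipynb. It assumes that the JSON output is a list of objects that each carry a `sent_id` key, and that every sentence in the treebank has a `# sent_id = ...` comment; the helper names and the toy data are hypothetical.

```python
import json
import re
from typing import List, Optional, Set

SENT_ID = re.compile(r"#\s*sent_id\s*=\s*(\S+)")


def row(idx: int, form: str, head: int, deprel: str) -> str:
    """Build one tab-separated CoNLL-U token line (unused columns are '_')."""
    return "\t".join([str(idx), form, "_", "_", "_", "_", str(head), deprel, "_", "_"])


def matched_ids(grep_json: str) -> Set[str]:
    """Sentence ids mentioned in the grep result (assumed list of {'sent_id': ...})."""
    return {match["sent_id"] for match in json.loads(grep_json)}


def split_sentences(conllu_text: str) -> List[str]:
    """Split a CoNLL-U string into sentence blocks at blank lines."""
    blocks = re.split(r"\n[ \t]*\n", conllu_text.strip("\n"))
    return [block for block in blocks if block.strip()]


def sentence_id(block: str) -> Optional[str]:
    """Return the value of the '# sent_id = ...' comment, if present."""
    for line in block.splitlines():
        found = SENT_ID.match(line)
        if found:
            return found.group(1)
    return None


def extract_matches(conllu_text: str, grep_json: str) -> str:
    """Keep only the sentences whose sent_id occurs in the grep result."""
    wanted = matched_ids(grep_json)
    kept = [b for b in split_sentences(conllu_text) if sentence_id(b) in wanted]
    return "".join(block + "\n\n" for block in kept)


# Toy data: two sentences, of which only s2 was matched by the pattern.
CONLLU = "\n".join([
    "# sent_id = s1", "# text = Hun sover",
    row(1, "Hun", 2, "nsubj"), row(2, "sover", 0, "root"), "",
    "# sent_id = s2", "# text = Han leser",
    row(1, "Han", 2, "nsubj"), row(2, "leser", 0, "root"), "", "",
])
GREP_JSON = json.dumps([{"sent_id": "s2", "matching": {"nodes": {"V": "2"}, "edges": {}}}])

result = extract_matches(CONLLU, GREP_JSON)
print(result, end="")
```

Using a set of ids means a sentence is written once even if the pattern matched it several times.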

Test strategy

grew transform \
  -i  data/sentences/a \
  -o  data/output.conll \
  -grs  rules/teststrategy.grs \
  -strat test \
  -safe_commands

Utilities

python utils/convert_morph.py -f 'data/gullkorpus/2019_gullkorpus_ndt.conllu' -o 'data/gullkorpus/2019_gullkorpus_ndt_udmorph.conllu'
python utils/parse_conllu.py -rc -f data/gullkorpus/2019_gullkorpus_ndt.conllu -o data/gullkorpus/2019_gullkorpus_ndt_uten_hash.conllu
cat $CONVERTED | udapy -s ud.SetSpaceAfterFromText ud.FixPunct ud.FixRightheaded ud.FixLeaf > out.conllu
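
The second command above writes a copy of the treebank without comment lines, which is the form passed to MaltEval. The Python sketch below shows one way such a step can be done. It is an assumption about the effect of the -rc option, not the actual code of utils/parse_conllu.py, and the function name `strip_comments` is hypothetical.

```python
def strip_comments(conllu_text: str) -> str:
    """Drop every line that starts with '#', keeping token lines and blank separators.

    Token lines always start with a numeric ID, so a leading '#' can only
    mark a comment line such as '# sent_id = ...' or '# text = ...'.
    """
    kept = [line for line in conllu_text.splitlines() if not line.startswith("#")]
    return "\n".join(kept) + "\n"


# Toy input: one sentence with two comment lines and two tokens.
token_1 = "\t".join(["1", "Hei", "_", "_", "_", "_", "0", "root", "_", "_"])
token_2 = "\t".join(["2", ".", "_", "_", "_", "_", "1", "punct", "_", "_"])
sample = "\n".join(["# sent_id = s1", "# text = Hei .", token_1, token_2, "", ""])

cleaned = strip_comments(sample)
print(cleaned, end="")
```

The blank line after each sentence is kept on purpose, since it is what separates sentences in the CoNLL-U format.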

References