banditelol / stog

AMR Parsing as Sequence-to-Graph Transduction, a fork of the implementation that uses Stanza instead of CoreNLP
MIT License

Pipeline #1

Open banditelol opened 3 years ago

banditelol commented 3 years ago

AMR Parsing as Sequence-to-Graph Transduction

1. Environment Setup

The code has been tested on Python 3.6 and PyTorch 0.4.1. All other dependencies are listed in requirements.txt.

Via conda:

conda create -n stog python=3.6
source activate stog
pip install -r requirements.txt

2. Data Preparation

Download Artifacts:

./scripts/download_artifacts.sh

This part basically downloads all the pretrained models used by Zhang et al., some tools from earlier research (ChunchuanLv), and utilities from Zhang's own website:

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/scripts/download_artifacts.sh#L6-L22

Assuming that you're working on AMR 2.0 (LDC2017T10), unzip the corpus to data/AMR/LDC2017T10, and make sure it has the following structure:

(stog)$ tree data/AMR/LDC2017T10 -L 2
data/AMR/LDC2017T10
├── data
│   ├── alignments
│   ├── amrs
│   └── frames
├── docs
│   ├── AMR-alignment-format.txt
│   ├── amr-guidelines-v1.2.pdf
│   ├── file.tbl
│   ├── frameset.dtd
│   ├── PropBank-unification-notes.txt
│   └── README.txt
└── index.html

Prepare training/dev/test data:

./scripts/prepare_data.sh -v 2 -p data/AMR/LDC2017T10

The preparation step moves the AMR dataset (depending on which AMR version is used) into the directory layout this codebase expects:

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/scripts/prepare_data.sh#L42-L61

3. Feature Annotation

We use Stanford CoreNLP (version 3.9.2) for lemmatizing, POS tagging, etc.

First, start a CoreNLP server following the API documentation.

Then, annotate AMR sentences:

./scripts/annotate_features.sh data/AMR/amr_2.0

The annotation performed is tokenization, following this process: https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/preprocess/feature_annotator.py#L18-L25

It eventually produces annotations containing the tokens, lemmas, POS tags, and NER tags for each sentence (the ::tokens, ::lemmas, ::pos_tags, and ::ner_tags fields visible in the prediction output further down).
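A minimal sketch of the equivalent annotation step with Stanza, which this fork uses in place of CoreNLP (the processor settings here are an assumption, not the repo's exact feature_annotator.py logic):

import stanza

stanza.download('en')  # one-time model download
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,ner')

doc = nlp('Fleets bumping fishing boats.')
for sent in doc.sentences:
    tokens = [w.text for w in sent.words]
    lemmas = [w.lemma for w in sent.words]
    pos_tags = [w.xpos for w in sent.words]  # PTB-style tags, e.g. NNS
    ner_tags = [t.ner for t in sent.tokens]  # BIO tags, e.g. O
    print(tokens, lemmas, pos_tags, ner_tags)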

4. Data Preprocessing

./scripts/preprocess_2.0.sh

Propbank

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/node_utils.py#L161-L169

5. Training

Make sure that you have at least two GeForce GTX TITAN X GPUs to train the full model.

python -u -m stog.commands.train params/stog_amr_2.0.yaml

6. Prediction

python -u -m stog.commands.predict \
    --archive-file ckpt-amr-2.0 \
    --weights-file ckpt-amr-2.0/best.th \
    --input-file data/AMR/amr_2.0/test.txt.features.preproc \
    --batch-size 32 \
    --use-dataset-reader \
    --cuda-device 0 \
    --output-file test.pred.txt \
    --silent \
    --beam-size 5 \
    --predictor STOG

7. Data Postprocessing

./scripts/postprocess_2.0.sh test.pred.txt

8. Evaluation

Note that the evaluation tool runs on Python 2, so please make sure python2 is visible in your $PATH.

./scripts/compute_smatch.sh test.pred.txt data/AMR/amr_2.0/test.txt

Pre-trained Models

Here are pre-trained models: ckpt-amr-2.0.tar.gz and ckpt-amr-1.0.tar.gz. To use them for prediction, simply download and unzip them, and then run Steps 6-8.

In case that you only need the pre-trained model prediction (i.e., test.pred.txt), you can find it in the download.

Acknowledgements

We adopted some modules or code snippets from AllenNLP, OpenNMT-py and NeuroNLP2. Thanks to these open-source projects!

License

MIT

banditelol commented 3 years ago

How PropBank Is Used

Initializing Node Utils

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/node_utils.py#L252-L257

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/node_utils.py#L161-L167

As can be seen, for every lemma-to-frames mapping in PropBank, a frequency count is kept.

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/node_utils.py#L265

These arguments are then used in the NodeUtilities class.
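A minimal sketch of the two co-occurrence counters being described (the names lemma_frame_counter and frame_lemma_counter are hypothetical; the real logic lives in NodeUtilities):

from collections import Counter, defaultdict

# Hypothetical illustration of the two counters: lemma -> frame frequencies
# and frame -> lemma frequencies.
lemma_frame_counter = defaultdict(Counter)
frame_lemma_counter = defaultdict(Counter)

def update_counters(lemma, frame, count=1):
    # Record that `lemma` was seen invoking `frame` (and vice versa).
    lemma_frame_counter[lemma][frame] += count
    frame_lemma_counter[frame][lemma] += count

update_counters('go-off', 'go-16')
update_counters('go-out', 'go-17')
update_counters('run', 'run-01', count=5)
print(lemma_frame_counter['run'].most_common(1))  # [('run-01', 5)]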

Where is Node Utils used?

In the Sense Remover:

https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/preprocess/sense_remover.py#L124-L132

So the process here is:

Conclusion

A single PropBank XML file (e.g. go.xml) can contain several lemmas (go_off, go_out), each with a different sense (frame). [screenshot]

So there will be mappings go_off → go.16 and go_out → go.17.

So PropBank itself is used to:

  1. Count how often a given lemma corresponds to a given frame
  2. Count how often a given frame corresponds to a given lemma

Both of these counts are used for:

  1. Sense Removal (preprocessing): removes the sense from the frames in the AMR before training. Interestingly, it first checks that the frame can be restored from the resulting lemma; if it cannot (meaning the frame is OOV), the frame only has its sense stripped, e.g. plant-01 → plant.
  2. Sense Restore (postprocessing): restores the sense by taking the frame that corresponds to the predicted lemma. Since one lemma can correspond to many frames, the highest-scoring frame is chosen among all candidates (the score is based on occurrence counts both in PropBank and in the training corpus).
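A minimal sketch of the two operations, assuming frames are frequency-ranked as above (lemma_frame_counter is the hypothetical counter from the earlier sketch):

import re

def remove_sense(frame):
    # Strip the sense suffix: 'plant-01' -> 'plant'.
    return re.sub(r'-\d+$', '', frame)

def restore_sense(lemma, lemma_frame_counter):
    # Pick the highest-scoring frame for a predicted lemma; if the lemma
    # never invokes a known frame (OOV), return it unchanged.
    candidates = lemma_frame_counter.get(lemma)
    if not candidates:
        return lemma
    return candidates.most_common(1)[0][0]

Before removal, the preprocessor would additionally check that restore_sense(remove_sense(frame), ...) gives the frame back; the sketch omits that round-trip check.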
banditelol commented 3 years ago

Run results so far: [screenshot]

This got me thinking: is there already a Python library for AMR?

banditelol commented 3 years ago

For Wednesday: try running that Counter part on the English AMR and trace how the files relate ✅

banditelol commented 3 years ago

Results from the paper:

[screenshot]

Experiment Results

Smatch: [screenshot]

Unlabeled: [screenshot]

No WSD: [screenshot]

Detailed: [screenshot]

banditelol commented 3 years ago

Resources

These are the resources needed to produce the frame_lemma and lemma_frame counters.

Propbank

Only the frame information attached to predicate words is needed.

For now the predicates come from the training data; it remains to check whether validation and testing need PropBank as well.

Which parts of the PropBank structure are actually needed?

Verbalization

Rules that help decide whether a word with an affix, or in a form different from its base form, should be treated as a verb. They are also used to add more frame-lemma information.

This is where the verbalization data gets loaded.
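For reference, a minimal sketch of loading such rules, assuming the AMR verbalization-list format (lines like 'VERBALIZE teacher TO teach-01'); the file name and entries are placeholders:

verbalize = {}
with open('verbalization-list.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.split()
        if len(parts) == 4 and parts[0] in ('VERBALIZE', 'MAYBE-VERBALIZE') and parts[2] == 'TO':
            word, frame = parts[1], parts[3]
            verbalize[word] = frame  # extra frame-lemma evidence for the counters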

Train File

This one looks at the training dataset to collect which frames occur, then strips their senses to turn them into lemmas.

Joints

This contains a list of compound words used during feature annotation (training). It is applied at preprocessing time, mainly to determine the POS tag of phrases that are actually compound words; see the sketch below.
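A minimal sketch of that join step (join_compounds is a hypothetical helper and the contents of joints are illustrative), matching the stir-up / COMP tag visible in the instance dump below:

# Sketch: merge known compound words into single tokens tagged COMP.
joints = {('stirring', 'up'), ('taking', 'off')}  # illustrative entries

def join_compounds(tokens, pos_tags):
    out_tokens, out_tags, i = [], [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i].lower(), tokens[i + 1].lower()) in joints:
            out_tokens.append(tokens[i] + '-' + tokens[i + 1])  # e.g. 'stirring-up'
            out_tags.append('COMP')
            i += 2
        else:
            out_tokens.append(tokens[i])
            out_tags.append(pos_tags[i])
            i += 1
    return out_tokens, out_tags

print(join_compounds(['ghosts', 'stirring', 'up', 'trouble'],
                     ['NNS', 'VBG', 'RP', 'NN']))
# (['ghosts', 'stirring-up', 'trouble'], ['NNS', 'COMP', 'NN'])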

Verbs can come from PropBank; look into whether an Indonesian corpus for this exists.

Model

BERT

Average pooling over the word pieces is used to get the embedding for a single word. This could be handled with IndoBERT.
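A minimal sketch of word-piece average pooling with HuggingFace transformers (the repo bundles its own BERT code, so the model name and API here are assumptions; for Indonesian, an IndoBERT checkpoint would be substituted):

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('bert-base-cased')
bert = AutoModel.from_pretrained('bert-base-cased')

words = ['fleet', 'bump', 'fishing', 'boat']
enc = tok(words, is_split_into_words=True, return_tensors='pt')
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state[0]  # (num_word_pieces, 768)

# Average the word-piece vectors belonging to each original word.
word_vecs = []
for w_idx in range(len(words)):
    rows = [i for i, wid in enumerate(enc.word_ids()) if wid == w_idx]
    word_vecs.append(hidden[rows].mean(dim=0))
print(len(word_vecs), word_vecs[0].shape)  # 4 torch.Size([768])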

GloVe

This could be built from a Wikipedia or Liputan6 corpus first; it is just a matter of choosing either a corpus that matches the use case or a general-purpose one.

POS Tag

Named Entity Anonymization Indicator

CharCNN features

Still not clear on this part.

Inference Process

banditelol commented 3 years ago

The shape of a model input instance:

[2021-04-21 11:55:21,874 INFO] loading archive file ckpt-amr-2.0
[2021-04-21 11:55:21,877 INFO] Loading token dictionary from ckpt-amr-2.0\vocabulary.
[2021-04-21 11:55:21,919 INFO] Building the STOG Model...
[2021-04-21 11:55:21,921 INFO] loading archive file data/bert-base-cased
[2021-04-21 11:55:21,925 INFO] Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 28996
}

[2021-04-21 11:55:27,373 INFO] encoder_token: 18002
[2021-04-21 11:55:27,374 INFO] encoder_chars: 113
[2021-04-21 11:55:27,378 INFO] decoder_token: 12202
[2021-04-21 11:55:27,381 INFO] decoder_chars: 87
[2021-04-21 11:55:32,092 INFO] loading vocabulary file data/bert-base-cased/bert-base-cased-vocab.txt
[2021-04-21 11:55:32,218 INFO] instantiating registered subclass STOG of <class 'stog.predictors.predictor.Predictor'>
0it [00:00, ?it/s][2021-04-21 11:55:32,227 INFO] Reading instances from lines in file at: example.txt.preproc
[2021-04-21 11:55:32,241 INFO] BERT OOV  rate: 0.0000 (0/40)
[2021-04-21 11:55:32,242 INFO] POS tag coverage: 0.6333 (19/30)
1it [00:00, 55.53it/s]
input:  Instance with fields:
     src_tokens: TextField of length 28 with text: 
        [fleet, bump, fishing, boat, ., little, evil, NATIONALITY_1, ghost, stir-up, trouble, and, unrest,
        ., with, heart, of, thief, and, arrogant, form, ,, they, again, show, they, wolfish, appearance]
        and TokenIndexers : {'encoder_tokens': 'SingleIdTokenIndexer', 'encoder_characters': 'TokenCharactersIndexer'} 
     src_token_ids: ArrayField with shape: (40,). 
     src_token_subword_index: ArrayField with shape: (28, 8). 
     src_must_copy_tags: SequenceLabelField of length 28 with labels:
        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        in namespace: 'must_copy_tags'. 
     tgt_tokens: TextField of length 30 with text: 
        [@start@, multi-sentence, bump-01, boat, fish-01, fleet, stir-up, ghost, country, "Japan", name,
        "Japan", little, evil, and, trouble, unrest, show-01, they, appearance, wolfish, they, again, and,
        heart, person, steal, form, arrogance, @end@]
        and TokenIndexers : {'decoder_tokens': 'SingleIdTokenIndexer', 'decoder_characters': 'TokenCharactersIndexer'} 
     src_pos_tags: SequenceLabelField of length 28 with labels:
        ['NNS', 'VBG', 'NN', 'NNS', '.', 'JJ', 'JJ', 'NNP', 'NNS', 'COMP', 'NN', 'CC', 'NN', '.', 'IN',
        'NNS', 'IN', 'NNS', 'CC', 'JJ', 'NN', ',', 'PRP', 'RB', 'VBP', 'PRP$', 'JJ', 'NN']
        in namespace: 'pos_tags'. 
     tgt_pos_tags: SequenceLabelField of length 30 with labels:
        ['@@UNKNOWN@@', '@@UNKNOWN@@', 'VBG', 'NNS', '@@UNKNOWN@@', 'NNS', 'COMP', 'NNS', '@@UNKNOWN@@',
        '@@UNKNOWN@@', '@@UNKNOWN@@', '@@UNKNOWN@@', 'JJ', 'JJ', 'CC', 'NN', 'NN', 'VBP', 'PRP', 'NN', 'JJ',
        'PRP', 'RB', 'CC', 'NNS', '@@UNKNOWN@@', '@@UNKNOWN@@', 'NN', '@@UNKNOWN@@', '@@UNKNOWN@@']
        in namespace: 'pos_tags'. 
     tgt_copy_indices: SequenceLabelField of length 30 with labels:
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0, 0, 0, 0]
        in namespace: 'coref_tags'. 
     tgt_copy_mask: SequenceLabelField of length 30 with labels:
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
        in namespace: 'coref_mask_tags'. 
     tgt_copy_map: AdjacencyField of length 30
        with indices:
        [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11),
        (12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 18),
        (22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29)]

        and labels:
        None
        in namespace: 'labels'. 
     src_copy_indices: SequenceLabelField of length 30 with labels:
        [1, 1, 1, 5, 1, 2, 11, 10, 1, 1, 1, 1, 7, 8, 13, 12, 14, 1, 22, 26, 25, 22, 23, 13, 16, 1, 1, 20, 1,
        1]
        in namespace: 'source_copy_target_tags'. 
     src_copy_map: AdjacencyField of length 30
        with indices:
        [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11,
        12), (12, 13), (13, 14), (14, 6), (15, 15), (16, 16), (17, 17), (18, 18), (19, 13), (20, 19), (21,
        20), (22, 21), (23, 22), (24, 23), (25, 24), (26, 22), (27, 25), (28, 26)]

        and labels:
        None
        in namespace: 'labels'. 
     head_tags: SequenceLabelField of length 30 with labels:
        ['root', 'snt1', 'ARG1', 'purpose', 'ARG2', 'snt2', 'ARG0', 'mod', 'wiki', 'name', 'op1', 'mod',
        'mod', 'ARG1', 'op1', 'op2', 'snt3', 'ARG0', 'ARG1', 'mod', 'poss', 'mod', 'prep-with', 'op1',
        'mod', 'ARG0-of', 'op2', 'mod']
        in namespace: 'head_tags'. 
     head_indices: SequenceLabelField of length 30 with labels:
        [0, 1, 2, 3, 2, 1, 6, 7, 8, 8, 10, 7, 7, 6, 14, 14, 1, 17, 17, 19, 19, 17, 17, 23, 24, 25, 23, 27]
        in namespace: 'head_index_tags'. 
     src_tokens_str: MetadataField (print field.metadata to see specific information). 
     tgt_tokens_str: MetadataField (print field.metadata to see specific information). 
     src_copy_vocab: MetadataField (print field.metadata to see specific information). 
     tag_lut: MetadataField (print field.metadata to see specific information). 
     source_copy_invalid_ids: MetadataField (print field.metadata to see specific information). 
     amr: MetadataField (print field.metadata to see specific information). 

prediction:  # ::id bolt12_64556_5627.3 ::date 2012-12-04T18:01:17 ::annotator SDL-AMR-09 ::preferred
# ::snt Fleets bumping fishing boats. Little evil Japanese ghosts stirring up trouble and unrest. With hearts of thieves and arrogant form, they again show their wolfish appearance
# ::tokens ["Fleets", "bumping", "fishing", "boats", ".", "Little", "evil", "NATIONALITY_1", "ghosts", "stirring-up", "trouble", "and", "unrest", ".", "With", "hearts", "of", "thieves", "and", "arrogant", "form", ",", "they", "again", "show", "their", "wolfish", "appearance"]
# ::lemmas ["fleet", "bump", "fishing", "boat", ".", "little", "evil", "NATIONALITY_1", "ghost", "stir-up", "trouble", "and", "unrest", ".", "with", "heart", "of", "thief", "and", "arrogant", "form", ",", "they", "again", "show", "they", "wolfish", "appearance"]
# ::pos_tags ["NNS", "VBG", "NN", "NNS", ".", "JJ", "JJ", "NNP", "NNS", "COMP", "NN", "CC", "NN", ".", "IN", "NNS", "IN", "NNS", "CC", "JJ", "NN", ",", "PRP", "RB", "VBP", "PRP$", "JJ", "NN"]
# ::ner_tags ["O", "O", "O", "O", "O", "O", "O", "NATIONALITY", "O", "0", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
# ::abstract_map {"NATIONALITY_1": {"type": "named-entity", "span": "Japanese", "ner": "NATIONALITY", "ops": "Japan", "lemma": "Japan"}}
# ::tgt_ref multi-sentence fleet boat @@UNKNOWN@@ bump stir-up ghost evil little country NATIONALITY_1 and trouble-01 unrest show-01 they appear-01 and heart person steal form arrogant
# ::tgt_pred multi-sentence bump-01 boat fish-01 fleet stir-up ghost country "Japan" name "Japan" little evil and trouble unrest show-01 they appearance wolfish they again and heart person steal form arrogance
# ::save-date Sat Jan 10, 2015 ::file bolt12_64556_5627_3.txt
(vv1 / multi-sentence
      :snt1 (vv2 / fleet
            :consist-of (vv3 / boat)
            :mod (vv5 / bump))
      :snt2 (vv6 / stir-up
            :ARG0 (vv7 / ghost
                  :mod (vv8 / evil)
                  :mod (vv9 / little)
                  :mod (vv10 / country
                        :name (vv11 / NATIONALITY_1)))
            :ARG1 (vv12 / and
                  :op1 (vv13 / trouble-01)
                  :op2 (vv14 / unrest)))
      :snt3 (vv15 / show-01
            :ARG0 (vv16 / they)
            :ARG1 (vv17 / appear-01
                  :ARG1 (vv18 / and
                        :op1 (vv19 / heart
                              :part-of (vv20 / person
                                    :ARG0-of (vv21 / steal)))
                        :op2 (vv22 / form
                              :mod (vv23 / arrogant))))))
banditelol commented 3 years ago

PropBank Scope

Counter Usage

This can be seen in the PropBank counter-update section: https://github.com/banditelol/stog/blob/15461d45404eab817b21b43056344041c46d76a3/stog/data/dataset_readers/amr_parsing/node_utils.py#L161-L169

You can see that what gets used is propbank_reader:

propbank_reader
|- lemma_map
   |- lemma
   |- frames
      |- frame_1
      |- ...

Definition of frame_lemma

frame_lemma_set: each frame consists of two parts, a frame lemma and a frame sense, e.g. `run-01`. frame_lemma_set collects all frame lemmas.
lemma_map: besides its frame lemma, a frame can be invoked by other lemmas. This is a dict that maps a lemma to the set of frames it can invoke.
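Illustrative (hypothetical) contents of the two structures, matching the go.xml example earlier in the thread:

from collections import namedtuple

Frame = namedtuple('Frame', ['frame', 'frame_lemma', 'sense'])

frame_lemma_set = {'go', 'run'}  # the lemma part of each frame id
lemma_map = {
    'go':     {Frame('go-01', 'go', '01'), Frame('go-02', 'go', '02')},
    'go-off': {Frame('go-16', 'go', '16')},  # alias lemma -> frame it invokes
    'go-out': {Frame('go-17', 'go', '17')},
}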

Update map

# Get the primary lemma of the frame.
lemma = node.attrib['lemma'].replace('_', '-')  # AMR uses dashes.
for child in node:
    if child.tag == 'roleset':
        # Split sense from frame id; get `frame_lemma` and `sense`.
        frame_id = child.attrib['id']
        if '.' not in frame_id:
            parts = frame_id.split('-')
            if len(parts) == 1:
                frame_lemma = parts[0].replace('_', '-')
                sense = None
            else:
                frame_lemma, sense = parts
        else:
            frame_lemma, sense = frame_id.replace('_', '-').split('.')
        # Get the frame id in AMR convention.
        frame = frame_id.replace('_', '-').replace('.', '-')  # AMR uses dashes.
        # Put them together.
        frame_obj = Frame(frame, frame_lemma, sense)

        # Update the lookup structures.
        self.frame_lemma_set.add(frame_lemma)
        self._update_lemma_map(self.lemma_map, lemma, frame_obj)

        # Aliases (other surface forms) also invoke this frame.
        aliases = child.find('aliases')
        if aliases:
            for alias in aliases.findall('alias'):
                alias_text = alias.text.replace('_', '-')
                if alias_text != frame_lemma and alias_text not in self.lemma_map:
                    self._update_lemma_map(self.lemma_map, alias_text, frame_obj)

Which leaves me curious:

banditelol commented 3 years ago

Questions:

For training, there are rules that get followed via the recategorizer.

For prediction, text_anonymization is used: there is a JSON file containing the anonymization rules, which says which words get anonymized into what.
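A minimal sketch of applying such a rule file, shaped after the ::abstract_map entry in the prediction log above (the file name and rule keys are assumptions):

import json

# Sketch: rules like {"Japanese": {"ner": "NATIONALITY", "ops": "Japan", "lemma": "Japan"}}
with open('text_anonymization_rules.json', encoding='utf-8') as f:
    rules = json.load(f)

def anonymize(tokens):
    # Replace rule-matched tokens with NER placeholders like NATIONALITY_1.
    counts, abstract_map, out = {}, {}, []
    for tok in tokens:
        rule = rules.get(tok)
        if rule is None:
            out.append(tok)
            continue
        ner = rule['ner']
        counts[ner] = counts.get(ner, 0) + 1
        placeholder = '{}_{}'.format(ner, counts[ner])
        abstract_map[placeholder] = {'type': 'named-entity', 'span': tok, **rule}
        out.append(placeholder)
    return out, abstract_map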

banditelol commented 3 years ago

From the advisor: if you come across a paper on PropBank or how it was constructed, give a heads-up in the group.

https://dl.acm.org/doi/pdf/10.5555/1855450.1855454
https://www.aclweb.org/anthology/2020.findings-emnlp.38.pdf
https://github.com/System-T/UniversalPropositions
https://www.google.com/search?q=propbank+framesets+paper&oq=propbank+framesets+paper&aqs=chrome..69i57j0i22i30j69i60.4897j0j7&sourceid=chrome&ie=UTF-8

banditelol commented 3 years ago

2021-06-25

Recategorize vs Anonymization

For training, there are rules that get followed via the recategorizer.

For prediction, text_anonymization is used: there is a JSON file containing the anonymization rules, which says which words get anonymized into what.

Sense Removal

Generate pairs from the nodes

Lemma to Frame and Frame to Lemma

Required data

The PropBank tags that get used:

<!DOCTYPE frameset SYSTEM "frameset.dtd">
<frameset>
  <predicate lemma="make">
    <roleset id="make.01" name="create">
      <aliases>
        <alias framenet="" pos="v" verbnet="">make</alias>
        <alias framenet="" pos="n" verbnet="">make</alias>
        <alias framenet="" pos="n" verbnet="">making</alias>
      </aliases>

Next To Do

banditelol commented 3 years ago

Next To Do

banditelol commented 3 years ago

Additions

banditelol commented 3 years ago

2021-07-14

banditelol commented 3 years ago

2021-07-22

banditelol commented 3 years ago

Training