banditelol opened 3 years ago
propbank_dir is the directory containing the PropBank frames in XML form.
propbank_base_freq is the base frequency value used for PropBank (it determines PropBank's base distribution and acts as its bias).
propbank_bonus is the bonus added each time a PropBank frame is found (discussed later).
As can be seen, for every lemma → frames map in PropBank, a frequency is computed.
These arguments are used in the NodeUtilities class; a sketch follows.
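A minimal sketch of how those three arguments could seed the counters, assuming lemma_map maps each lemma to a set of frame ids (seed_counters and its exact arithmetic are my illustration, not the stog implementation):

from collections import defaultdict

# Illustration only: seed lemma->frame and frame->lemma counters from a
# PropBank-derived lemma_map. `base_freq` is the prior count every pair
# starts with; `bonus` is added whenever a frame is found for the lemma.
def seed_counters(lemma_map, base_freq, bonus):
    lemma_frame_counter = defaultdict(lambda: defaultdict(int))
    frame_lemma_counter = defaultdict(lambda: defaultdict(int))
    for lemma, frames in lemma_map.items():
        for frame in frames:
            lemma_frame_counter[lemma][frame] += base_freq + bonus
            frame_lemma_counter[frame][lemma] += base_freq + bonus
    return lemma_frame_counter, frame_lemma_counter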
So the process here is:
A single PropBank XML (e.g. go.xml) can contain several lemmas (go_off, go_out), each of which has different senses (frames),
so there will be a mapping go_off - go.16
and go_out - go.17.
So PropBank itself is used:
And both pieces of information are used for:
plant-01 → plant
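A tiny illustration of both: the lemma → frames mapping read out of go.xml, and sense removal as in plant-01 → plant (the helper name and regex are mine):

import re

# Strip the numeric sense suffix from an AMR-style frame to recover the
# lemma, e.g. "plant-01" -> "plant", "go-16" -> "go".
def remove_sense(frame):
    return re.sub(r'-\d+$', '', frame)

# The lemma -> frames mapping from go.xml, in the AMR dash convention:
lemma_map = {'go-off': {'go-16'}, 'go-out': {'go-17'}}

assert remove_sense('plant-01') == 'plant'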
Run results so far:
So I'm wondering: is there already a Python library for AMR?
For Wednesday, try running the Counter part on the English AMR and see how the files relate ✅
[ ] Step through node_prediction to see what the POS is used for
[x] Run one sentence from start to finish to understand what the process looks like
[x] There is annotation money for the final-project student: Nisrina, before she graduates, to hold onto for now
[ ] Amany uses this dataset for the synonyms https://github.com/keyreply/Bahasa-Indo-NLP-Dataset
These are the resources needed to produce the frame_lemma and lemma_frame counters.
Only the frame information available for a given predicate word is needed.
For now the predicates are for the training data; do validation and testing need PropBank as well?
Which part of the PropBank structure is actually required?
Rules to help decide whether a word with an affix, or some form different from its base form, should be treated as a verb. These will also be used to add frame - lemma information; a crude sketch follows.
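A crude illustration of such a rule; the prefix list reflects common Indonesian verbal prefixes, but the heuristic itself (maybe_verb) is made up here and does not model full meN- morphophonemics:

# Common Indonesian verbal prefixes; deliberately simplistic.
VERB_PREFIXES = ('me', 'di', 'ber', 'ter')

def maybe_verb(word, base):
    if word == base:
        return True
    for prefix in VERB_PREFIXES:
        if word.startswith(prefix):
            rest = word[len(prefix):]
            # Allow the base's initial consonant to mutate, e.g. memukul/pukul.
            if rest == base or rest.endswith(base[1:]):
                return True
    return False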
This is where the verbalization data goes in.
This one looks at the training dataset to collect which frames occur, then strips their senses to obtain the lemmas.
This contains the list of compound words used in the feature annotation (training) process; it will be used during preprocessing, mainly to determine the actual POS tag of phrases that are really compound words. A sketch follows.
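Roughly, the merging could look like this (merge_compounds and the dash-join convention are my assumptions; compare the stir-up token in the run output below):

# Join adjacent tokens that occur in a compound-word list so that later POS
# tagging sees them as one token (the "-" join mirrors forms like stir-up).
def merge_compounds(tokens, compound_words):
    merged, i = [], 0
    while i < len(tokens):
        pair = f"{tokens[i]}-{tokens[i + 1]}" if i + 1 < len(tokens) else None
        if pair in compound_words:
            merged.append(pair)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged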
Verbs can come from PropBank; check whether there is such a corpus for Indonesian.
Use average pooling over the word pieces to get the embedding of a single word. This can be handled with IndoBERT; see the sketch below.
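A minimal sketch of that average pooling with the transformers library (indobenchmark/indobert-base-p1 is one published IndoBERT checkpoint; treat the specifics as my assumptions, not stog's implementation):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('indobenchmark/indobert-base-p1')
model = AutoModel.from_pretrained('indobenchmark/indobert-base-p1')

def word_embeddings(words):
    # Tokenize pre-split words so each word piece remembers its word index.
    enc = tokenizer(words, is_split_into_words=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (num_pieces, hidden_size)
    vectors = []
    for word_idx in range(len(words)):
        piece_rows = [i for i, w in enumerate(enc.word_ids()) if w == word_idx]
        vectors.append(hidden[piece_rows].mean(dim=0))  # average pooling
    return torch.stack(vectors)  # (num_words, hidden_size)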
This can first be built from a Wikipedia or Liputan6 corpus; it's just a matter of picking a corpus that matches the use case, or a general-purpose one.
This part I still don't understand.
[2021-04-21 11:55:21,874 INFO] loading archive file ckpt-amr-2.0
[2021-04-21 11:55:21,877 INFO] Loading token dictionary from ckpt-amr-2.0\vocabulary.
[2021-04-21 11:55:21,919 INFO] Building the STOG Model...
[2021-04-21 11:55:21,921 INFO] loading archive file data/bert-base-cased
[2021-04-21 11:55:21,925 INFO] Model config {
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 28996
}
[2021-04-21 11:55:27,373 INFO] encoder_token: 18002
[2021-04-21 11:55:27,374 INFO] encoder_chars: 113
[2021-04-21 11:55:27,378 INFO] decoder_token: 12202
[2021-04-21 11:55:27,381 INFO] decoder_chars: 87
[2021-04-21 11:55:32,092 INFO] loading vocabulary file data/bert-base-cased/bert-base-cased-vocab.txt
[2021-04-21 11:55:32,218 INFO] instantiating registered subclass STOG of <class 'stog.predictors.predictor.Predictor'>
0it [00:00, ?it/s][2021-04-21 11:55:32,227 INFO] Reading instances from lines in file at: example.txt.preproc
[2021-04-21 11:55:32,241 INFO] BERT OOV rate: 0.0000 (0/40)
[2021-04-21 11:55:32,242 INFO] POS tag coverage: 0.6333 (19/30)
1it [00:00, 55.53it/s]
input: Instance with fields:
src_tokens: TextField of length 28 with text:
[fleet, bump, fishing, boat, ., little, evil, NATIONALITY_1, ghost, stir-up, trouble, and, unrest,
., with, heart, of, thief, and, arrogant, form, ,, they, again, show, they, wolfish, appearance]
and TokenIndexers : {'encoder_tokens': 'SingleIdTokenIndexer', 'encoder_characters': 'TokenCharactersIndexer'}
src_token_ids: ArrayField with shape: (40,).
src_token_subword_index: ArrayField with shape: (28, 8).
src_must_copy_tags: SequenceLabelField of length 28 with labels:
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
in namespace: 'must_copy_tags'.
tgt_tokens: TextField of length 30 with text:
[@start@, multi-sentence, bump-01, boat, fish-01, fleet, stir-up, ghost, country, "Japan", name,
"Japan", little, evil, and, trouble, unrest, show-01, they, appearance, wolfish, they, again, and,
heart, person, steal, form, arrogance, @end@]
and TokenIndexers : {'decoder_tokens': 'SingleIdTokenIndexer', 'decoder_characters': 'TokenCharactersIndexer'}
src_pos_tags: SequenceLabelField of length 28 with labels:
['NNS', 'VBG', 'NN', 'NNS', '.', 'JJ', 'JJ', 'NNP', 'NNS', 'COMP', 'NN', 'CC', 'NN', '.', 'IN',
'NNS', 'IN', 'NNS', 'CC', 'JJ', 'NN', ',', 'PRP', 'RB', 'VBP', 'PRP$', 'JJ', 'NN']
in namespace: 'pos_tags'.
tgt_pos_tags: SequenceLabelField of length 30 with labels:
['@@UNKNOWN@@', '@@UNKNOWN@@', 'VBG', 'NNS', '@@UNKNOWN@@', 'NNS', 'COMP', 'NNS', '@@UNKNOWN@@',
'@@UNKNOWN@@', '@@UNKNOWN@@', '@@UNKNOWN@@', 'JJ', 'JJ', 'CC', 'NN', 'NN', 'VBP', 'PRP', 'NN', 'JJ',
'PRP', 'RB', 'CC', 'NNS', '@@UNKNOWN@@', '@@UNKNOWN@@', 'NN', '@@UNKNOWN@@', '@@UNKNOWN@@']
in namespace: 'pos_tags'.
tgt_copy_indices: SequenceLabelField of length 30 with labels:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 18, 0, 0, 0, 0, 0, 0, 0, 0]
in namespace: 'coref_tags'.
tgt_copy_mask: SequenceLabelField of length 30 with labels:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]
in namespace: 'coref_mask_tags'.
tgt_copy_map: AdjacencyField of length 30
with indices:
[(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11),
(12, 12), (13, 13), (14, 14), (15, 15), (16, 16), (17, 17), (18, 18), (19, 19), (20, 20), (21, 18),
(22, 22), (23, 23), (24, 24), (25, 25), (26, 26), (27, 27), (28, 28), (29, 29)]
and labels:
None
in namespace: 'labels'.
src_copy_indices: SequenceLabelField of length 30 with labels:
[1, 1, 1, 5, 1, 2, 11, 10, 1, 1, 1, 1, 7, 8, 13, 12, 14, 1, 22, 26, 25, 22, 23, 13, 16, 1, 1, 20, 1,
1]
in namespace: 'source_copy_target_tags'.
src_copy_map: AdjacencyField of length 30
with indices:
[(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10), (10, 11), (11,
12), (12, 13), (13, 14), (14, 6), (15, 15), (16, 16), (17, 17), (18, 18), (19, 13), (20, 19), (21,
20), (22, 21), (23, 22), (24, 23), (25, 24), (26, 22), (27, 25), (28, 26)]
and labels:
None
in namespace: 'labels'.
head_tags: SequenceLabelField of length 30 with labels:
['root', 'snt1', 'ARG1', 'purpose', 'ARG2', 'snt2', 'ARG0', 'mod', 'wiki', 'name', 'op1', 'mod',
'mod', 'ARG1', 'op1', 'op2', 'snt3', 'ARG0', 'ARG1', 'mod', 'poss', 'mod', 'prep-with', 'op1',
'mod', 'ARG0-of', 'op2', 'mod']
in namespace: 'head_tags'.
head_indices: SequenceLabelField of length 30 with labels:
[0, 1, 2, 3, 2, 1, 6, 7, 8, 8, 10, 7, 7, 6, 14, 14, 1, 17, 17, 19, 19, 17, 17, 23, 24, 25, 23, 27]
in namespace: 'head_index_tags'.
src_tokens_str: MetadataField (print field.metadata to see specific information).
tgt_tokens_str: MetadataField (print field.metadata to see specific information).
src_copy_vocab: MetadataField (print field.metadata to see specific information).
tag_lut: MetadataField (print field.metadata to see specific information).
source_copy_invalid_ids: MetadataField (print field.metadata to see specific information).
amr: MetadataField (print field.metadata to see specific information).
prediction: # ::id bolt12_64556_5627.3 ::date 2012-12-04T18:01:17 ::annotator SDL-AMR-09 ::preferred
# ::snt Fleets bumping fishing boats. Little evil Japanese ghosts stirring up trouble and unrest. With hearts of thieves and arrogant form, they again show their wolfish appearance
# ::tokens ["Fleets", "bumping", "fishing", "boats", ".", "Little", "evil", "NATIONALITY_1", "ghosts", "stirring-up", "trouble", "and", "unrest", ".", "With", "hearts", "of", "thieves", "and", "arrogant", "form", ",", "they", "again", "show", "their", "wolfish", "appearance"]
# ::lemmas ["fleet", "bump", "fishing", "boat", ".", "little", "evil", "NATIONALITY_1", "ghost", "stir-up", "trouble", "and", "unrest", ".", "with", "heart", "of", "thief", "and", "arrogant", "form", ",", "they", "again", "show", "they", "wolfish", "appearance"]
# ::pos_tags ["NNS", "VBG", "NN", "NNS", ".", "JJ", "JJ", "NNP", "NNS", "COMP", "NN", "CC", "NN", ".", "IN", "NNS", "IN", "NNS", "CC", "JJ", "NN", ",", "PRP", "RB", "VBP", "PRP$", "JJ", "NN"]
# ::ner_tags ["O", "O", "O", "O", "O", "O", "O", "NATIONALITY", "O", "0", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O", "O"]
# ::abstract_map {"NATIONALITY_1": {"type": "named-entity", "span": "Japanese", "ner": "NATIONALITY", "ops": "Japan", "lemma": "Japan"}}
# ::tgt_ref multi-sentence fleet boat @@UNKNOWN@@ bump stir-up ghost evil little country NATIONALITY_1 and trouble-01 unrest show-01 they appear-01 and heart person steal form arrogant
# ::tgt_pred multi-sentence bump-01 boat fish-01 fleet stir-up ghost country "Japan" name "Japan" little evil and trouble unrest show-01 they appearance wolfish they again and heart person steal form arrogance
# ::save-date Sat Jan 10, 2015 ::file bolt12_64556_5627_3.txt
(vv1 / multi-sentence
:snt1 (vv2 / fleet
:consist-of (vv3 / boat)
:mod (vv5 / bump))
:snt2 (vv6 / stir-up
:ARG0 (vv7 / ghost
:mod (vv8 / evil)
:mod (vv9 / little)
:mod (vv10 / country
:name (vv11 / NATIONALITY_1)))
:ARG1 (vv12 / and
:op1 (vv13 / trouble-01)
:op2 (vv14 / unrest)))
:snt3 (vv15 / show-01
:ARG0 (vv16 / they)
:ARG1 (vv17 / appear-01
:ARG1 (vv18 / and
:op1 (vv19 / heart
:part-of (vv20 / person
:ARG0-of (vv21 / steal)))
:op2 (vv22 / form
:mod (vv23 / arrogant))))))
This can be seen in the PropBank counter-update part: https://github.com/banditelol/stog/blob/15461d45404eab817b21b43056344041c46d76a3/stog/data/dataset_readers/amr_parsing/node_utils.py#L161-L169
You can see that what gets used is propbank_reader:
propbank_reader
|- lemma_map
   |- lemma
      |- frames
         |- frame_1
         |- ...
frame_lemma_set: each frame consists of two parts, frame lemma and frame sense, e.g., `run-01`. frame_lemma_set collects all frame lemmas.
lemma_map: besides frame lemmas, a frame could be invoked by other lemmas. Here we build a dict that maps a lemma to a set of frames it could invoke.
# Get the primary lemma of the frame.
lemma = node.attrib['lemma'].replace('_', '-')  # AMR uses dashes.
for child in node:
    if child.tag == 'roleset':
        # Split the sense off the frame id to get `frame_lemma` and `sense`.
        frame_id = child.attrib['id']
        if '.' not in frame_id:
            parts = frame_id.split('-')
            if len(parts) == 1:
                frame_lemma = parts[0].replace('_', '-')
                sense = None
            else:
                frame_lemma, sense = parts
        else:
            frame_lemma, sense = frame_id.replace('_', '-').split('.')
        # Get the frame id in the AMR convention.
        frame = frame_id.replace('_', '-').replace('.', '-')  # AMR uses dashes.
        # Put them together.
        frame_obj = Frame(frame, frame_lemma, sense)
        # Update the lookup structures.
        self.frame_lemma_set.add(frame_lemma)
        self._update_lemma_map(self.lemma_map, lemma, frame_obj)
        aliases = child.find('aliases')
        if aliases is not None:  # ElementTree elements are falsy when empty.
            for alias in aliases.findall('alias'):
                alias_text = alias.text.replace('_', '-')
                if alias_text != frame_lemma and alias_text not in self.lemma_map:
                    self._update_lemma_map(self.lemma_map, alias_text, frame_obj)
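Frame above is a small record of the frame id, its lemma part, and its sense; a namedtuple of the same shape (my reconstruction, not necessarily the repo's exact definition) would be:

from collections import namedtuple

# (frame, lemma, sense), e.g. Frame('go-16', 'go', '16').
Frame = namedtuple('Frame', ('frame', 'lemma', 'sense'))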
So I'm curious:
Questions:
For training, there are rules that are followed via the recategorizer.
For prediction, text_anonymization is used; there is a file containing the anonymization rules, a JSON that says which words get anonymized into what.
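For instance, the ::abstract_map in the prediction above amounts to an entry like this:

# One anonymization entry, mirroring the ::abstract_map in the prediction:
abstract_map = {
    'NATIONALITY_1': {
        'type': 'named-entity',
        'span': 'Japanese',
        'ner': 'NATIONALITY',
        'ops': 'Japan',
        'lemma': 'Japan',
    }
}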
[ ] Sense removal doesn't kick in for some sentences; try testing on just one sentence
[ ] How to generate the pairs from the nodes once we have the nodes
[ ] Compute smatch of target vs. reference (see the sketch below)
[x] Lemma-Frame counter
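For the smatch item, a sketch of the usual invocation of the reference scorer (file names are placeholders; per the README below, the tool needs python2):

import subprocess

# smatch.py's -f flag takes the predicted and gold AMR files.
subprocess.run(['python2', 'smatch.py', '-f', 'test.pred.txt', 'test.gold.txt'])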
From Ibu: if you find a paper about PropBank or how it was constructed, give a heads up in the group.
https://dl.acm.org/doi/pdf/10.5555/1855450.1855454
https://www.aclweb.org/anthology/2020.findings-emnlp.38.pdf
https://github.com/System-T/UniversalPropositions
<!DOCTYPE frameset SYSTEM "frameset.dtd">
<frameset>
  <predicate lemma="make">
    <roleset id="make.01" name="create">
      <aliases>
        <alias framenet="" pos="v" verbnet="">make</alias>
        <alias framenet="" pos="n" verbnet="">make</alias>
        <alias framenet="" pos="n" verbnet="">making</alias>
      </aliases>
AMR Parsing as Sequence-to-Graph Transduction
1. Environment Setup
The code has been tested on Python 3.6 and PyTorch 0.4.1. All other dependencies are listed in requirements.txt.
Via conda:
2. Data Preparation
Download Artifacts:
https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/scripts/download_artifacts.sh#L6-L22
Assuming that you're working on AMR 2.0 (LDC2017T10), unzip the corpus to data/AMR/LDC2017T10, and make sure it has the directory structure that the preparation script expects.
Prepare training/dev/test data:
https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/scripts/prepare_data.sh#L42-L61
3. Feature Annotation
We use Stanford CoreNLP (version 3.9.2) for lemmatizing, POS tagging, etc.
First, start a CoreNLP server following the API documentation.
Then, annotate AMR sentences:
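As an illustration, annotating a sentence against a running server looks roughly like this (the endpoint and annotator list are the standard CoreNLP server API; the exact properties stog passes are not shown here):

import json
import requests

# Ask a local CoreNLP server (default port 9000) for the annotations the
# parser's features need: tokens, lemmas, POS tags, and NER.
props = {'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'json'}
resp = requests.post('http://localhost:9000',
                     params={'properties': json.dumps(props)},
                     data='Fleets bumping fishing boats.'.encode('utf-8'))
annotation = resp.json()
print([t['pos'] for t in annotation['sentences'][0]['tokens']])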
4. Data Preprocessing
Propbank
https://github.com/banditelol/stog/blob/ff870535d344152091d791378e0ec4387c8f7573/stog/data/dataset_readers/amr_parsing/node_utils.py#L161-L169
5. Training
Make sure that you have at least two GeForce GTX TITAN X GPUs to train the full model.
6. Prediction
7. Data Postprocessing
8. Evaluation
Note that the evaluation tool works on python2, so please make sure python2 is visible in your $PATH.
Pre-trained Models
Here are the pre-trained models: ckpt-amr-2.0.tar.gz and ckpt-amr-1.0.tar.gz. To use them for prediction, simply download and unzip them, then run Steps 6-8.
In case you only need the pre-trained model prediction (i.e., test.pred.txt), you can find it in the download.
Acknowledgements
We adopted some modules or code snippets from AllenNLP, OpenNMT-py and NeuroNLP2. Thanks to these open-source projects!
License
MIT