allenai / kb

KnowBert -- Knowledge Enhanced Contextual Word Representations

Can't seem to replicate perplexity for KnowBert-Wiki & KnowBert-W+W #38

Closed gpiat closed 2 years ago

gpiat commented 2 years ago

Hello AllenAI team,

First of all, I'd like to say I find this approach brilliant and fascinating and I would like to thank you for your work.

I have been experimenting with KnowBert for a while now, and I seem to regularly run into perplexity issues with my custom variations of KnowBert. Without access to the original "pre-training" Wikipedia+books corpus (or, as I have come to call it, the "re-training" corpus, so as not to confuse the "pre-training" phase that integrates the KAR into BERT with the actual pre-training of BERT itself), I had not gotten around to attempting to replicate the perplexity results reported in your paper for freshly (p)retrained KnowBert models.

I finally decided to attempt to replicate the results by gathering my own version of the Wikipedia + books corpus. I have subsequently trained the KARs and (p)retrained the KnowBert-Wordnet, KnowBert-Wiki, and KnowBert-W+W models using slightly modified versions of the JSONNET files in training_config/pretraining/ (the only modifications being changing URLs to local paths pointing to pre-downloaded files in order to run training offline).

I ran training with the following instruction:

allennlp train -s $OUTPUT_DIRECTORY --file-friendly-logging --include-package kb.include_all training_config/pretraining/knowbert_<variant[_linker]>_offline.jsonnet

I then evaluated perplexity of the retrained models on a heldout Wiki+Books shard using:

python bin/evaluate_perplexity.py -m  </path/to/file>/model.tar.gz -e </path/to/file>/shard_heldout.txt

I also fine-tuned the models on an NER task to see what the impact of high perplexity is on downstream tasks.

Here are the perplexity and NER F1 results for each version of KnowBert (my reproductions alongside the values reported in your paper):

| Model | PPL | NER F1 |
| --- | --- | --- |
| KnowBert-Wordnet (mine) | 4.8 | 0.84 |
| KnowBert-Wordnet (AllenAI) | 4.1 | N/A |
| KnowBert-Wiki (mine) | 27,833.6 | 0.00 |
| KnowBert-Wiki (AllenAI) | 4.3 | N/A |
| KnowBert-W+W (mine) | 13,760.4 | 0.80 |
| KnowBert-W+W (AllenAI) | 3.5 | N/A |

As you can see, the perplexity and performance of my reproduction of KnowBert-Wordnet are consistent with the paper, but I cannot seem to get the Wiki and W+W variants to behave as expected. The perplexities of KnowBert-Wiki and KnowBert-W+W are actually more consistent with those of my custom KnowBert models. The NER performance is also puzzling, as the fine-tuned KnowBert-Wiki did not produce a single true positive in evaluation (it did, however, produce a few in validation).

My criterion for stopping the (p)retraining is running out of allocated computation time, i.e. 7 days of training on a cluster equipped with Nvidia V100 GPUs.

Can you discern anything I might be doing incorrectly? I haven't re-run the experiments many times, given how computationally expensive they are. I don't think my Wikipedia KAR is simply stuck in a really bad local optimum, given its end-of-training report (see below).

I would greatly appreciate any help or pointers to figure out what the source of my inability to replicate your results might be. Thank you for your time, and I wish you continued success in your ongoing endeavours. -- Guy

Annex: Wikipedia KAR end-of-training report excerpt:

```
  "training_wiki_el_recall": 0.9242674921278753,
  "training_wiki_el_f1": 0.9535818512195965,
  "training_wiki_span_precision": 0.9903611032129656,
  "training_wiki_span_recall": 0.9294710999626408,
  "training_wiki_span_f1": 0.9589504983205271,
  "training_loss": 0.00021481883407398835,
```

matt-peters commented 2 years ago

Glad to hear you have found the code useful and are building custom versions of KnowBert. It's hard to say exactly what the problem is, but here are some ideas to help isolate it. Perplexity values of 10,000+ indicate a bug somewhere, and it isn't necessary to run the model to completion to debug it. After starting training, the training loss should begin to decrease almost immediately, or at least should not increase much. If it begins to increase rapidly, CTRL-C the run, as it won't recover. This check shouldn't take much time, and you can run it on a single GPU or on multiple GPUs.

Here are the steps I'd take to help isolate the problem. If you provide more details on the results of these we can narrow it down further.

  • First reproduce the published results with your data by computing perplexity with KnowBert+Wiki with the released model.
  • Start training the released KnowBert W+W starting from the released checkpoint. Training should be stable without significantly increasing the loss.
  • Substitute your trained KAR and start fine tuning.

gpiat commented 2 years ago

Thank you very much for your quick answer, and apologies for taking so much time to get back to you.

  • First reproduce the published results with your data by computing perplexity with KnowBert+Wiki with the released model.

Your version of KnowBert-Wiki achieves a PPL of 10.37 (lm_loss_wgt ~= 2.337) on my data. I interpret this as my data being slightly unexpected, but not fundamentally broken. Would you agree? Here's a comparison of the formatting of our respective heldout corpus shards:
Yours (39 MB):

0        Traditionally , Switzerland has avoided alliances that might entail military , political , or direct economic action . Only in recent years have the Swiss broadened the scope of activities in which they feel able to participate without compromising their neutrality .        Switzerland is not a member of the European Union and joined the United Nations very late compared to its European neighbors .

Mine (500 MB):

1        Bhagirathpur is a village and a gram panchayat in the Domkal CD block in the Domkol subdivision of Murshidabad district in the state of West Bengal, India.        He served in the New York State Legislature as an assemblyman from Dutchess County from 1785 to 1787 and 1788 to 1790. Photo, Col Jacob Griffin 1730-1800 (said to represent), Frick Art Reference Library/Frick Digital Collection. Date unknown. Artist unknown: American School.

The most prominent difference I see is how we handle punctuation, as each of your commas and periods are preceded by a space. Do you think this could be the source of the discrepancy? How do you decide where to insert spaces?

I decided to check the model's PPL on your heldout shard as provided in the README; here are the results:

duplicate_mentions_cnt:  6777
end of p_e_m reading. wall time: 1.1009117603302  minutes
p_e_m_errors:  0
incompatible_ent_ids:  0
{'total_loss_ema': 0.0, 'nsp_loss_ema': 0.0, 'lm_loss_ema': 0.0, 'total_loss': 0, 'nsp_loss': 0, 'lm_loss': 0, 'lm_loss_wgt': 0, 'mrr': 0.0, 'nsp_accuracy': 0.0, 'wiki_el_precision': 0.0, 'wiki_el_recall': 0.0, 'wiki_el_f1': 0.0, 'wiki_span_precision': 0.0, 'wiki_span_recall': 0.0, 'wiki_span_f1': 0.0}

What strikes me is that all the values in the dictionary are 0. Is this normal? I believe I am running your code as-is, my only intervention being to make the model run offline without attempting to resolve URLs.

  • Start training the released KnowBert W+W starting from the released checkpoint.

I assumed that by "checkpoint" you meant the model provided in the Pretrained Models section of the README.

Training should be stable without significantly increasing the loss.

The training loss is stable and does not significantly increase; here are the kinds of values I see getting logged:

total_loss = nsp_loss; varies in range [0.0045, 0.0048]
total_loss_ema = nsp_loss_ema; varies in range [0.0003, 0.0907]
lm_loss_ema = lm_loss = lm_loss_wgt = mrr = 0.0000

Is it expected for the lm_losses to be 0?

  • Substitute your trained KAR and start fine tuning.

I'm not certain how to go about replacing my KAR as you suggest. I can load the two models like this:

from allennlp.models.archival import load_archive
from kb import include_all  # importing this registers the KnowBert classes with AllenNLP
from allennlp.common.util import prepare_environment

# the released KnowBert-Wiki archive and my separately trained Wikipedia KAR archive
archive_file_wiki = "<path to model>"
archive_file_kar = "<path to model>"

cuda_device = 0
overrides = ""
weights_file = ""

wiki_archive = load_archive(archive_file_wiki, cuda_device, overrides, weights_file)
kar_archive = load_archive(archive_file_kar, cuda_device, overrides, weights_file)
prepare_environment(wiki_archive.config)
wiki_model = wiki_archive.model
kar_model = kar_archive.model

My intuition (which is probably wrong) would then be to combine the models like this:

wiki_model._modules['wiki_soldered_kg'] = kar_model._modules['wiki_soldered_kg']
wiki_model.soldered_kgs = kar_model.soldered_kgs

But then I don't understand how to save the model as a file. I've looked at the allennlp.models.archival.archive_model function, but I don't understand how I'm supposed to tell it what model to archive. Would you mind helping with that part?
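Reading the archival code, my best guess (completely untested, and the directory and file names below are just placeholders of mine) is that archive_model expects a serialization directory containing config.json, a weights file, and a vocabulary/ directory, so something like this might do the trick:

```python
import os
import torch
from allennlp.models.archival import archive_model

serialization_dir = "combined_model"  # placeholder output directory
os.makedirs(serialization_dir, exist_ok=True)

# re-use the config and vocabulary from the released KnowBert-Wiki archive
wiki_archive.config.to_file(os.path.join(serialization_dir, "config.json"))
wiki_model.vocab.save_to_files(os.path.join(serialization_dir, "vocabulary"))

# save the weights of the model whose KAR has been swapped in
torch.save(wiki_model.state_dict(), os.path.join(serialization_dir, "weights.th"))

# package config.json + weights.th + vocabulary/ into model.tar.gz inside serialization_dir
archive_model(serialization_dir, weights="weights.th")
```

Is that roughly what you had in mind, or is there a more idiomatic way to do it?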

Again, I'd like to thank you for your assistance, it's been invaluable.

gpiat commented 2 years ago

As a follow-up to my previous post, I have re-processed my heldout shard so that it more closely matches yours. To do this, I word-tokenized each sentence using spaCy, and then put a single space between each token. Here is a sample of the resulting output:

1       Bhagirathpur is a village and a gram panchayat in the Domkal CD block in the Domkol subdivision of Murshidabad district in the state of West Bengal , India .   He served in the New York State Legislature as an assemblyman from Dutchess County from 1785 to 1787 and 1788 to 1790 . Photo , Col Jacob Griffin 1730 - 1800 ( said to represent ) , Frick Art Reference Library / Frick Digital Collection . Date unknown . Artist unknown : American School .
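Concretely, the re-processing amounts to the following (a minimal sketch of what I ran; the tab-separated <label>, <segment A>, <segment B> layout is my reading of the shard format, and the spaCy model name is just the one I had installed):

```python
import spacy

# only the tokenizer is needed, so the heavier pipeline components are disabled
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])

def retokenize(text: str) -> str:
    """Split text into spaCy tokens and re-join them with single spaces."""
    return " ".join(token.text for token in nlp(text))

def reprocess_shard_line(line: str) -> str:
    # each shard line appears to be: <label>\t<segment A>\t<segment B>
    label, seg_a, seg_b = line.rstrip("\n").split("\t")
    return "\t".join([label, retokenize(seg_a), retokenize(seg_b)])
```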

I then evaluated the perplexity of your trained version of KnowBert-Wiki on this shard, which resulted in a lower perplexity of 4.90, which is more in line with the results in your paper than my previous value of 10.37.

Although I doubt this will change much, I am going to re-process my entire corpus to be more in line with yours, hoping this will close some of the gap between my replication of your work and your models. Do you know if my method with spaCy is correct? Or should I be using a different tokenizer to be closer to your corpus?

I think the fact that the LM loss terms are 0 is abnormal, and likely related to why I can't seem to replicate your results. My hypothesis is that the LM loss is somehow not being optimized at all. I am currently investigating why this occurs.

gpiat commented 2 years ago

Here's an overdue update on the situation: as it turns out, the glob module wasn't parsing the path of my corpus properly in some cases. Rather than failing, this was, from the point of view of the training script, indistinguishable from having finished MLM training. The training process then carried on normally without optimizing the MLM objective. The LM loss was thus unconstrained, and with this degree of freedom, the model unlearned language modeling. This behavior can also occur if there is not enough re-training data compared to the Entity Linking data. The lack of any error messages and the fact that the dataset loading is done in a different process made figuring out what was going on quite tricky (see the small illustration at the end of this comment).

I would like to be able to report that everything works now; unfortunately, I have recently had general trouble with multiprocessing, running into issues similar to these: [1] [2]. However, in my case, none of the proposed solutions have been helpful. I am currently attempting to do away with dataset-related multiprocessing altogether. I will try to update this thread as the situation develops.
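For the record, the glob failure mode looks like this (the path and pattern are made up for the illustration): a pattern that matches nothing just yields an empty list, so the reader sees "no more MLM shards" rather than an error.

```python
import glob

# a mistyped or unexpanded pattern silently matches nothing...
shard_paths = glob.glob("/path/with/typo/shard_*.txt")
print(shard_paths)  # -> []  (no exception, so training simply proceeds without MLM data)

# ...whereas a cheap guard would have surfaced the problem immediately:
assert shard_paths, "glob matched no corpus shards -- check the pattern"
```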

gpiat commented 2 years ago

Hi again,

So I managed to reproduce the results, but had to make some not insignificant changes. Here are my perplexity results:

| Model | My PPL | AllenAI's PPL |
| --- | --- | --- |
| KnowBert-Wordnet | 4.8 | 4.1 |
| KnowBert-Wiki | 5.6 | 4.3 |
| KnowBert-W+W | 4.1 | 3.5 |

Overall, my reproductions have slightly worse PPL, which may be due to the different corpus or less training time. In any case, performance is close enough that I'm considering this solved.

In the end, I had two main issues to deal with: properly specifying my training data and local files instead of URLs in the JSONNET config files (due to my computing cluster being offline), and the multiprocess dataset reader which for some reason refuses to work for me (see issues #27 and allenai/allennlp#4847).

Attached is an archive with my offline single-process config files, a custom single-process sharded dataset reader, and an updated include_all.py, all organized in the same directory tree structure as this project. With this, anyone with the same constraints as me should be able to reproduce the results. Of course, anyone attempting this will have to manually download all the files and may need (or prefer) to use different local file paths in the configuration files. So bear in mind that this is not strictly plug-and-play, but you should only be a few curls and seds away from getting everything working.

offline_singleprocess_config.tar.gz
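For anyone who would rather not unpack the archive just to see the idea, the custom reader is conceptually along these lines (a rough sketch only, not the attached code; the class name, registration name, and constructor arguments are placeholders of mine):

```python
import glob
from allennlp.data.dataset_readers.dataset_reader import DatasetReader

@DatasetReader.register("single_process_sharded")
class SingleProcessShardedDatasetReader(DatasetReader):
    """Reads every shard matching a glob pattern sequentially, in one process,
    instead of going through the multiprocess reader."""

    def __init__(self, base_reader: DatasetReader, lazy: bool = True) -> None:
        super().__init__(lazy=lazy)
        self._base_reader = base_reader

    def _read(self, file_path: str):
        shard_paths = sorted(glob.glob(file_path))
        # fail loudly instead of silently skipping the MLM data
        assert shard_paths, f"glob matched no shards for pattern {file_path}"
        for shard in shard_paths:
            yield from self._base_reader.read(shard)

    def text_to_instance(self, *args, **kwargs):
        return self._base_reader.text_to_instance(*args, **kwargs)
```

The point is simply to iterate the shards in-process and fail loudly if the glob matches nothing.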

Best of luck to anyone picking this up in the future, and many thanks to all involved in this project.