dave-s477 / SoMeNLP

Information Extraction for Software Mentions in Scientific articles
MIT License

Missing "encoding.json" when trying to use the bin/predict #5

Open PetrosStav opened 1 year ago

PetrosStav commented 1 year ago

Hello @dave-s477 ,

I'm trying to run the bin/predict function as suggested by you in this issue https://github.com/dave-s477/SoMeNLP/issues/4

using the pretrained checkpoint provided here: https://zenodo.org/record/7400022/files/M_SB_sw_info_opt.pth?download=1

I have edited pred_multi_opt2_SciBERT.json to include the correct paths to the checkpoint and the SciBERT tokenizer.

However, when I try to run it, I get the following error message:

File "C:\Users\Petros\Desktop\testing\somenlp_test\predictions.py", line 101, in <module>
  predict(model_config, all_files, device, args.prepro, args.bio_pred, args.sum_pred)
File "C:\Users\Petros\Desktop\testing\somenlp_test\SoMeNLP\somenlp\NER\run_model.py", line 45, in predict
  data_handler.encoding()
File "C:\Users\Petros\Desktop\testing\somenlp_test\SoMeNLP\somenlp\NER\data_handler.py", line 165, in encoding
  self.encoding = self.output_handler.load_encoding()
File "C:\Users\Petros\Desktop\testing\somenlp_test\SoMeNLP\somenlp\NER\output_handler.py", line 33, in load_encoding
  with open('{}/encoding.json'.format(self.save_dir), 'r') as json_file:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Petros\\Desktop\\testing\\somenlp_test\\save_dir/encoding.json'

Looking at the code, when a checkpoint is given, it looks for this encoding.json file, which as far as I can tell is not provided.

Here is the actual code snippet in NER/data_handler.py:

def encoding(self, tags_only=False):
    if self.checkpoint is not None and self.checkpoint['model']:
        print("Loading given encodings")
        self.encoding = self.output_handler.load_encoding()
        for k, v in self.encoding.items():
            if self.multi_task_mapping and k.endswith('tag2name'):
                for sk, vk in v.items():
                    vk_new = {int(key): value for key, value in vk.items()}
                    self.encoding[k][sk] = vk_new
            elif k.endswith('name'):
                v_new = {int(key): value for key, value in v.items()}
                self.encoding[k] = v_new

and in NER/output_handler.py:

def load_encoding(self):
    with open('{}/encoding.json'.format(self.save_dir), 'r') as json_file:
        encoding_dict = json.load(json_file)
    return encoding_dict
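
The int(key) conversions in the encoding method quoted above are presumably needed because JSON object keys are always strings, so an index-to-name mapping saved with integer keys comes back with string keys. A minimal sketch of that round trip, using a made-up tag2name dictionary rather than the contents of the real encoding.json:

```python
import json

# Hypothetical index-to-name mapping with integer keys (illustration only).
tag2name = {0: "O", 1: "B-Application", 2: "I-Application"}

# JSON object keys are always strings, so the integer keys
# do not survive a dump/load round trip.
restored = json.loads(json.dumps(tag2name))
print(list(restored.keys()))  # ['0', '1', '2']

# Converting the keys back to int mirrors the dict comprehension
# used in the quoted encoding() method.
fixed = {int(key): value for key, value in restored.items()}
print(fixed == tag2name)  # True
```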

The output in the terminal until that point is:

Predicting a total of 1 files
Setting up cuda
Working on GPU: 0

Setting up output handler

Setting up data handler
Loading given encodings

Is there something else I'm not getting right, or is it just that this encoding.json file is missing? If so, can you please provide it so that I can run the predict function with the pretrained checkpoint?

Thanks in advance for your help! :-)

dave-s477 commented 1 year ago

Hello, you are right, the file is required to run the pre-trained model. It was an oversight that it had not yet been added to the repository. I am glad you caught it.

For now I added the encoding under ./data/encoding.json. (We will also add the encoding file to Zenodo as soon as possible.) The corresponding parameter in the configuration file should be general/checkpoint/save_dir. I performed a quick test and it should work to directly set "save_dir"="./data/encoding.json" in the config.
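
Going only by the load_encoding snippet quoted earlier in this thread, the file path is built as save_dir followed by "/encoding.json". Under that assumption, the sketch below shows which path would be opened for a given save_dir value and checks whether the file exists before prediction is started. The helper names encoding_path and check_encoding are hypothetical, and the code in the repository may differ from the quoted snippet.

```python
import os

def encoding_path(save_dir):
    # Mirrors the format string in the quoted load_encoding();
    # assumed from the thread, not verified against the current repository.
    return '{}/encoding.json'.format(save_dir)

def check_encoding(save_dir):
    # Return the resolved path and whether a file exists there.
    path = encoding_path(save_dir)
    return path, os.path.isfile(path)

print(encoding_path('./data'))  # ./data/encoding.json

path, found = check_encoding('./data')
if not found:
    print('encoding file not found at', path)
```

With that snippet, a save_dir of ./data resolves to ./data/encoding.json.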

PetrosStav commented 1 year ago

Thank you very much @dave-s477 for the quick response, everything is working now!

Another quick question; in the encoding.json, as well as in the predictions from the system, I see that you have the software and soft_type predictions along with their tags.

The soft_type and mention_type tags seem straightforward, so my question is: why do "B-Application" and "I-Application" appear in both software and soft_type?

My goal here is to map them to the "Software Type", "Mention Type" and "Additional Information" that are outlined in the SoMeSci paper.

Thank you again for your help! :-)

  "software": {
      "O": 0,
      "B-Application": 1,
      "B-Version": 2,
      "B-Citation": 3,
      "B-Developer": 4,
      "I-Developer": 5,
      "I-Version": 6,
      "B-Release": 7,
      "I-Application": 8,
      "B-Extension": 9,
      "B-Abbreviation": 10,
      "B-URL": 11,
      "I-Release": 12,
      "I-URL": 13,
      "B-AlternativeName": 14,
      "I-AlternativeName": 15,
      "I-Extension": 16,
      "I-Citation": 17,
      "B-License": 18,
      "I-License": 19,
      "I-Abbreviation": 20
  },
  "soft_type": {
      "O": 0,
      "B-Application": 1,
      "B-PlugIn": 2,
      "I-Application": 3,
      "B-ProgrammingEnvironment": 4,
      "I-PlugIn": 5,
      "B-OperatingSystem": 6,
      "I-OperatingSystem": 7,
      "I-ProgrammingEnvironment": 8,
      "B-SoftwareCoreference": 9,
      "I-SoftwareCoreference": 10
  },
  "mention_type": {
      "O": 0,
      "B-Usage": 1,
      "I-Usage": 2,
      "B-Mention": 3,
      "B-Creation": 4,
      "I-Creation": 5,
      "I-Mention": 6,
      "B-Deposition": 7,
      "I-Deposition": 8
  }
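
One way to inspect the overlap asked about above is to strip the BIO prefixes and compare the label sets per task. The sketch below does this on an abbreviated copy of the mappings (only a few tags per task are included); the helper names label_set and invert are hypothetical and not part of SoMeNLP.

```python
# Abbreviated copy of the tag-to-index mappings shown above.
encoding = {
    "software": {"O": 0, "B-Application": 1, "B-Version": 2, "I-Application": 8},
    "soft_type": {"O": 0, "B-Application": 1, "B-PlugIn": 2, "I-Application": 3},
    "mention_type": {"O": 0, "B-Usage": 1, "I-Usage": 2, "B-Mention": 3},
}

def label_set(tag2idx):
    # Strip the BIO prefix ("B-" or "I-") and drop the outside tag "O".
    return {tag.split("-", 1)[1] for tag in tag2idx if tag != "O"}

def invert(tag2idx):
    # Build the index-to-tag mapping needed to decode predicted indices.
    return {idx: tag for tag, idx in tag2idx.items()}

labels = {task: label_set(mapping) for task, mapping in encoding.items()}

# Labels that occur in both the software and the soft_type task.
shared = labels["software"] & labels["soft_type"]
print(shared)  # {'Application'}

# Decode a made-up sequence of predicted indices for the software task.
idx2tag = invert(encoding["software"])
print([idx2tag[i] for i in [1, 8, 0, 2]])
# ['B-Application', 'I-Application', 'O', 'B-Version']
```

This only shows where the tag names overlap. How each task maps to the categories in the SoMeSci paper is a question for the authors.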