THUDM / OAG-AQA


Issue using pretrained model #5

Closed Jonas-sc closed 5 months ago

Jonas-sc commented 5 months ago

I am having issues using the pretrained model. When running

python generate_dense_embeddings.py

I get the following output:

generate_dense_embeddings.py:86: UserWarning: 
The version_base parameter is not specified.
Please specify a compatability version level, or None.
Will assume defaults for version 1.1
  @hydra.main(config_path="conf", config_name="gen_embs")     
C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\defaults_list.py:251: UserWarning: In 'gen_embs': Defaults list is missing `_self_`. See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/default_composition_order for more information
  warnings.warn(msg, UserWarning)
C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\core\default_element.py:128: UserWarning: In 'ctx_sources/default_sources': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/changes_to_package_header for more information
  See {url} for more information"""
C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\core\default_element.py:128: UserWarning: In 'encoder/hf_bert': Usage of deprecated keyword in package header '# @package _group_'.
See https://hydra.cc/docs/1.2/upgrades/1.0_to_1.1/changes_to_package_header for more information
C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\hydra.py:127: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  configure_logging=with_log_configuration,
[2024-05-28 14:29:58,496][root][INFO] - CFG's local_rank=-1
[2024-05-28 14:29:58,496][root][INFO] - Env WORLD_SIZE=None
[2024-05-28 14:29:58,497][root][INFO] - Initialized host JSDELL as d.rank -1 on device=cpu, n_gpu=0, world size=1
[2024-05-28 14:29:58,497][root][INFO] - 16-bits training: False
[2024-05-28 14:29:58,498][root][INFO] - Reading saved model from C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/outputs/2024-04-15/15-27-08/output_dpr/dpr_biencoder.29
[2024-05-28 14:29:59,496][root][INFO] - model_state_dict keys odict_keys(['model_dict', 'optimizer_dict', 'scheduler_dict', 'offset', 'epoch', 'encoder_params'])
[2024-05-28 14:29:59,497][root][INFO] - CFG:
[2024-05-28 14:29:59,500][root][INFO] - encoder:
  encoder_model_type: hf_bert
  pretrained_model_cfg: C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/PTMs/bert-base-uncased
  pretrained_file: C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/PTMs/bert-base-uncased
  projection_dim: 0
  sequence_length: 256
  dropout: 0.1
  fix_ctx_encoder: false
  pretrained: true
ctx_sources:
  dpr_wiki:
    _target_: dpr.data.retriever_data.CsvCtxSrc
    file: /home/shishijie-test2/workspace/project/DPR/dpr/downloads/data/wikipedia_split/psgs_w100.tsv
    id_prefix: 'wiki:'
  dpr_stackex_qa:
    _target_: dpr.data.retriever_data.CsvCtxSrc
    file: C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/dpr/candidate_papers.tsv
    id_prefix: 'wiki:'
model_file: C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/outputs/2024-04-15/15-27-08/output_dpr/dpr_biencoder.29
ctx_src: dpr_stackex_qa
encoder_type: ctx
out_file: C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/outputs/2024-04-15/15-27-08/output_dpr/ctx_encoder_29.pkl_0
do_lower_case: true
shard_id: 0
num_shards: 1
batch_size: 128
tables_as_passages: false
special_tokens: null
tables_chunk_sz: 100
tables_split_type: type1
local_rank: -1
device: cpu
distributed_world_size: 1
distributed_port: null
no_cuda: false
n_gpu: 0
fp16: false
fp16_opt_level: O1

[2024-05-28 14:29:59,711][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/PTMs/bert-base-uncased
[2024-05-28 14:30:00,520][dpr.models.hf_models][INFO] - Initializing HF BERT Encoder. cfg_name=C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/PTMs/bert-base-uncased
[2024-05-28 14:30:01,387][root][INFO] - Loading saved model state ...
[2024-05-28 14:30:01,442][root][INFO] - reading data source: dpr_stackex_qa
[2024-05-28 14:30:01,444][dpr.data.retriever_data][INFO] - Reading file C:/Users/jonas_prg/programming/uni/DL_Project/Python3_7/OAG-AQA/data/kddcup/dpr/candidate_papers.tsv
Error executing job with overrides: []
Traceback (most recent call last):
  File "generate_dense_embeddings.py", line 159, in <module>
    main()
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\main.py", line 99, in decorated_main
    config_name=config_name,
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\utils.py", line 401, in _run_hydra
    overrides=overrides,
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\utils.py", line 458, in _run_app
    lambda: hydra.run(
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\utils.py", line 223, in run_and_report
    raise ex
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\utils.py", line 220, in run_and_report
    return func()
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\_internal\utils.py", line 461, in <lambda>
    raise self._return_value
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\.venv\lib\site-packages\hydra\core\utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "generate_dense_embeddings.py", line 132, in main
    ctx_src.load_data_to(all_passages_dict)
  File "C:\Users\jonas_prg\programming\uni\DL_Project\Python3_7\OAG-AQA\dpr\data\retriever_data.py", line 275, in load_data_to
    for row in reader:
  File "C:\Users\jonas_prg\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1700: character maps to <undefined>

I changed the config files specified in the README, as well as options.py in the dpr folder, because otherwise it would overwrite my model file paths with those stored in the saved model and raise an error. I have all requirements installed except Hydra, as it conflicts with Hydra-core.

I am using Windows.

Any help is appreciated.

zfjsail commented 5 months ago

Hi, the code has not been tested on Windows. Does it run normally on Linux?

Jonas-sc commented 5 months ago

I got it working on Linux. I also had to edit OAG-AQA/dpr/data/retriever_data.py, changing with open(self.file) as ifile: to with open(self.file, encoding='utf-8') as ifile: on lines 106 and 273.
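For anyone hitting the same UnicodeDecodeError, here is a minimal sketch of why the explicit encoding matters. It is not the repository's code: read_tsv_rows and the sample file are made up for illustration. On Windows, open() without an encoding argument falls back to the locale codec (cp1252 in the traceback above), and byte 0x9d, which is the last byte of the UTF-8 encoding of a right curly quote, has no mapping in that codec.

```python
import csv
import os
import tempfile


def read_tsv_rows(path, encoding="utf-8"):
    """Read a tab-separated file using an explicit encoding (hypothetical helper)."""
    with open(path, encoding=encoding, newline="") as ifile:
        reader = csv.reader(ifile, delimiter="\t")
        return [row for row in reader]


# Build a small UTF-8 sample file containing curly quotes.
# U+201D (right curly quote) is encoded as the bytes E2 80 9D in UTF-8.
sample = "id\ttext\ttitle\n1\tA \u201cquoted\u201d passage\tExample\n"
tmp = tempfile.NamedTemporaryFile("wb", suffix=".tsv", delete=False)
tmp.write(sample.encode("utf-8"))
tmp.close()

# Reading as UTF-8 works regardless of the platform default.
rows = read_tsv_rows(tmp.name)

# Reading the same bytes as cp1252 fails on byte 0x9d.
try:
    read_tsv_rows(tmp.name, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

os.unlink(tmp.name)
```

Passing encoding='utf-8' explicitly makes the behaviour the same on every platform, so the change is harmless where UTF-8 is already the default.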

When running bash dense_retriever.sh, I also had to change line 378 of OAG-AQA/dense_retriever.py to my own file path. I think this path should be read from the cfg file, but it is currently hard-coded.
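A rough sketch of what reading the path from the config could look like. Everything here is an assumption for illustration: the key name candidate_file, the helper resolve_data_path, and the use of SimpleNamespace as a stand-in for the Hydra cfg object. It does not reflect what line 378 actually contains.

```python
from pathlib import Path
from types import SimpleNamespace


def resolve_data_path(cfg, key, fallback=None):
    """Return the path stored under `key` in cfg, or `fallback` if it is unset.

    `cfg` only needs attribute access; here it stands in for the Hydra config.
    """
    value = getattr(cfg, key, None)
    if value is None:
        if fallback is None:
            raise ValueError(f"'{key}' is not set in the config and no fallback was given")
        value = fallback
    return Path(value)


# Hypothetical config entry replacing a hard-coded path.
cfg = SimpleNamespace(candidate_file="data/kddcup/dpr/candidate_papers.tsv")
path = resolve_data_path(cfg, "candidate_file")
```

With something along these lines, each user would only need to edit the config file (or pass an override on the command line) instead of editing the script.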

With these changes it works on Linux for me. I also tried it on Windows, but the terminal crashes when I try to run bash dense_retriever.sh.