HazyResearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail
http://hazyresearch.stanford.edu/bootleg
Apache License 2.0

Entity embedding training is not using GPU on Google Colab Pro+ #112

Closed: coolcoder001 closed this issue 2 years ago

coolcoder001 commented 2 years ago

Hi @lorr1,

The entity_embedding_tutorial file is not using the GPU on Google Colab, even though a GPU is available.

I am using Google Colab Pro+, which provides 51 GB of RAM.

The "Building entity data from scratch" step fails every time with an out-of-memory (RAM) error. It is supposed to use the GPU, right?
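
For reference, a minimal sketch (not from the tutorial) for checking what the Colab runtime actually reports, using torch and psutil, both of which are typically preinstalled on Colab:

import psutil
import torch

# Check whether a CUDA device is visible to PyTorch.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# System RAM is what the failing step exhausts, so report it too.
print("Total RAM (GB):", round(psutil.virtual_memory().total / 1e9, 1))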

Here are the config parameters:

data_config:
  context_mask_perc: 0.0
  data_dir: /content/bootleg/tutorials/data
  data_prep_dir: prep
  dev_dataset:
    file: merged_sample.jsonl
    use_weak_label: true
  entity_dir: /content/bootleg/tutorials/data/entity_db
  entity_kg_data:
    kg_symbols_dir: kg_mappings
    use_entity_kg: true
  entity_type_data:
    type_symbols_dir: type_mappings/wiki
    use_entity_types: true
  eval_slices:
  - unif_all
  - unif_NS_all
  - unif_HD
  - unif_TO
  - unif_TL
  - unif_TS
  max_ent_len: 128
  max_seq_len: 128
  max_seq_window_len: 64
  overwrite_preprocessed_data: false
  test_dataset:
    file: merged_sample.jsonl
    use_weak_label: true
  train_dataset:
    file: train.jsonl
    use_weak_label: true
  train_in_candidates: true
  use_entity_desc: true
  word_embedding:
    bert_model: bert-base-uncased
    cache_dir: /content/bootleg/tutorials/data/pretrained_bert_models
    context_layers: 6
    entity_layers: 6
emmental:
  checkpoint_all: true
  checkpoint_freq: 1
  checkpoint_metric: NED/Bootleg/dev/final_loss/acc_boot:max
  checkpointing: true
  clear_intermediate_checkpoints: false
  counter_unit: batch
  evaluation_freq: 21432
  fp16: true
  grad_clip: 1.0
  gradient_accumulation_steps: 1
  l2: 0.01
  log_path: /content/bootleg/tutorials/data/bootleg_wiki
  lr: 2e-5
  lr_scheduler: linear
  n_steps: 428648
  online_eval: false
  dataparallel: false
  use_exact_log_path: true
  warmup_percentage: 0.1
  write_loss_per_step: true
  writer: json
model_config:
  hidden_size: 200
  normalize: true
  temperature: 0.10
run_config:
  #dataloader_threads: 2
  dataset_threads: 20
  eval_batch_size: 32
  log_level: DEBUG
  spawn_method: forkserver
train_config:
  batch_size: 32

lorr1 commented 2 years ago

So building the entity data (before any model is loaded on the GPU) is a CPU-bound process. It is a heavily parallel process that builds a large matrix of entity token IDs and related data to be accessed during training. I usually build this on a machine with 100-150 GB of memory. If you use fewer entities, it will take up less memory. You can also try setting dataset_threads to 1; that will also reduce the memory pressure on the CPU.
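
A minimal sketch of applying that suggestion by editing the config before re-running (the config path below is hypothetical; only run_config.dataset_threads is changed, mirroring the YAML posted above):

import yaml

# Hypothetical path to the tutorial config shown above.
config_path = "/content/bootleg/tutorials/data/bootleg_wiki_config.yaml"

with open(config_path) as f:
    config = yaml.safe_load(f)

# Entity-data prep is CPU-bound; a single worker lowers peak RAM usage.
config["run_config"]["dataset_threads"] = 1

with open(config_path, "w") as f:
    yaml.safe_dump(config, f, default_flow_style=False)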