dalab / deep-ed

Source code for the EMNLP'17 paper "Deep Joint Entity Disambiguation with Local Neural Attention", https://arxiv.org/abs/1704.04920
Apache License 2.0

How can I get entity vecs for a specified entity set? #23

Open Cantoria opened 5 years ago

Cantoria commented 5 years ago

Hi, I read your code and I know I can get all entity vecs by changing the -entities flag of learn_a.lua. I don't need such a large set of vectors. How can I train entity vecs for only a specified entity set? Thanks.

Cantoria commented 5 years ago

By the way, when I run step 9 (I didn't run the earlier steps, but I've downloaded all the files from polybox), I get this error:

```
==> Loading entity wikiid - name map
---> t7 file NOT found. Loading from disk (slower). Out f = /home/xuhongbo/syh/syh/deep-ed/data/generated/ent_name_id_map.t7
==> Loading disambiguation index
Done loading disambiguation index
Still loading entity wikiid - name map ...
/home/xuhongbo/torch/install/bin/lua: ...me/xuhongbo/torch/install/share/lua/5.1/tds/hash.lua:108: bad argument #1 to 'pairs' (table expected, got userdata)
stack traceback:
  C: in function 'pairs'
  ...me/xuhongbo/torch/install/share/lua/5.1/tds/hash.lua:108: in function 'write'
  .../xuhongbo/torch/install/share/lua/5.1/torch/File.lua:210: in function 'writeObject'
  .../xuhongbo/torch/install/share/lua/5.1/torch/File.lua:388: in function 'save'
  entities/ent_name2id_freq/ent_name_id.lua:76: in main chunk
  C: in function 'dofile'
  entities/ent_name2id_freq/e_freq_gen.lua:16: in main chunk
  C: in function 'dofile'
  .../torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
```

It seems that something went wrong while generating ent_name_id_map.t7: the ent_name_id_map.t7 that ended up in the generated directory is only 35 B. I don't really know Lua, so please tell me what's wrong. Thanks!
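One possible reading of the traceback, not confirmed by the maintainers in this thread: the failure occurs inside tds/hash.lua when `pairs` is handed a userdata rather than a Lua table. Stock Lua 5.1 `pairs` accepts only tables and ignores the `__pairs` metamethod, which was introduced in Lua 5.2 (LuaJIT honors it only when built with Lua 5.2 compatibility). The check below is a small sketch in plain Lua, using a hypothetical helper name, that reports whether the running interpreter honors `__pairs`.

```lua
-- Hypothetical helper: returns true if `pairs` honors the __pairs metamethod
-- (Lua 5.2+, or LuaJIT built with 5.2 compatibility), false on stock Lua 5.1.
local function pairs_honors_metamethod()
  local called = false
  local proxy = setmetatable({}, {
    __pairs = function(t)
      called = true
      return next, {}, nil  -- iterate over an empty table
    end,
  })
  for _ in pairs(proxy) do end
  return called
end

print(_VERSION, 'pairs honors __pairs:', pairs_honors_metamethod())
```

If this prints `false`, iterating over non-table objects with `pairs` is likely to fail in the same way as in the log above.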

octavian-ganea commented 5 years ago

Hi. The set of entities for which the current code trains entity embeddings is defined here: https://github.com/dalab/deep-ed/blob/master/entities/relatedness/relatedness.lua#L253-L328

You would have to modify this code to train with a different set of entities.

As for your error, I am not sure. Try deleting your ent_name_id_map.t7 and redoing that step. These t7 files are not rewritten when you change code or data, so they have to be deleted manually and then regenerated.
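Deleting a stale cache file before rerunning a step can be sketched in plain Lua as follows. The helper name is hypothetical, and the path in the usage comment is the one printed in the log above; adjust it to your own data directory.

```lua
-- Hypothetical helper: delete a cached file so that the next run regenerates it.
-- Returns true if the file is gone afterwards (removed, or it never existed).
local function remove_stale_cache(path)
  local f = io.open(path, 'rb')
  if not f then
    return true  -- nothing to delete
  end
  f:close()
  local ok, err = os.remove(path)
  if not ok then
    return false, err
  end
  return true
end

-- Example (path taken from the log above):
-- remove_stale_cache('/home/xuhongbo/syh/syh/deep-ed/data/generated/ent_name_id_map.t7')
```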

Cantoria commented 5 years ago


Hi, I've found the reason for the error. I was using Lua 5.1, and it doesn't support Torch, so I installed Lua 5.3, and now it works. As for the first question, I've modified the code so that I can train the entity vecs on a specific entity set, but I got some t7 files in the ./data/generated/ent_vecs path. Here is the modified code (in the main code, formerly lines 253-328):

if not paths.filep(rewtr_t7filename) then
  print('  ---> t7 file NOT found. Loading reltd_ents_wikiid_to_rltdid from txt file instead (slower).')

  -- Gather the restricted set of entities for which we train entity embeddings:
  local rltd_all_ent_wikiids = tds.Hash()

  -- 1) From the relatedness dataset
  for ent_wikiid,_ in pairs(reltd_ents_direct_validate) do
    rltd_all_ent_wikiids[ent_wikiid] = 1
  end
  for ent_wikiid,_ in pairs(reltd_ents_direct_test) do
    rltd_all_ent_wikiids[ent_wikiid] = 1
  end

  -- 1.1) From a small dataset (used for debugging / unit testing).
  for _,line in pairs(ent_lines_4EX) do
    local parts = split(line, '\t')
    assert(table_len(parts) == 3)
    ent_wikiid = tonumber(parts[1])
    assert(ent_wikiid)
    rltd_all_ent_wikiids[ent_wikiid] = 1
  end

  -- 2) From all ED datasets: (I've deleted this part)
  -- 3) From a specific entity set (code I added; one entity name per line)
  local specific_entity_file = 'specific_entity_file'
  local specific_entity_path = opt.root_data_dir .. 'basic_data/' .. specific_entity_file
  if not paths.filep(specific_entity_path) then
    print('No specific entity file!')
  else
    dofile 'entities/ent_name2id_freq/ent_name_id.lua'
    local it = assert(io.open(specific_entity_path))
    -- Iterate over every line of the file.
    for line in it:lines() do
      local ent_wikiid = e_id_name.ent_name2wikiid[line]
      -- Skip names that are missing from the name -> wikiid map.
      if ent_wikiid then
        rltd_all_ent_wikiids[ent_wikiid] = 1
      end
    end
    it:close()
  end
  -- The code below is unchanged
  -- Insert unk_ent_wikiid
  local unk_ent_wikiid = 1
  rltd_all_ent_wikiids[unk_ent_wikiid] = 1

  -- Sort all wikiids
  local sorted_rltd_all_ent_wikiids = tds.Vec()
  for ent_wikiid,_ in pairs(rltd_all_ent_wikiids) do
    sorted_rltd_all_ent_wikiids:insert(ent_wikiid)
  end
  sorted_rltd_all_ent_wikiids:sort(function(a,b) return a < b end)

  local reltd_ents_wikiid_to_rltdid = tds.Hash()
  for rltd_id,wikiid in pairs(sorted_rltd_all_ent_wikiids) do
    reltd_ents_wikiid_to_rltdid[wikiid] = rltd_id
  end

  rewtr = tds.Hash()
  rewtr.reltd_ents_wikiid_to_rltdid = reltd_ents_wikiid_to_rltdid
  rewtr.reltd_ents_rltdid_to_wikiid = sorted_rltd_all_ent_wikiids
  rewtr.num_rltd_ents = #sorted_rltd_all_ent_wikiids

  print('Writing reltd_ents_wikiid_to_rltdid to t7 File for future usage.')
  torch.save(rewtr_t7filename, rewtr)
  print('    Done saving.')

Is that correct? (The specific entity file records one entity name per line.) I also noticed that you add a small dataset in step 1 and step 1.1. Can I remove this step? If I can't, does the small dataset influence the final entity vecs?

octavian-ganea commented 5 years ago

Thanks for your input.

Yes, the small dataset in 1.1 can be removed; it was just for debugging (it contains fewer than 10 entities, if I recall correctly).

To access the specific entity vectors, you first have to load the t7 file via https://github.com/dalab/deep-ed/blob/master/entities/relatedness/relatedness.lua#L331 and then look the vectors up using the dictionaries in the rewtr hashtable object: https://github.com/dalab/deep-ed/blob/master/entities/relatedness/relatedness.lua#L321-L324

Given the wiki ID of an entity, you first find its rltdid using rewtr.reltd_ents_wikiid_to_rltdid[your_wiki_id], and then you read its embedding from row rltdid of the entity embedding tensor (from the t7 file). See an example here: https://github.com/dalab/deep-ed/blob/master/entities/pretrained_e2v/e2v.lua#L3-L28 . Sorry, this code could have been made easier ...
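The two-step lookup described above can be sketched in plain Lua. This is only an illustration under stated assumptions: ordinary tables stand in for the tds.Hash / tds.Vec objects and for the Torch embedding tensor, the wiki IDs and vector values are made up, and `build_rewtr` and `lookup_embedding` are hypothetical helper names, not functions from the repository.

```lua
-- Build the wikiid <-> rltdid maps from a set of wiki IDs, mirroring the
-- sort-then-enumerate logic of the snippet posted earlier in this thread.
local function build_rewtr(wikiid_set)
  local sorted = {}
  for wikiid in pairs(wikiid_set) do
    sorted[#sorted + 1] = wikiid
  end
  table.sort(sorted)

  local wikiid_to_rltdid = {}
  for rltd_id, wikiid in ipairs(sorted) do
    wikiid_to_rltdid[wikiid] = rltd_id
  end

  return {
    reltd_ents_wikiid_to_rltdid = wikiid_to_rltdid,
    reltd_ents_rltdid_to_wikiid = sorted,
    num_rltd_ents = #sorted,
  }
end

-- Step 1: wiki ID -> rltdid; step 2: rltdid -> row of the embedding matrix.
local function lookup_embedding(rewtr, embeddings, wikiid)
  local rltd_id = rewtr.reltd_ents_wikiid_to_rltdid[wikiid]
  if not rltd_id then
    return nil  -- the entity was not in the trained set
  end
  return embeddings[rltd_id]
end

-- Made-up data: three wiki IDs (1 plays the role of the unk entity).
local rewtr = build_rewtr({ [1] = true, [3434750] = true, [9239] = true })
local embeddings = {
  { 0.0, 0.0 },  -- row 1 -> wikiid 1
  { 0.1, 0.2 },  -- row 2 -> wikiid 9239
  { 0.3, 0.4 },  -- row 3 -> wikiid 3434750
}
local vec = lookup_embedding(rewtr, embeddings, 9239)
print(vec[1], vec[2])
```

With an actual Torch tensor, indexing with a single integer should likewise select that row, but the e2v.lua example linked above is the reference for how the repository itself does it.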