dalab / deep-ed

Source code for the EMNLP'17 paper "Deep Joint Entity Disambiguation with Local Neural Attention", https://arxiv.org/abs/1704.04920
Apache License 2.0

gen_test_train_data/gen_all.lua (Step. 10) doesn't work #21

Closed titsuki closed 5 years ago

titsuki commented 5 years ago

Hi, Step 10 doesn't work in my environment. I suspect my environment is misconfigured. Could you tell me which Lua and Linux versions the paper's experiments were run on?

Steps to reproduce:

# Status: Step1. ~ Step.9 are done
root@96a4f2e0d6cd:~/deep-ed# th data_gen/gen_test_train_data/gen_all.lua -root_data_dir /root/
==> Loading redirects index 
    Done loading redirects index    
==> Loading entity wikiid - name map    
  ---> from t7 file: /root/generated/ent_name_id_map.t7 
    Done loading entity name - wikiid. Size thid index = 4306070    
==> Loading crosswikis_wikipedia from file /root/generated/crosswikis_wikipedia_p_e_m.txt   
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
==> Loading yago index from file /root/generated/yago_p_e_m.txt 
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
    Done loading index  

Generating test data from AIDA set  
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Iván García not found. Redirects file needs to be loaded for better performance. 
Entity Akhbar not found. Redirects file needs to be loaded for better performance.  
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Oksana Grishina not found. Redirects file needs to be loaded for better performance. 
Entity Craig Brown not found. Redirects file needs to be loaded for better performance. 
Entity John Collins not found. Redirects file needs to be loaded for better performance.    
Entity International Boxing Association not found. Redirects file needs to be loaded for better performance.    
Entity Ramón Ramírez not found. Redirects file needs to be loaded for better performance.   
Done validation testA :     
num_nme = 1126; num_nonexistent_ent_title = 3189    
num_nonexistent_ent_id = 0; num_nonexistent_both = 35   
num_correct_ents = 1567; num_total_ents = 4791  
Entity World Open not found. Redirects file needs to be loaded for better performance.  
Entity Douglas Young not found. Redirects file needs to be loaded for better performance.   
Entity Douglas Young not found. Redirects file needs to be loaded for better performance.   
Entity James Love not found. Redirects file needs to be loaded for better performance.  
Entity Noel Whelan not found. Redirects file needs to be loaded for better performance. 
    Done AIDA.  
num_nme = 2257; num_nonexistent_ent_title = 6255    
num_nonexistent_ent_id = 0; num_nonexistent_both = 72   
num_correct_ents = 2949; num_total_ents = 9276  

Generating train data from AIDA set     
Entity Craig Brown not found. Redirects file needs to be loaded for better performance. 
Entity International cricketers of South African origin not found. Redirects file needs to be loaded for better performance.    
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Carlos Costa not found. Redirects file needs to be loaded for better performance.    
Entity Antonio Esposito not found. Redirects file needs to be loaded for better performance.    
Entity Independence Day (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Independence Day (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Erik Hanson not found. Redirects file needs to be loaded for better performance. 
Entity Erik Hanson not found. Redirects file needs to be loaded for better performance. 
Entity Iván García not found. Redirects file needs to be loaded for better performance. 
Entity Camelot, Chesapeake, Virginia not found. Redirects file needs to be loaded for better performance.   
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Gordon Parsons not found. Redirects file needs to be loaded for better performance.  
Entity Xhosa not found. Redirects file needs to be loaded for better performance.   
Entity Xhosa not found. Redirects file needs to be loaded for better performance.   
Entity Jamaat-e-Islami not found. Redirects file needs to be loaded for better performance. 
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Franz Konrad not found. Redirects file needs to be loaded for better performance.    
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Carlos Costa not found. Redirects file needs to be loaded for better performance.    
Entity Craig Evans not found. Redirects file needs to be loaded for better performance. 
Entity Preston not found. Redirects file needs to be loaded for better performance. 
Entity Superman (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Superman (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Ashta not found. Redirects file needs to be loaded for better performance.   
Entity John Smiley not found. Redirects file needs to be loaded for better performance. 
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Oksana Grishina not found. Redirects file needs to be loaded for better performance. 
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Bandundu not found. Redirects file needs to be loaded for better performance.    
Entity Čelopek not found. Redirects file needs to be loaded for better performance. 
    Done AIDA.  
num_nme = 4855; num_nonexistent_ent_title = 12103   
num_nonexistent_ent_id = 0; num_nonexistent_both = 236  
num_correct_ents = 6202; num_total_ents = 18541 
==> Loading redirects index 
    Done loading redirects index    
==> Loading entity wikiid - name map    
  ---> from t7 file: /root/generated/ent_name_id_map.t7 
    Done loading entity name - wikiid. Size thid index = 4306070    
==> Loading crosswikis_wikipedia from file /root/generated/crosswikis_wikipedia_p_e_m.txt   
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
==> Loading yago index from file /root/generated/yago_p_e_m.txt 
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
    Done loading index  

Generating test data from wikipedia set     
Entity Christina (given name) not found. Redirects file needs to be loaded for better performance.  
Christina (given name)  
Entity Christina (given name) not found. Redirects file needs to be loaded for better performance.  
Christina (given name)  
Entity Kirsten not found. Redirects file needs to be loaded for better performance. 
Kirsten 
/root/torch/install/bin/luajit: data_gen/gen_test_train_data/gen_ace_msnbc_aquaint_csv.lua:184: attempt to index local 'it' (a nil value)
stack traceback:
    data_gen/gen_test_train_data/gen_ace_msnbc_aquaint_csv.lua:184: in function 'gen_test_ace'
    data_gen/gen_test_train_data/gen_ace_msnbc_aquaint_csv.lua:202: in main chunk
    [C]: in function 'dofile'
    data_gen/gen_test_train_data/gen_all.lua:13: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

Environment:

Dockerfile:

FROM ubuntu:16.04

RUN apt-get update
RUN apt-get install -y git
RUN git clone https://github.com/torch/distro.git /root/torch --recursive
RUN apt-get install -y sudo python-software-properties unzip wget
RUN apt-get install -y lua5.1:amd64 lua5.1-dev:amd64
RUN wget https://luarocks.org/releases/luarocks-3.0.4.tar.gz \
        && tar zxpf luarocks-3.0.4.tar.gz \
        && cd luarocks-3.0.4 \
        && ./configure \
        && sudo bash -c "make bootstrap" \
        && sudo bash -c "luarocks install luasocket"
RUN cd /root/torch && sudo bash -c "./install-deps" && ./install.sh

RUN mkdir /root/generated
RUN git clone https://github.com/dalab/deep-ed /root/deep-ed

setup.sh (a setup script; I ran it after building the Docker container):

curl -c /tmp/cookies "https://drive.google.com/uc?export=download&id=0Bx8d3azIm_ZcbHMtVmRVc1o5TWM" > /tmp/intermed-basic-data.html
curl -L -b /tmp/cookies "https://drive.google.com$(cat /tmp/intermed-basic-data.html | grep -Po 'uc-download-link" [^>]* href="\K[^"]*' | sed 's/\&amp;/\&/g')" > /root/basic_data.zip
cd /root && unzip basic_data.zip

curl -c /tmp/cookies "https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM" > /tmp/intermed-w2v.html
curl -L -b /tmp/cookies "https://drive.google.com$(cat /tmp/intermed-w2v.html | grep -Po 'uc-download-link" [^>]* href="\K[^"]*' | sed 's/\&amp;/\&/g')" > /root/GoogleNews-vectors-negative300.bin.gz
cd /root && gunzip GoogleNews-vectors-negative300.bin.gz
mv /root/GoogleNews-vectors-negative300.bin /root/basic_data/wordEmbeddings/Word2Vec

luarocks install tds

Docker version (on the host machine):

$ docker --version
Docker version 18.06.1-ce, build e68fc7a

Cheers,

octavian-ganea commented 5 years ago

It seems to me that the file opened at gen_ace_msnbc_aquaint_csv.lua:182 does not exist. Can you check that the path defined in line gen_ace_msnbc_aquaint_csv.lua:42 contains the test datasets? This should be true if step 4 was done properly.
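
One way to run this kind of check is a small shell helper that reports which expected files are missing under a dataset root. This is only a sketch: `check_dataset_files` is a hypothetical helper, not part of deep-ed, and the relative file names in the example call are assumptions taken from the directory listing posted later in this thread. Adjust them to whatever gen_ace_msnbc_aquaint_csv.lua actually opens.

```shell
#!/bin/sh
# Hypothetical helper (not part of deep-ed): print one "MISSING: <path>" line
# for every expected file that is absent under the given root directory, and
# return non-zero if at least one file is missing.
check_dataset_files() {
    root="$1"
    shift
    missing=0
    for rel in "$@"; do
        # -f is true only for an existing regular file
        if [ ! -f "$root/$rel" ]; then
            echo "MISSING: $root/$rel"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (assumed file names; edit to match your setup):
# check_dataset_files /root/basic_data/test_datasets/wned-datasets \
#     ace2004/ace2004.xml msnbc/msnbc.xml aquaint/aquaint.xml
```

If the helper prints nothing and returns 0, all listed files exist, and the nil file handle probably comes from a different path than the ones checked.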

titsuki commented 5 years ago

@octavian-ganea Thanks for your response!

Can you check that the path defined in line gen_ace_msnbc_aquaint_csv.lua:42 contains the test datasets?

Here is the output of tree:

# tree /root/basic_data/test_datasets/wned-datasets/
/root/basic_data/test_datasets/wned-datasets/
|-- README
|-- WHERE_TO_GET_THIS_DATA~
|-- ace2004
|   |-- RawText
|   |   |-- 20000715_AFP_ARB.0072.eng
|   |   |-- 20000815_AFP_ARB.0071.eng
|   |   |-- 20001015_AFP_ARB.0053.eng
|   |   |-- 20001015_AFP_ARB.0229.eng
|   |   |-- 20001115_AFP_ARB.0013.eng
|   |   |-- 20001115_AFP_ARB.0030.eng
|   |   |-- 20001115_AFP_ARB.0060.eng
|   |   |-- 20001115_AFP_ARB.0061.eng
|   |   |-- 20001115_AFP_ARB.0065.eng
|   |   |-- 20001115_AFP_ARB.0072.eng
|   |   |-- 20001115_AFP_ARB.0089.eng
|   |   |-- 20001115_AFP_ARB.0093.eng
|   |   |-- 20001115_AFP_ARB.0184.eng
|   |   |-- 20001115_AFP_ARB.0210.eng
|   |   |-- 20001115_AFP_ARB.0212.eng
|   |   |-- 20001115_AFP_ARB.0217.eng
|   |   |-- APW20001001.2021.0521
|   |   |-- APW20001002.0615.0146
|   |   |-- APW20001016.1325.0321
|   |   |-- APW20001017.1313.0396
|   |   |-- APW20001022.1735.0376
|   |   |-- APW20001023.2100.0686
|   |   |-- APW20001102.1223.0376
|   |   |-- APW20001120.1450.0376
|   |   |-- APW20001127.1346.0419
|   |   |-- APW20001130.2108.0849
|   |   |-- APW20001202.0257.0120
|   |   |-- APW20001203.1456.0329
|   |   |-- APW20001207.2118.0838
|   |   |-- APW20001208.1126.0362
|   |   |-- APW20001211.0507.0196
|   |   |-- APW20001216.2012.0590
|   |   |-- APW20001218.2221.0727
|   |   |-- APW20001219.1316.0416
|   |   |-- APW20001225.2035.0477
|   |   |-- NYT20001002.1754.0290
|   |   |-- NYT20001101.2212.0429
|   |   |-- NYT20001106.1705.0187
|   |   |-- NYT20001109.1946.0315
|   |   |-- NYT20001123.1511.0062
|   |   |-- NYT20001124.2050.0257
|   |   |-- NYT20001125.1558.0117
|   |   |-- NYT20001129.2040.0383
|   |   |-- NYT20001217.2241.0165
|   |   |-- PRI20001031.2000.1824
|   |   |-- PRI20001122.2000.0320
|   |   |-- PRI20001128.2000.0055
|   |   |-- PRI20001201.2000.1828
|   |   |-- VOA20001020.2100.1853
|   |   |-- VOA20001129.2000.0364
|   |   |-- VOA20001208.2000.1275
|   |   |-- VOA20001220.2000.0060
|   |   |-- VOA20001223.2000.0139
|   |   |-- chtb_165.eng
|   |   |-- chtb_171.eng
|   |   |-- chtb_227.eng
|   |   `-- chtb_267.eng
|   `-- ace2004.xml
|-- aquaint
|   |-- RawText
|   |   |-- APW19980603_0791.htm
|   |   |-- APW19980603_1617.htm
|   |   |-- APW19980604_0787.htm
|   |   |-- APW19980610_0111.htm
|   |   |-- APW19980611_0774.htm
|   |   |-- APW19980614_0031.htm
|   |   |-- APW19980615_0417.htm
|   |   |-- APW19980620_0458.htm
|   |   |-- APW19980624_0436.htm
|   |   |-- APW19980624_0607.htm
|   |   |-- APW19980625_1136.htm
|   |   |-- APW19980627_0596.htm
|   |   |-- APW19980709_0263.htm
|   |   |-- APW19980713_0449.htm
|   |   |-- APW19980808_0196.htm
|   |   |-- APW19980811_0512.htm
|   |   |-- APW19980816_0994.htm
|   |   |-- APW19980824_0827.htm
|   |   |-- APW19980903_1073.htm
|   |   |-- APW19980917_0818.htm
|   |   |-- APW19980930_0284.htm
|   |   |-- APW19980930_0522.htm
|   |   |-- APW19981001_0866.htm
|   |   |-- APW19981010_0354.htm
|   |   |-- APW19981020_1367.htm
|   |   |-- APW19981022_0630.htm
|   |   |-- APW19981022_0710.htm
|   |   |-- APW19981026_0096.htm
|   |   |-- APW19981106_0920.htm
|   |   |-- APW19981109_0140.htm
|   |   |-- APW19981109_0152.htm
|   |   |-- APW19981109_0440.htm
|   |   |-- APW19981109_0464.htm
|   |   |-- APW19981109_1089.htm
|   |   |-- APW19981109_1172.htm
|   |   |-- APW19981113_0500.htm
|   |   |-- APW19981113_0729.htm
|   |   |-- APW19981119_0585.htm
|   |   |-- APW19981120_1056.htm
|   |   |-- APW19981130_0743.htm
|   |   |-- APW19981210_0433.htm
|   |   |-- APW19981215_1083.htm
|   |   |-- APW19990120_0179.htm
|   |   |-- APW19990203_0315.htm
|   |   |-- APW19990519_0141.htm
|   |   |-- APW19990526_0131.htm
|   |   |-- APW19990827_0137.htm
|   |   |-- APW19990827_0184.htm
|   |   |-- APW20000303_0067.htm
|   |   `-- APW20000312_0050.htm
|   `-- aquaint.xml
|-- clueweb
|   |-- RawText
|   |   |-- clueweb12-0500wb-04-24340
|   |   |-- clueweb12-0500wb-07-07346
|   |   |-- clueweb12-0500wb-07-08673
|   |   |-- clueweb12-0500wb-07-08677
|   |   |-- clueweb12-0500wb-07-11872
|   |   |-- clueweb12-0500wb-07-28725
|   |   |-- clueweb12-0500wb-10-30439
|   |   |-- clueweb12-0500wb-16-19356
|   |   |-- clueweb12-0500wb-16-24407
|   |   |-- clueweb12-0500wb-17-08784
|   |   |-- clueweb12-0500wb-17-27238
|   |   |-- clueweb12-0500wb-19-13842
|   |   |-- clueweb12-0500wb-30-10322
|   |   |-- clueweb12-0500wb-30-19142
|   |   |-- clueweb12-0500wb-31-00905
|   |   |-- clueweb12-0500wb-31-03359
|   |   |-- clueweb12-0500wb-31-14123
|   |   |-- clueweb12-0500wb-33-03178
|   |   |-- clueweb12-0500wb-35-02519
|   |   |-- clueweb12-0500wb-35-21293
|   |   |-- clueweb12-0500wb-35-24654
|   |   |-- clueweb12-0500wb-35-33000
|   |   |-- clueweb12-0500wb-35-33005
|   |   |-- clueweb12-0500wb-37-00364
|   |   |-- clueweb12-0500wb-37-04409
|   |   |-- clueweb12-0500wb-38-02702
|   |   |-- clueweb12-0500wb-39-21685
|   |   |-- clueweb12-0500wb-54-21485
|   |   |-- clueweb12-0500wb-61-15900
|   |   |-- clueweb12-0500wb-62-09545
|   |   |-- clueweb12-0500wb-62-19286
|   |   |-- clueweb12-0500wb-68-02401
|   |   |-- clueweb12-0500wb-68-22591
|   |   |-- clueweb12-0500wb-68-30102
|   |   |-- clueweb12-0500wb-70-20598
|   |   |-- clueweb12-0500wb-72-07757
|   |   |-- clueweb12-0500wb-74-13854
|   |   |-- clueweb12-0500wb-74-14380
|   |   |-- clueweb12-0500wb-74-15300
|   |   |-- clueweb12-0500wb-77-31039
|   |   |-- clueweb12-0500wb-84-00046
|   |   |-- clueweb12-0500wb-96-03580
|   |   |-- clueweb12-0500wb-96-11959
|   |   |-- clueweb12-0500wb-96-14121
|   |   |-- clueweb12-0501wb-00-09041
|   |   |-- clueweb12-0501wb-00-36510
|   |   |-- clueweb12-0501wb-01-15462
|   |   |-- clueweb12-0501wb-03-25958
|   |   |-- clueweb12-0501wb-04-01041
|   |   |-- clueweb12-0501wb-04-06229
|   |   |-- clueweb12-0501wb-04-09196
|   |   |-- clueweb12-0501wb-04-30288
|   |   |-- clueweb12-0501wb-05-05698
|   |   |-- clueweb12-0501wb-05-16863
|   |   |-- clueweb12-0501wb-05-17155
|   |   |-- clueweb12-0501wb-05-31052
|   |   |-- clueweb12-0501wb-06-28661
|   |   |-- clueweb12-0501wb-09-02087
|   |   |-- clueweb12-0501wb-09-25340
|   |   |-- clueweb12-0501wb-12-23529
|   |   |-- clueweb12-0501wb-16-00982
|   |   |-- clueweb12-0501wb-16-20050
|   |   |-- clueweb12-0501wb-17-15597
|   |   |-- clueweb12-0501wb-18-10891
|   |   |-- clueweb12-0501wb-18-11681
|   |   |-- clueweb12-0501wb-21-13304
|   |   |-- clueweb12-0501wb-21-25436
|   |   |-- clueweb12-0501wb-23-07710
|   |   |-- clueweb12-0501wb-25-01939
|   |   |-- clueweb12-0501wb-25-03448
|   |   |-- clueweb12-0501wb-25-11404
|   |   |-- clueweb12-0501wb-27-26993
|   |   |-- clueweb12-0501wb-27-33204
|   |   |-- clueweb12-0501wb-29-03710
|   |   |-- clueweb12-0501wb-29-11725
|   |   |-- clueweb12-0501wb-29-32464
|   |   |-- clueweb12-0501wb-30-29261
|   |   |-- clueweb12-0501wb-31-16772
|   |   |-- clueweb12-0501wb-31-17194
|   |   |-- clueweb12-0501wb-31-22539
|   |   |-- clueweb12-0501wb-31-27451
|   |   |-- clueweb12-0501wb-33-16523
|   |   |-- clueweb12-0501wb-34-00945
|   |   |-- clueweb12-0501wb-34-34376
|   |   |-- clueweb12-0501wb-35-23286
|   |   |-- clueweb12-0501wb-37-01789
|   |   |-- clueweb12-0501wb-38-01436
|   |   |-- clueweb12-0501wb-39-18485
|   |   |-- clueweb12-0501wb-39-26619
|   |   |-- clueweb12-0501wb-40-14762
|   |   |-- clueweb12-0501wb-40-22301
|   |   |-- clueweb12-0501wb-41-31320
|   |   |-- clueweb12-0501wb-42-15997
|   |   |-- clueweb12-0501wb-42-25891
|   |   |-- clueweb12-0501wb-43-13087
|   |   |-- clueweb12-0501wb-44-00170
|   |   |-- clueweb12-0501wb-45-19647
|   |   |-- clueweb12-0501wb-46-21729
|   |   |-- clueweb12-0501wb-48-00449
|   |   |-- clueweb12-0501wb-48-19861
|   |   |-- clueweb12-0501wb-49-25960
|   |   |-- clueweb12-0501wb-52-13555
|   |   |-- clueweb12-0501wb-52-32516
|   |   |-- clueweb12-0501wb-54-22871
|   |   |-- clueweb12-0501wb-55-00003
|   |   |-- clueweb12-0501wb-55-06011
|   |   |-- clueweb12-0501wb-56-10514
|   |   |-- clueweb12-0501wb-56-16789
|   |   |-- clueweb12-0501wb-58-21067
|   |   |-- clueweb12-0501wb-59-11649
|   |   |-- clueweb12-0501wb-61-27562
|   |   |-- clueweb12-0501wb-62-15643
|   |   |-- clueweb12-0501wb-64-14081
|   |   |-- clueweb12-0501wb-65-12193
|   |   |-- clueweb12-0501wb-66-24342
|   |   |-- clueweb12-0501wb-67-13782
|   |   |-- clueweb12-0501wb-68-12348
|   |   |-- clueweb12-0501wb-69-31274
|   |   |-- clueweb12-0501wb-70-05124
|   |   |-- clueweb12-0501wb-70-14463
|   |   |-- clueweb12-0501wb-71-01322
|   |   |-- clueweb12-0501wb-71-06735
|   |   |-- clueweb12-0501wb-72-08395
|   |   |-- clueweb12-0501wb-73-35834
|   |   |-- clueweb12-0501wb-75-15053
|   |   |-- clueweb12-0501wb-76-00560
|   |   |-- clueweb12-0501wb-76-11292
|   |   |-- clueweb12-0501wb-76-21404
|   |   |-- clueweb12-0501wb-76-28210
|   |   |-- clueweb12-0501wb-77-15901
|   |   |-- clueweb12-0501wb-78-14351
|   |   |-- clueweb12-0501wb-78-19265
|   |   |-- clueweb12-0501wb-79-00401
|   |   |-- clueweb12-0501wb-79-05134
|   |   |-- clueweb12-0501wb-80-04105
|   |   |-- clueweb12-0501wb-80-17824
|   |   |-- clueweb12-0501wb-80-27834
|   |   |-- clueweb12-0501wb-81-15537
|   |   |-- clueweb12-0501wb-83-07261
|   |   |-- clueweb12-0501wb-83-18207
|   |   |-- clueweb12-0501wb-85-11282
|   |   |-- clueweb12-0501wb-86-16603
|   |   |-- clueweb12-0501wb-86-19365
|   |   |-- clueweb12-0501wb-86-24847
|   |   |-- clueweb12-0501wb-86-25362
|   |   |-- clueweb12-0501wb-87-14877
|   |   |-- clueweb12-0501wb-88-01729
|   |   |-- clueweb12-0501wb-88-22871
|   |   |-- clueweb12-0501wb-91-13151
|   |   |-- clueweb12-0501wb-91-19043
|   |   |-- clueweb12-0501wb-91-19798
|   |   |-- clueweb12-0501wb-92-16378
|   |   |-- clueweb12-0501wb-92-23889
|   |   |-- clueweb12-0501wb-93-01863
|   |   |-- clueweb12-0501wb-93-23297
|   |   |-- clueweb12-0501wb-97-28866
|   |   |-- clueweb12-0501wb-98-00398
|   |   |-- clueweb12-0501wb-98-01695
|   |   |-- clueweb12-0501wb-98-27096
|   |   |-- clueweb12-0502wb-01-07830
|   |   |-- clueweb12-0502wb-01-07839
|   |   |-- clueweb12-0502wb-01-13335
|   |   |-- clueweb12-0502wb-05-04897
|   |   |-- clueweb12-0502wb-06-07392
|   |   |-- clueweb12-0502wb-07-00626
|   |   |-- clueweb12-0502wb-07-25981
|   |   |-- clueweb12-0502wb-08-17623
|   |   |-- clueweb12-0502wb-09-02801
|   |   |-- clueweb12-0502wb-10-02722
|   |   |-- clueweb12-0502wb-11-01409
|   |   |-- clueweb12-0502wb-11-09817
|   |   |-- clueweb12-0502wb-11-10748
|   |   |-- clueweb12-0502wb-12-25515
|   |   |-- clueweb12-0502wb-13-32426
|   |   |-- clueweb12-0502wb-19-13507
|   |   |-- clueweb12-0502wb-22-10515
|   |   |-- clueweb12-0502wb-23-20364
|   |   |-- clueweb12-0502wb-23-20374
|   |   |-- clueweb12-0502wb-24-09989
|   |   |-- clueweb12-0502wb-24-10471
|   |   |-- clueweb12-0502wb-26-30335
|   |   |-- clueweb12-0502wb-27-22535
|   |   |-- clueweb12-0502wb-28-11340
|   |   |-- clueweb12-0502wb-28-36092
|   |   |-- clueweb12-0502wb-30-08900
|   |   |-- clueweb12-0502wb-32-04277
|   |   |-- clueweb12-0502wb-32-21133
|   |   |-- clueweb12-0502wb-32-30117
|   |   |-- clueweb12-0502wb-33-10905
|   |   |-- clueweb12-0502wb-34-07818
|   |   |-- clueweb12-0502wb-35-07067
|   |   |-- clueweb12-0502wb-35-35928
|   |   |-- clueweb12-0502wb-36-04706
|   |   |-- clueweb12-0502wb-36-20589
|   |   |-- clueweb12-0502wb-38-30139
|   |   |-- clueweb12-0502wb-39-16287
|   |   |-- clueweb12-0502wb-39-22053
|   |   |-- clueweb12-0502wb-39-24640
|   |   |-- clueweb12-0502wb-39-31620
|   |   |-- clueweb12-0502wb-42-10179
|   |   |-- clueweb12-0502wb-48-27650
|   |   |-- clueweb12-0502wb-52-07233
|   |   |-- clueweb12-0502wb-53-22806
|   |   |-- clueweb12-0502wb-53-22811
|   |   |-- clueweb12-0502wb-54-30197
|   |   |-- clueweb12-0502wb-56-04891
|   |   |-- clueweb12-0502wb-60-21143
|   |   |-- clueweb12-0502wb-60-32167
|   |   |-- clueweb12-0502wb-61-32124
|   |   |-- clueweb12-0502wb-62-18382
|   |   |-- clueweb12-0502wb-65-07212
|   |   |-- clueweb12-0502wb-65-07334
|   |   |-- clueweb12-0502wb-68-34428
|   |   |-- clueweb12-0502wb-72-29686
|   |   |-- clueweb12-0502wb-74-05238
|   |   |-- clueweb12-0502wb-78-19678
|   |   |-- clueweb12-0502wb-78-19697
|   |   |-- clueweb12-0502wb-83-02837
|   |   |-- clueweb12-0502wb-83-28640
|   |   |-- clueweb12-0502wb-85-15102
|   |   |-- clueweb12-0502wb-87-26082
|   |   |-- clueweb12-0502wb-87-29367
|   |   |-- clueweb12-0502wb-89-17873
|   |   |-- clueweb12-0502wb-90-19859
|   |   |-- clueweb12-0502wb-94-00368
|   |   |-- clueweb12-0502wb-94-30665
|   |   |-- clueweb12-0503wb-00-01032
|   |   |-- clueweb12-0503wb-00-04603
|   |   |-- clueweb12-0503wb-00-04929
|   |   |-- clueweb12-0503wb-00-06989
|   |   |-- clueweb12-0503wb-00-07839
|   |   |-- clueweb12-0503wb-00-07940
|   |   |-- clueweb12-0503wb-00-10109
|   |   |-- clueweb12-0503wb-00-10360
|   |   |-- clueweb12-0503wb-00-11874
|   |   |-- clueweb12-0503wb-00-27980
|   |   |-- clueweb12-0503wb-01-11548
|   |   |-- clueweb12-0503wb-01-22470
|   |   |-- clueweb12-0503wb-01-33875
|   |   |-- clueweb12-0503wb-03-11022
|   |   |-- clueweb12-0503wb-05-04508
|   |   |-- clueweb12-0503wb-05-29884
|   |   |-- clueweb12-0503wb-06-03758
|   |   |-- clueweb12-0503wb-06-09868
|   |   |-- clueweb12-0503wb-07-25927
|   |   |-- clueweb12-0503wb-08-14873
|   |   |-- clueweb12-0503wb-11-26383
|   |   |-- clueweb12-0503wb-11-26407
|   |   |-- clueweb12-0503wb-11-26409
|   |   |-- clueweb12-0503wb-11-32519
|   |   |-- clueweb12-0503wb-12-00731
|   |   |-- clueweb12-0503wb-12-15032
|   |   |-- clueweb12-0503wb-13-06212
|   |   |-- clueweb12-0503wb-13-18380
|   |   |-- clueweb12-0503wb-16-14284
|   |   |-- clueweb12-0503wb-16-22821
|   |   |-- clueweb12-0503wb-16-28606
|   |   |-- clueweb12-0503wb-17-12889
|   |   |-- clueweb12-0503wb-17-19581
|   |   |-- clueweb12-0503wb-18-01494
|   |   |-- clueweb12-0503wb-19-19449
|   |   |-- clueweb12-0503wb-21-23741
|   |   |-- clueweb12-0503wb-23-30843
|   |   |-- clueweb12-0503wb-23-34273
|   |   |-- clueweb12-0503wb-24-23914
|   |   |-- clueweb12-0503wb-27-18578
|   |   |-- clueweb12-0503wb-28-03992
|   |   |-- clueweb12-0503wb-28-04880
|   |   |-- clueweb12-0503wb-29-03440
|   |   |-- clueweb12-0503wb-32-03304
|   |   |-- clueweb12-0503wb-32-07315
|   |   |-- clueweb12-0503wb-34-01393
|   |   |-- clueweb12-0503wb-34-13315
|   |   |-- clueweb12-0503wb-39-03743
|   |   |-- clueweb12-0503wb-39-13394
|   |   |-- clueweb12-0503wb-39-17456
|   |   |-- clueweb12-0503wb-40-01750
|   |   |-- clueweb12-0503wb-44-06215
|   |   |-- clueweb12-0503wb-45-08115
|   |   |-- clueweb12-0503wb-45-16307
|   |   |-- clueweb12-0503wb-46-16624
|   |   |-- clueweb12-0503wb-46-17619
|   |   |-- clueweb12-0503wb-46-17631
|   |   |-- clueweb12-0503wb-46-17632
|   |   |-- clueweb12-0503wb-49-29942
|   |   |-- clueweb12-0503wb-54-06811
|   |   |-- clueweb12-0503wb-57-06075
|   |   |-- clueweb12-0503wb-57-17777
|   |   |-- clueweb12-0503wb-58-02370
|   |   |-- clueweb12-0503wb-58-12655
|   |   |-- clueweb12-0503wb-59-02552
|   |   |-- clueweb12-0503wb-61-23674
|   |   |-- clueweb12-0503wb-62-12469
|   |   |-- clueweb12-0503wb-63-23259
|   |   |-- clueweb12-0503wb-63-29106
|   |   |-- clueweb12-0503wb-64-08266
|   |   |-- clueweb12-0503wb-65-15894
|   |   |-- clueweb12-0503wb-69-01292
|   |   |-- clueweb12-0503wb-69-12537
|   |   |-- clueweb12-0503wb-69-12742
|   |   |-- clueweb12-0503wb-70-06937
|   |   |-- clueweb12-0503wb-70-07956
|   |   |-- clueweb12-0503wb-70-18257
|   |   |-- clueweb12-0503wb-71-15780
|   |   |-- clueweb12-0503wb-75-19080
|   |   |-- clueweb12-0503wb-76-28205
|   |   |-- clueweb12-0503wb-77-04955
|   |   |-- clueweb12-0503wb-77-13425
|   |   |-- clueweb12-0503wb-77-27166
|   |   |-- clueweb12-0503wb-78-21632
|   |   |-- clueweb12-0503wb-79-09330
|   |   |-- clueweb12-0503wb-80-31901
|   |   |-- clueweb12-0503wb-84-16931
|   |   |-- clueweb12-0503wb-84-16932
|   |   |-- clueweb12-0503wb-89-28933
|   |   |-- clueweb12-0503wb-91-03655
|   |   |-- clueweb12-0503wb-96-05351
|   |   |-- clueweb12-0503wb-96-11521
|   |   |-- clueweb12-0503wb-97-01049
|   |   `-- clueweb12-0503wb-97-14877
|   |-- clueweb-name2bracket.tsv
|   |-- clueweb-result-summary.tsv.csv
|   `-- clueweb.xml
|-- msnbc
|   |-- RawText
|   |   |-- 13259309
|   |   |-- 16384904
|   |   |-- 16417540
|   |   |-- 16442287
|   |   |-- 16442342
|   |   |-- 16443053
|   |   |-- 16444023
|   |   |-- 16444229
|   |   |-- 16444287
|   |   |-- 16447201
|   |   |-- 16447720
|   |   |-- 16451112
|   |   |-- 16451212
|   |   |-- 16451635
|   |   |-- 16452612
|   |   |-- 16453733
|   |   |-- 16454203
|   |   |-- 16454435
|   |   |-- 16455207
|   |   `-- 3683270
|   `-- msnbc.xml
`-- wikipedia
    |-- RawText
    |   |-- 1966#U201368_Liga_Leumit
    |   |-- 1994_Winter_Olympics_opening_ceremony
    |   |-- 1996_Big_12_Championship_Game
    |   |-- 2009_European_Pairs_Speedway_Championship
    |   |-- 2009_Superfinalen
    |   |-- 2009_Team_Speedway_Junior_European_Championship
    |   |-- 2010_Marshall_Thundering_Herd_football_team
    |   |-- 2010_NASCAR_Canadian_Tire_Series_season
    |   |-- 2011_Valencian_Community_motorcycle_Grand_Prix
    |   |-- 4769_Castalia
    |   |-- A_Trip_Down_Memory_Lane
    |   |-- Aaron_Thomas_(cricketer)
    |   |-- Abbey_Park,_Nottinghamshire
    |   |-- Alabama_State_Route_13
    |   |-- Alessandro_Gramigni
    |   |-- Alexander_MacDonnell,_3rd_Earl_of_Antrim
    |   |-- Alfred_Conkling_Coxe,_Sr.
    |   |-- Alfred_Schickel
    |   |-- Andrea_Giganti
    |   |-- Andrew_Carter_(cricketer)
    |   |-- Andrew_Hele
    |   |-- Andrew_Procter_(cricketer)
    |   |-- Andy_Flynn_(footballer)
    |   |-- Ante-chapel
    |   |-- Antonio_Rossi
    |   |-- Appollo_(dog)
    |   |-- Arnold_Townsend
    |   |-- Arthur_Keegan
    |   |-- Assembly_of_European_Wine-producing_Regions
    |   |-- Atiq-ul-Rehman
    |   |-- Augustus_Simon_Frazer
    |   |-- AutoTrack
    |   |-- BLU-109_bomb
    |   |-- Barrett_Green
    |   |-- Bastille_discography
    |   |-- Battle_of_Vila_Velha
    |   |-- Beeren_Island
    |   |-- Bering_Sea_Squadron
    |   |-- Big_Blue_River_(Indiana)
    |   |-- Bill_Schulz
    |   |-- Bioneers
    |   |-- Black_Lake_Bayou
    |   |-- Bob_Coverdale
    |   |-- Bradley_Dale_Peveto
    |   |-- Brian_Tamberlin
    |   |-- Bulgarian_Black_Sea_Coast
    |   |-- CA_Saint-#U00c9tienne_Loire_Sud_Rugby
    |   |-- Calumet_Region
    |   |-- Cave_Rock_Tunnel
    |   |-- Cecelia_Joyce
    |   |-- Central_Appalachian_pine-oak_rocky_woodland
    |   |-- Central_Lakes_State_Trail
    |   |-- Ch#U00e2teau_d'Oiron
    |   |-- Charles_Fitzgerald_(rugby)
    |   |-- Chetco_people
    |   |-- Children_in_Need_Rocks_Manchester
    |   |-- Chippenham_United_F.C.
    |   |-- Chris_Rushworth
    |   |-- Christine_(name)
    |   |-- Christopher_Andrus
    |   |-- Clara_Nordstr#U00f6m
    |   |-- Claudiopolis_(Cilicia)
    |   |-- Cody_monoplane
    |   |-- Colin_Evans_(rugby)
    |   |-- Colombo_West_Electoral_District
    |   |-- Colombophis
    |   |-- Colorado_State_Highway_94
    |   |-- Commonwealth_men
    |   |-- Confessin'
    |   |-- Country_blues
    |   |-- Cyclone_Taylor_Trophy
    |   |-- Czech_Republic_men's_national_ice_hockey_team
    |   |-- D-block_contraction
    |   |-- Daniel_Bovet
    |   |-- Daniel_Levy_(politician)
    |   |-- Darren_Shadford
    |   |-- David_West_(basketball)
    |   |-- Davit_Kubriashvili
    |   |-- Dedi_I,_Margrave_of_the_Saxon_Ostmark
    |   |-- Dennis_Pilgrim
    |   |-- Derek_Morgan_(cricketer)
    |   |-- Derrick_Schofield
    |   |-- Diabolique_(band)
    |   |-- Division_of_Port_Adelaide
    |   |-- Donald_Hogarth
    |   |-- Doug_Insole
    |   |-- Doug_Melvin_(rower)
    |   |-- Douglas_Dickinson
    |   |-- EMD_E8
    |   |-- East_Mississippi_State_Hospital
    |   |-- El_Madrid_de_los_Austrias
    |   |-- Electoral_district_of_Colton
    |   |-- Electoral_district_of_Mount_Hawthorn
    |   |-- Electoral_district_of_Murray-Darling
    |   |-- Electoral_district_of_Wembley_Beaches
    |   |-- Electoral_division_of_Apsley
    |   |-- Empower_MediaMarketing
    |   |-- Energy_in_Sudan
    |   |-- Enticho_(woreda)
    |   |-- Evelyn_Fanshawe
    |   |-- Exchange_Quay_Metrolink_station
    |   |-- Ficoll
    |   |-- Flavius_Justus
    |   |-- Florida_Gulf_Coast_Eagles_men's_basketball
    |   |-- Frances_Carpenter
    |   |-- Frank_A._Moore
    |   |-- Frank_Coombs
    |   |-- Frank_Mortimer
    |   |-- Frank_S._Pepper
    |   |-- Fred_J._Hume_Award
    |   |-- Fresia,_Chile
    |   |-- Furanocoumarin
    |   |-- G-sharp_major
    |   |-- Gabriel_Bouvery
    |   |-- Gao_Wei
    |   |-- Gemaal_Hussain
    |   |-- Gender_binary
    |   |-- Genesis_Group
    |   |-- George_Clifford_Wilson
    |   |-- George_Waddell
    |   |-- Gerwyn_Edwards
    |   |-- Giovanni_Battista_Landolina
    |   |-- Gmina_Jaraczewo
    |   |-- Gmina_Przedecz
    |   |-- Gmina_Tucz#U0119py
    |   |-- Goh_Seng_Choo_Gallery
    |   |-- Goldie_Hexagon_Racing
    |   |-- Gran_Omar
    |   |-- Greater_London_Council_election,_1970
    |   |-- Green_Lane_railway_station
    |   |-- Gregg_Brandon
    |   |-- Hagop_Sandaldjian
    |   |-- Halsey_Beshears
    |   |-- Hama_Yumi
    |   |-- Harry_Hooper_(footballer_born_1910)
    |   |-- Harry_Taylor_(rugby_league)
    |   |-- Harvard_Crimson_men's_lacrosse
    |   |-- Hassanine_Sebei
    |   |-- Heikki_Kovalainen
    |   |-- Hittin'_the_Trail_for_Hallelujah_Land
    |   |-- Hockley_Valley_Provincial_Nature_Reserve
    |   |-- Hong_Kong_Family_Welfare_Society
    |   |-- House_of_Palatinate-Birkenfeld
    |   |-- Houston_College_Classic
    |   |-- Hugh_Waddell_(rugby_union)
    |   |-- Hughie_Wilson
    |   |-- Human_image_synthesis
    |   |-- Hunterdon_Plateau
    |   |-- Iemasa_Tokugawa
    |   |-- Inland_Waterways_Authority_of_India
    |   |-- Inspectorates-General_(Turkey)
    |   |-- Interstate_691
    |   |-- Iowa's_10th_congressional_district
    |   |-- Iowa's_11th_congressional_district
    |   |-- Iowa_Highway_7
    |   |-- Jablanica_(river)
    |   |-- Jack_Kennedy_(hurler)
    |   |-- Jackie_Tyrrell
    |   |-- Jacob_de_Jager
    |   |-- Jacques_Thibaud
    |   |-- James_Barrow
    |   |-- James_Motluk
    |   |-- Jan-Erasmus_Quellinus
    |   |-- Janene_Higgins
    |   |-- Jeanne_d'#U00c9vreux
    |   |-- Jeffris_Hopkins
    |   |-- Jeremy_Davis
    |   |-- Jessica_Mauboy_discography
    |   |-- Ji#U0159#U00ed_T#U0159anovsk#U00fd
    |   |-- Jimmy_Rooney
    |   |-- John_Burton_(political_agent)
    |   |-- John_Moore_(cricketer,_born_1943)
    |   |-- John_Wertheim
    |   |-- Johnny_Moss
    |   |-- Jos#U00e9_Evangelista
    |   |-- Joseph_J._Cannon
    |   |-- Joseph_Smith_(cricketer)
    |   |-- Juan_Cruz_Ochoa
    |   |-- Juan_Cuevas_Perales
    |   |-- Judy_Roderick
    |   |-- Julian_Knowles
    |   |-- Julius_Scriver
    |   |-- June_Preisser
    |   |-- Jutta_Nardenbach
    |   |-- Katrine_Lunde_Haraldsen
    |   |-- Kazuo_Aoki
    |   |-- Kenneth_Willis_Clark_Collection
    |   |-- Kilometre_Zero_(Bucharest)
    |   |-- King_Diamond_discography
    |   |-- Krasi,_Thalassa_Kai_T'_Agori_Mou
    |   |-- Larry_Worrell
    |   |-- Laurie_Johnson_(cricketer)
    |   |-- Law_&_Order_(season_16)
    |   |-- Leading_Creek_(Ohio)
    |   |-- Leighton_Hodges
    |   |-- Live_Nation_UK
    |   |-- Love's_Welcome_at_Bolsover
    |   |-- Love_&_Life_(Mary_J._Blige_album)
    |   |-- Lubov_Egorova
    |   |-- Luc_Alphand
    |   |-- M-79_(Michigan_highway)
    |   |-- MV_Tustumena
    |   |-- Maclay_Murray_&_Spens
    |   |-- Madarihat
    |   |-- Major_League_Baseball_on_TSN
    |   |-- Malcolm_Azania
    |   |-- Malone_Area_Heritage_Museum
    |   |-- Maneer_Mirza
    |   |-- Manfred_Seissler
    |   |-- Manti_National_Forest
    |   |-- Marcel_Hirscher
    |   |-- Marcus_Marvell
    |   |-- Marcus_Thomas_Pius_Gilbert
    |   |-- Margit_Schumann
    |   |-- Margot_Leverett
    |   |-- Marillier_shot
    |   |-- Marksville_culture
    |   |-- Markus_Prock
    |   |-- Mary_O'Connor_(sportsperson)
    |   |-- Matt_Higgins_(ice_hockey)
    |   |-- Matt_Kohler
    |   |-- Maxwell_Hunter
    |   |-- May_Peterson_Thompson
    |   |-- Melville-Saltcoats
    |   |-- Men_at_Work_(season_1)
    |   |-- Messier_49
    |   |-- Michael_Youll
    |   |-- Mike_Smith_(jazz_saxophonist)
    |   |-- Mississippi_Delta_National_Heritage_Area
    |   |-- Mississippi_Hills_National_Heritage_Area
    |   |-- Mondo_2000
    |   |-- Moses_Hamon
    |   |-- Mountadam_Vineyards
    |   |-- NTFS-3G
    |   |-- Neal_Porter
    |   |-- Nebraska_Highway_11
    |   |-- Nelsonic_Industries
    |   |-- Nembrionic
    |   |-- Nether_Poppleton_Tithebarn
    |   |-- New_Manchester
    |   |-- New_York_Yankees_(1936_AFL)
    |   |-- Nidderdale_Way
    |   |-- Noel_Purcell_(water_polo)
    |   |-- Nucleoid
    |   |-- OK_Hotel
    |   |-- Omicron2_Canis_Majoris
    |   |-- Oregon_Route_10
    |   |-- Oriol_Lozano
    |   |-- Paddy_Tuimavave
    |   |-- Panhandle
    |   |-- Parkway_Center_Mall
    |   |-- Party_of_New_Forces
    |   |-- Paul_New
    |   |-- Paul_Roshier
    |   |-- Pawnee_Rangers
    |   |-- Peire_Pelet
    |   |-- Penn_State_Lady_Lions_basketball
    |   |-- Penske_PC-22
    |   |-- Peter_Rochford
    |   |-- Peter_Scott_(cricketer)
    |   |-- Petorca_Province
    |   |-- Philip_Threlfall
    |   |-- Progressive_Democratic_Party_(Tunisia)
    |   |-- Province_of_Calatayud
    |   |-- Putin's_rynda
    |   |-- R#U00edo_Verde,_Chile
    |   |-- R._F._Bayford
    |   |-- Rabbit_River
    |   |-- Rainer_Polak
    |   |-- Rally_Ireland
    |   |-- Randy_Turner
    |   |-- Rapp_Road_Community_Historic_District
    |   |-- Richard_of_Salerno
    |   |-- Rio_Grande_Association
    |   |-- River_Bride
    |   |-- Robert_Alexander_(rugby_union_and_cricket)
    |   |-- Roger_Clitheroe
    |   |-- Rogier_Koordes
    |   |-- Roland_Hyatt
    |   |-- Roman_Catholic_Diocese_of_Superior
    |   |-- Ron_Ryder
    |   |-- Roy_Vincent
    |   |-- Ruby_B._DeMesme
    |   |-- Rugby_union_in_Asia
    |   |-- Satavahana_Express
    |   |-- Scotch_and_Soda
    |   |-- Shaka_Smart
    |   |-- Sherwin_Campbell
    |   |-- Shorkot
    |   |-- Simon_Hugh_Holmes
    |   |-- Simon_L._Adler
    |   |-- Solemn_League_and_Covenant
    |   |-- Sopwith_1919_Schneider_Cup_Seaplane
    |   |-- Source_of_the_Nile_(board_game)
    |   |-- South_Carolina_Highway_200
    |   |-- South_East_Lancashire_(UK_Parliament_constituency)
    |   |-- South_Gippsland_Highway
    |   |-- Southwest_Tennessee_Development_District
    |   |-- Spectacled_Tern
    |   |-- Spondylosoma
    |   |-- St._Michael_the_Archangel_Church_(Cleveland,_Ohio)
    |   |-- Statue_of_Europe
    |   |-- Steadfastness_and_Confrontation_Front
    |   |-- Steep_Falls,_Maine
    |   |-- Stephen_Martin_(field_hockey)
    |   |-- Steve_Durbano
    |   |-- Stewart_Hutton
    |   |-- Sturgeon_House
    |   |-- Taifa_of_Badajoz
    |   |-- Taylor_Pond_Wild_Forest
    |   |-- Teddy_Holland
    |   |-- Texas_State_Highway_110
    |   |-- The_Great_White_Hope_(film)
    |   |-- Theramine
    |   |-- Thomas_Land_(Drayton_Manor)
    |   |-- Thomas_Pearsall_(cricketer)
    |   |-- Tim_Hemp
    |   |-- Todd_Wider
    |   |-- Tom_Baxter_(footballer_born_1903)
    |   |-- Tommy_Cairns
    |   |-- Tony_Blanco
    |   |-- Tony_Drake
    |   |-- Trade_Lines_(newspaper)
    |   |-- Turkey_River_(Iowa)
    |   |-- Ubaoner
    |   |-- Ulrike_Maier
    |   |-- Urla_Clashes
    |   |-- Vi_vil_oss_et_land
    |   |-- Victor_Croome
    |   |-- Vijaya_Dasa
    |   |-- W#U00fcrttemberger
    |   |-- Wesley_Brown_Field_House
    |   |-- Wheeling_Creek_(Ohio)
    |   |-- Whitesands_Bay_(Pembrokeshire)
    |   |-- Wijnand_van_der_Sanden
    |   |-- William_Carr_Lane
    |   |-- William_Wood,_1st_Baron_Hatherley
    |   |-- Wolf_Prize
    |   |-- World_Without_Superman
    |   |-- X_Corps_(Union_Army)
    |   |-- Ya'akov_Riftin
    |   |-- Yakov_Malkiel
    |   |-- Yellowback_stingaree
    |   |-- Yves_Fortier_(lawyer)
    |   `-- Zielona_G#U00f3ra_(parliamentary_constituency)
    |-- wikipedia-name2bracket.tsv
    `-- wikipedia.xml

10 directories, 802 files
octavian-ganea commented 5 years ago

Yes, but you should debug for yourself why line 182 is not opening a valid file.

titsuki commented 5 years ago

@octavian-ganea I found that some filenames in basic_data.zip differ from the original WNED filenames ( https://www.dropbox.com/s/987hmjdoq0cql9z/WNED.tar.gz ).

For example,

WNED > wned-datasets > wikipedia:

RawText/Zielona_Góra_(parliamentary_constituency)

basic_data.zip:

RawText/Zielona_G#U00f3ra_(parliamentary_constituency)

So I replaced the files from basic_data.zip with the original WNED ones.

After that, step 10 passes:

# th data_gen/gen_test_train_data/gen_all.lua -root_data_dir /root/
==> Loading redirects index 
    Done loading redirects index    
==> Loading entity wikiid - name map    
  ---> from t7 file: /root/generated/ent_name_id_map.t7 
    Done loading entity name - wikiid. Size thid index = 4306070    
==> Loading crosswikis_wikipedia from file /root/generated/crosswikis_wikipedia_p_e_m.txt   
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
==> Loading yago index from file /root/generated/yago_p_e_m.txt 
Processed 2000000 lines.    
Processed 4000000 lines.    
Processed 6000000 lines.    
Processed 8000000 lines.    
Processed 10000000 lines.   
Processed 12000000 lines.   
    Done loading index  

Generating test data from AIDA set  
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Iván García not found. Redirects file needs to be loaded for better performance. 
Entity Akhbar not found. Redirects file needs to be loaded for better performance.  
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Oksana Grishina not found. Redirects file needs to be loaded for better performance. 
Entity Craig Brown not found. Redirects file needs to be loaded for better performance. 
Entity John Collins not found. Redirects file needs to be loaded for better performance.    
Entity International Boxing Association not found. Redirects file needs to be loaded for better performance.    
Entity Ramón Ramírez not found. Redirects file needs to be loaded for better performance.   
Done validation testA :     
num_nme = 1126; num_nonexistent_ent_title = 3189    
num_nonexistent_ent_id = 0; num_nonexistent_both = 35   
num_correct_ents = 1567; num_total_ents = 4791  
Entity World Open not found. Redirects file needs to be loaded for better performance.  
Entity Douglas Young not found. Redirects file needs to be loaded for better performance.   
Entity Douglas Young not found. Redirects file needs to be loaded for better performance.   
Entity James Love not found. Redirects file needs to be loaded for better performance.  
Entity Noel Whelan not found. Redirects file needs to be loaded for better performance. 
    Done AIDA.  
num_nme = 2257; num_nonexistent_ent_title = 6255    
num_nonexistent_ent_id = 0; num_nonexistent_both = 72   
num_correct_ents = 2949; num_total_ents = 9276  

Generating train data from AIDA set     
Entity Craig Brown not found. Redirects file needs to be loaded for better performance. 
Entity International cricketers of South African origin not found. Redirects file needs to be loaded for better performance.    
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Carlos Costa not found. Redirects file needs to be loaded for better performance.    
Entity Antonio Esposito not found. Redirects file needs to be loaded for better performance.    
Entity Independence Day (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Independence Day (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Erik Hanson not found. Redirects file needs to be loaded for better performance. 
Entity Erik Hanson not found. Redirects file needs to be loaded for better performance. 
Entity Iván García not found. Redirects file needs to be loaded for better performance. 
Entity Camelot, Chesapeake, Virginia not found. Redirects file needs to be loaded for better performance.   
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Gordon Parsons not found. Redirects file needs to be loaded for better performance.  
Entity Xhosa not found. Redirects file needs to be loaded for better performance.   
Entity Xhosa not found. Redirects file needs to be loaded for better performance.   
Entity Jamaat-e-Islami not found. Redirects file needs to be loaded for better performance. 
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Franz Konrad not found. Redirects file needs to be loaded for better performance.    
Entity Ford Escort not found. Redirects file needs to be loaded for better performance. 
Entity Carlos Costa not found. Redirects file needs to be loaded for better performance.    
Entity Craig Evans not found. Redirects file needs to be loaded for better performance. 
Entity Preston not found. Redirects file needs to be loaded for better performance. 
Entity Superman (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Superman (disambiguation) not found. Redirects file needs to be loaded for better performance.   
Entity Jonathan Stark not found. Redirects file needs to be loaded for better performance.  
Entity Ashta not found. Redirects file needs to be loaded for better performance.   
Entity John Smiley not found. Redirects file needs to be loaded for better performance. 
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Michael Andersson not found. Redirects file needs to be loaded for better performance.   
Entity Oksana Grishina not found. Redirects file needs to be loaded for better performance. 
Entity Derek Ryan not found. Redirects file needs to be loaded for better performance.  
Entity Bandundu not found. Redirects file needs to be loaded for better performance.    
Entity Čelopek not found. Redirects file needs to be loaded for better performance. 
    Done AIDA.  
num_nme = 4855; num_nonexistent_ent_title = 12103   
num_nonexistent_ent_id = 0; num_nonexistent_both = 236  
num_correct_ents = 6202; num_total_ents = 18541 

Generating test data from wikipedia set     
Entity Christina (given name) not found. Redirects file needs to be loaded for better performance.  
Christina (given name)  
Entity Christina (given name) not found. Redirects file needs to be loaded for better performance.  
Christina (given name)  
Entity Kirsten not found. Redirects file needs to be loaded for better performance. 
Kirsten 
Entity Leslie Townsend not found. Redirects file needs to be loaded for better performance. 
Leslie Townsend 
Entity Leslie Townsend not found. Redirects file needs to be loaded for better performance. 
Leslie Townsend 
Entity U.S. Route 40 not found. Redirects file needs to be loaded for better performance.   
U.S. Route 40   
Entity Ashland, Louisiana not found. Redirects file needs to be loaded for better performance.  
Ashland, Louisiana  
Entity List of Farm to Market Roads in Texas (1–99) not found. Redirects file needs to be loaded for better performance.    
Farm to Market Road 16  
Entity List of Farm to Market Roads in Texas (1–99) not found. Redirects file needs to be loaded for better performance.    
Farm to Market Road 17  
Entity List of Farm to Market Roads in Texas (1–99) not found. Redirects file needs to be loaded for better performance.    
Farm to Market Road 17  
Entity List of state highways in Colorado not found. Redirects file needs to be loaded for better performance.  
State highways in Colorado  
Entity U.S. Route 40 not found. Redirects file needs to be loaded for better performance.   
U.S. Route 40   
Entity U.S. Route 40 not found. Redirects file needs to be loaded for better performance.   
U.S. Route 40   
Entity Thomas Gordon not found. Redirects file needs to be loaded for better performance.   
Thomas Gordon   
Entity Robert Crowley not found. Redirects file needs to be loaded for better performance.  
Robert Crowley  
Entity Tiger (comics) not found. Redirects file needs to be loaded for better performance.  
Tiger (comics)  
Entity Tiger (comics) not found. Redirects file needs to be loaded for better performance.  
Tiger (comics)  
Entity Underdog not found. Redirects file needs to be loaded for better performance.    
Underdogs   
Entity John Tyrrell not found. Redirects file needs to be loaded for better performance.    
John Tyrrell    
Entity List of minor DC Comics characters not found. Redirects file needs to be loaded for better performance.  
Sam Lane (comics)   
Entity Ryan Campbell not found. Redirects file needs to be loaded for better performance.   
Ryan Campbell   
Done wikipedia. 
num_nonexistent_ent_id = 21; num_correct_ents = 6800    

Generating test data from clueweb set   
Entity Mia Jones (Degrassi: The Next Generation) not found. Redirects file needs to be loaded for better performance.   
Mia Jones (Degrassi: The Next Generation)   
Entity World not found. Redirects file needs to be loaded for better performance.   
World   
Entity The Lord of the Rings: The Fellowship of the Ring not found. Redirects file needs to be loaded for better performance.   
The Lord of the Rings: The Fellowship of the Ring   
Entity Anthrax (disambiguation) not found. Redirects file needs to be loaded for better performance.    
Anthrax (band)  
Entity Anthrax (disambiguation) not found. Redirects file needs to be loaded for better performance.    
Anthrax (band)  
Entity World not found. Redirects file needs to be loaded for better performance.   
World   
Entity Cosmos: A Personal Voyage not found. Redirects file needs to be loaded for better performance.   
Cosmos: A Personal Voyage   
Entity Jumeirah Village not found. Redirects file needs to be loaded for better performance.    
Jumeirah Village    
Done clueweb.   
num_nonexistent_ent_id = 8; num_correct_ents = 11146    

Generating test data from ace2004 set   
Entity Lujaizui not found. Redirects file needs to be loaded for better performance.    
Lujaizui    
Done ace2004.   
num_nonexistent_ent_id = 1; num_correct_ents = 256  

Generating test data from msnbc set     
Done msnbc. 
num_nonexistent_ent_id = 0; num_correct_ents = 656  

Generating test data from aquaint set   
Entity List of radio stations in Nicaragua not found. Redirects file needs to be loaded for better performance. 
List of radio stations in Nicaragua 
Entity List of newspapers in India not found. Redirects file needs to be loaded for better performance. 
List of newspapers in India 
Entity List of fatal bear attacks in North America by decade not found. Redirects file needs to be loaded for better performance.   
List of fatal bear attacks in North America by decade   
Entity Federated State not found. Redirects file needs to be loaded for better performance. 
Federated State 
Entity List of national legal systems not found. Redirects file needs to be loaded for better performance.  
List of national legal systems  
Entity Tender not found. Redirects file needs to be loaded for better performance.  
Tender  
Entity David Richardson not found. Redirects file needs to be loaded for better performance.    
David Richardson    
Done aquaint.   
num_nonexistent_ent_id = 7; num_correct_ents = 720

Is this the expected workaround? If basic_data.zip is corrupted, why could other people pass step 10?
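
For anyone who would rather fix the extracted files in place than re-download WNED: the `#U00f3` in the basic_data.zip name looks like an escaped form of the code point U+00F3 (`ó`), the kind of escape some unzip tools seem to emit when the active locale cannot represent a character. Below is a minimal Python sketch, assuming every escape is `#U` followed by exactly four hex digits. The names `decode_hash_u` and `rename_escaped_files` are hypothetical helpers, not part of deep-ed.

```python
import os
import re

# Assumption: escapes have the form "#U" + exactly four hex digits (e.g. "#U00f3").
_ESCAPE = re.compile(r"#U([0-9a-fA-F]{4})")


def decode_hash_u(name):
    """Replace each #UXXXX escape with the Unicode character it encodes."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), name)


def rename_escaped_files(directory, dry_run=True):
    """Find files in `directory` whose names contain #UXXXX escapes.

    Returns a sorted list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed, so the mapping can be inspected first.
    """
    renamed = []
    for old_name in sorted(os.listdir(directory)):
        new_name = decode_hash_u(old_name)
        if new_name == old_name:
            continue  # no escape in this name
        if not dry_run:
            os.rename(os.path.join(directory, old_name),
                      os.path.join(directory, new_name))
        renamed.append((old_name, new_name))
    return renamed
```

Calling `rename_escaped_files(path)` with the default `dry_run=True` only reports what would change; pass `dry_run=False` to actually rename. This has not been checked against every file in the archive, so comparing the result with the original WNED names is still the safer route.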

titsuki commented 5 years ago

I also confirmed the stats are correct.

stats.sh:

wc -l < wned-ace2004.csv
# 257
grep -cP 'GT:\t-1' wned-ace2004.csv
# 20
grep -cP 'GT:\t1,' wned-ace2004.csv
# 217

wc -l < wned-aquaint.csv
# 727
grep -cP 'GT:\t-1' wned-aquaint.csv
# 33
grep -cP 'GT:\t1,' wned-aquaint.csv
# 604

wc -l < wned-msnbc.csv
# 656
grep -cP 'GT:\t-1' wned-msnbc.csv
# 22
grep -cP 'GT:\t1,' wned-msnbc.csv
# 496

# bash stats.sh
257
20
217
727
33
604
656
22
496
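
The same three counts can be collected in one pass per file. Here is a rough Python equivalent of stats.sh, assuming the CSV files are UTF-8 encoded. `gt_stats` and `gt_stats_for_file` are hypothetical helper names, and the substring tests simply mirror the two grep patterns above.

```python
def gt_stats(lines):
    """Return (total_lines, gt_minus_one, gt_first) for an iterable of lines."""
    total = gt_minus_one = gt_first = 0
    for line in lines:
        total += 1
        if "GT:\t-1" in line:  # same substring match as grep -P 'GT:\t-1'
            gt_minus_one += 1
        if "GT:\t1," in line:  # same substring match as grep -P 'GT:\t1,'
            gt_first += 1
    return total, gt_minus_one, gt_first


def gt_stats_for_file(path):
    # newline="\n" splits only on "\n", as wc -l does. Unlike wc -l, a final
    # line without a trailing newline is still counted here.
    # Assumption: UTF-8 input; undecodable bytes are replaced instead of raising.
    with open(path, encoding="utf-8", errors="replace", newline="\n") as handle:
        return gt_stats(handle)
```

If the sketch agrees with the shell version, `gt_stats_for_file("wned-ace2004.csv")` would be expected to give the three numbers reported above for that file, provided the file ends with a newline.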
titsuki commented 5 years ago

@octavian-ganea Thanks for your response. With this workaround I was able to get through step 17, so I'll close this issue.

Note

My environment was as follows:

I also ran into issue #17 ( https://github.com/dalab/deep-ed/issues/17 ) and deleted the assertions.