facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Issue Downloading the Full Wikipedia Model #3329

Closed PaddyE9797 closed 3 years ago

PaddyE9797 commented 3 years ago

I've been attempting to run the tfidf_retriever model over the full Wikipedia set in interactive mode with the command shown in the docs, `parlai interactive --model tfidf_retriever -mf zoo:wikipedia_full/tfidf_retriever/model`, but when I do I get this error:

```
/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/lib/python3.8/site-packages/torch-1.7.1-py3.8-linux-x86_64.egg/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
11:03:45 | building data: /home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/data/models/wikipedia_full/tfidf_retriever/model.tgz
11:03:45 | Downloading http://parl.ai/downloads/_models/wikipedia_full/tfidf_retriever/model.tgz to /home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/data/models/wikipedia_full/tfidf_retriever/model.tgz
11:23:24 | Retried too many times, stopped retrying.
Traceback (most recent call last):
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/bin/parlai", line 11, in <module>
    load_entry_point('parlai', 'console_scripts', 'parlai')()
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/script.py", line 298, in superscript_main
    opt = parser.parse_args(args)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/params.py", line 1087, in parse_args
    self._process_args_to_opts(args)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/params.py", line 1047, in _process_args_to_opts
    self.opt[each_key] = modelzoo_path(
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/build_data.py", line 479, in modelzoo_path
    my_module.download(datapath)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/zoo/wikipedia_full/tfidf_retriever.py", line 17, in download
    download_models(opt, fnames, 'wikipedia_full', use_model_type=True)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/build_data.py", line 442, in download_models
    download(url, dpath, fname)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/build_data.py", line 210, in download
    raise RuntimeError('Connection broken too many times. Stopped retrying.')
RuntimeError: Connection broken too many times. Stopped retrying.
Downloading model.tgz:  49%|█████████████████████████████████████▌ | 3.97G/8.14G [19:39<20:36, 3.37MB/s]
```

I've gotten this error a few times, so I tried to download model.tgz manually using the URL http://parl.ai/downloads/_models/wikipedia_full/tfidf_retriever/model.tgz. But when I executed the command again I got this error:

```
/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/lib/python3.8/site-packages/torch-1.7.1-py3.8-linux-x86_64.egg/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
10:19:28 | Overriding opt["model_file"] to /home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/data/models/wikipedia_full/tfidf_retriever/model (previously: wiki_full_notitle)
10:19:28 | WARNING: Neither the specified dict file (test_ret.dict) nor themodel_file.dict file (/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/data/models/wikipedia_full/tfidf_retriever/model.dict) exists, check to make sure either is correct. This may manifest as a shape mismatch later on.
10:19:28 | Loading /home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/data/models/wikipedia_full/tfidf_retriever/model.tfidf
Traceback (most recent call last):
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/bin/parlai", line 11, in <module>
    load_entry_point('parlai', 'console_scripts', 'parlai')()
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/script.py", line 307, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/script.py", line 90, in _run_from_parser_and_opt
    return script.run()
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/scripts/interactive.py", line 117, in run
    return interactive(self.opt)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/scripts/interactive.py", line 83, in interactive
    agent = create_agent(opt, requireModelExists=True)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/agents.py", line 402, in create_agent
    model = create_agent_from_opt_file(opt)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/core/agents.py", line 355, in create_agent_from_opt_file
    return model_class(opt_from_file)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/agents/tfidf_retriever/tfidf_retriever.py", line 160, in __init__
    self.ranker = TfidfDocRanker(
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/agents/tfidf_retriever/tfidf_doc_ranker.py", line 35, in __init__
    matrix, metadata = utils.load_sparse_csr(tfidf_path)
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/ParlAI/parlai/agents/tfidf_retriever/utils.py", line 37, in load_sparse_csr
    (loader['data'], loader['indices'], loader['indptr']), shape=loader['shape']
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/lib/python3.8/site-packages/numpy-1.20.0rc2-py3.8-linux-x86_64.egg/numpy/lib/npyio.py", line 254, in __getitem__
    return format.read_array(bytes,
  File "/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_2021/env/lib/python3.8/site-packages/numpy-1.20.0rc2-py3.8-linux-x86_64.egg/numpy/lib/format.py", line 766, in read_array
    array = numpy.ndarray(count, dtype=dtype)
numpy.core._exceptions.MemoryError: Unable to allocate 8.50 GiB for an array with shape (1140833956,) and data type float64
```

I checked the model file in the data directory and its size is 0 bytes. I also tried to bypass the issue by changing the overcommit settings with echo 1 > /proc/sys/vm/overcommit_memory, but this instead caused my laptop to freeze whenever I tried to run the interactive command. Is this a memory issue? What level of GPU is required for interacting with the full Wikipedia model? I only have an integrated Intel graphics card (256M), and it says it needs an NVIDIA GPU. Is that necessary for interaction? Any help would be greatly appreciated.

SEMTEX99 commented 3 years ago

I recommend an Nvidia GPU; it speeds things up by a lot. Personally, running some of the larger models on just the CPU means expecting to wait forever. The freezing is normal on less beefy machines; I usually leave the machine on for a couple of hours to do its thing, since if you don't have a beefy PC you shouldn't expect to be able to run the bigger models. Also check your disk space to make sure you have enough. Can you run any of the smaller models normally, or is it just the Wiki model?

PaddyE9797 commented 3 years ago

I was able to run the memnn model trained on the bAbI tasks as per the ParlAI quick start (https://parl.ai/docs/tutorial_quick.html). I also ran the blender_90M model from this page: https://parl.ai/projects/recipes/. I wanted to use the wiki model with the chat service to create a knowledge-based retrieval bot a user could interact with through a messenger interface, and fine-tune the model where appropriate. I'm using a laptop that dual-boots Windows and Linux, so the drive is partitioned; on Linux it's split between my root and home directories. I have about 98 GB on my hard drive (root) and 83 GB of it is available. Home has the same amount but only 54 GB available. I'm assuming I would need more than that. What is the recommended amount for using larger models?

stephenroller commented 3 years ago

That's more than enough space for this model.

It seems like there are connection issues, or something interrupting the download. Are you able to download the file with wget?
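If the built-in downloader keeps dropping the connection, `wget -c` can resume a partial download instead of restarting from zero. A sketch, using the URL from the log above and assuming you run it from the ParlAI root so the archive lands where the zoo expects it:

```shell
# -c resumes a partially downloaded file; -P sets the target directory
wget -c http://parl.ai/downloads/_models/wikipedia_full/tfidf_retriever/model.tgz \
    -P data/models/wikipedia_full/tfidf_retriever/

# Extract in place so ParlAI finds model, model.db, model.opt, model.tfidf.npz
tar -xzvf data/models/wikipedia_full/tfidf_retriever/model.tgz \
    -C data/models/wikipedia_full/tfidf_retriever/
```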

PaddyE9797 commented 3 years ago

Yes, I was able to download model.tgz using wget and the browser. When it's extracted, though, the model file is 0 bytes. Is that supposed to happen? If so, should I just overcommit the memory and leave the computer to run the model even if it freezes?

stephenroller commented 3 years ago

No, the model file should be a couple gigabytes after extracting. Something seems wrong locally.

PaddyE9797 commented 3 years ago

I was able to change my connection and download the model from scratch directly using `parlai interactive --model tfidf_retriever -mf zoo:wikipedia_full/tfidf_retriever/model`, but the model file is still 0 bytes. Not sure what the problem is. I've tried downloading it in different ways, but it's still the same result. Do you know what could potentially be causing the problem? Has something like this happened before?

stephenroller commented 3 years ago

Just looked at my own copy of data/models/wikipedia_full/tfidf_retriever. Here are the sizes and md5sums. Can you confirm yours all match:

```
tfidf_retriever $ pwd
/private/home/roller/working/parlai/data/models/wikipedia_full/tfidf_retrieve

tfidf_retriever $ du -hcs *
0       model
14G     model.db
2.0K    model.opt
13G     model.tfidf.npz
26G     total

tfidf_retriever $ md5sum *
d41d8cd98f00b204e9800998ecf8427e  model
da09896b20c5ffd5ceaa3be9c7779d1f  model.db
1cfb5d25bf26d87b7988cbde7e883a0e  model.opt
fca0d7ef4760dfd9b0f5a8f242b3a77d  model.tfidf.npz
```

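For anyone checking their own copy against these sums, `md5sum -c` can do the comparison in one pass. A sketch, on Linux, assuming the expected sums are pasted into a file (note the two spaces between each sum and filename):

```shell
# Expected sums, copied from the listing above
cat > expected.md5 <<'EOF'
da09896b20c5ffd5ceaa3be9c7779d1f  model.db
1cfb5d25bf26d87b7988cbde7e883a0e  model.opt
fca0d7ef4760dfd9b0f5a8f242b3a77d  model.tfidf.npz
EOF

# Recompute and compare; exits non-zero if any file mismatches
md5sum -c expected.md5
```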
PaddyE9797 commented 3 years ago

```
tfidf_retriever$ pwd
/home/patrick-easton/Documents/CSA_Project_Patrick_Easton_ParlAI/ParlAI/data/models/wikipedia_full/tfidf_retriever

tfidf_retriever$ du -hcs *
0       model
14G     model.db
4.0K    model.opt
13G     model.tfidf.npz
26G     total

tfidf_retriever$ md5sum *
d41d8cd98f00b204e9800998ecf8427e  model
da09896b20c5ffd5ceaa3be9c7779d1f  model.db
1cfb5d25bf26d87b7988cbde7e883a0e  model.opt
fca0d7ef4760dfd9b0f5a8f242b3a77d  model.tfidf.npz
```

Seems to be the same, though my model.opt file is a little larger. It could just be not enough memory; I may have to give it some time when I run the interactive command, though I did get an OOM kill message for parlai. I ran dmesg | grep -i kill after trying to execute the command, and the process had been killed after I left it for a little bit.

stephenroller commented 3 years ago

Hm, yes, it downloaded correctly then. Seems like you OOM when loading it. It probably requires a solid 32 GB of RAM.
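A quick way to see whether a machine has that kind of headroom before loading the model is to check RAM and swap up front. A minimal sketch on Linux:

```shell
# Total / used / available RAM and swap, human-readable
free -h

# The raw total, straight from the kernel
grep MemTotal /proc/meminfo
```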

stephenroller commented 3 years ago

Can you increase the size of your swap partition?

PaddyE9797 commented 3 years ago

I didn't create a swap partition when installing Ubuntu. Would it be best to create one now? What size do you recommend?

stephenroller commented 3 years ago

1x-2x your RAM is the common recommendation, I think. The Ubuntu installer should have done this for you...

I think you can make a swapfile without having to repartition but I don't recall how to do that.
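For the record, the usual Ubuntu recipe for adding a swapfile without repartitioning looks roughly like this (run as root; the 16G size is illustrative, so pick 1x-2x your RAM as suggested above):

```shell
# Allocate the file (dd works too if fallocate is unavailable)
sudo fallocate -l 16G /swapfile

sudo chmod 600 /swapfile   # swap files must not be world-readable
sudo mkswap /swapfile      # write the swap signature
sudo swapon /swapfile      # enable it immediately
swapon --show              # verify it is active

# Make it survive reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
```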

github-actions[bot] commented 3 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.