Georgetown-IR-Lab / OpenNIR

An end-to-end neural ad-hoc ranking pipeline.
https://opennir.net
MIT License

Problems setting up pipeline #18

Closed DavidHalman closed 4 years ago

DavidHalman commented 4 years ago

Hi, I'm trying to set up OpenNIR for a school project, but I'm running into some issues. I've included a picture of the error messages; has anyone run into something like this? Googling didn't help me.

[screenshot of the error messages]

seanmacavaney commented 4 years ago

It looks like you're using the "dummy" dataset. Do you get the same issue when running with a different dataset, e.g., config/antique?

DavidHalman commented 4 years ago

I get the same error when using the following command. I believe the error occurs after attempting to download the antique dataset.

bash scripts/pipeline.sh config/antique config/vanilla_bert

[screenshot of the error output]

seanmacavaney commented 4 years ago

Gotcha. Can you try changing this line to return 1? https://github.com/Georgetown-IR-Lab/OpenNIR/blob/master/onir/util/concurrency.py#L11
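
For context, that line controls how many worker threads the pipeline uses, so hard-coding it to 1 forces everything onto a single thread. Roughly, the edit looks like this (an illustrative sketch only; the function name here is made up, and the actual body in concurrency.py differs):

    # onir/util/concurrency.py -- illustrative sketch, not the exact file
    def _thread_count():  # hypothetical name for the function at L11
        return 1  # was: a computed thread count; forces single-threaded runs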

DavidHalman commented 4 years ago

Ok, I made the suggested change; now running the same command gives the following:

[screenshot of the new error output]

I tried deleting /data/ entirely and running again, thinking the previous failed attempts might have corrupted the data, but I got the same error.

seanmacavaney commented 4 years ago

My guess is that it's a race condition. I've only run this in multi-processor environments, so your setup may be hitting a path I haven't tested; let's try taking all the threading out.

Can you try replacing:

                doc_iters = util.blocking_tee(doc_iter, len(needs_docs))
                for idx, it in zip(needs_docs, doc_iters):
                    stack.enter_context(util.CtxtThread(functools.partial(idx.build, it)))

with

                import itertools
                # itertools.tee buffers the shared document stream so each
                # index can consume its own copy in turn, without threads
                doc_iters = itertools.tee(doc_iter, len(needs_docs))
                for idx, it in zip(needs_docs, doc_iters):
                    idx.build(it)  # build each index sequentially

here: https://github.com/Georgetown-IR-Lab/OpenNIR/blob/1a476aa8ef834c385099647b4a9fb3e10a52a1ec/onir/datasets/index_backed.py#L130

Note that this will no longer stream and index the documents in parallel; itertools.tee buffers everything in memory instead. So while this should work for a smaller collection like ANTIQUE, you may hit out-of-memory errors on larger collections. If that happens, we can probably adjust further to make multiple passes over the dataset.
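
If we do hit that, the multi-pass version would look roughly like this (a sketch only; fresh_doc_iter is a hypothetical zero-argument factory that re-opens the collection, standing in for however index_backed.py actually constructs doc_iter):

    # hypothetical multi-pass variant: one full read of the collection per
    # index, so nothing has to be buffered in memory at once
    for idx in needs_docs:
        it = fresh_doc_iter()  # assumed factory returning a new doc iterator
        idx.build(it)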

DavidHalman commented 4 years ago

Thank you so much! Seems to be working now.

[screenshot of the successful run]

Quick question: I don't currently have CUDA installed, even though the machine does have a GPU. A prompt said that installing CUDA would speed up training. Do you have a rough estimate of how much faster training is with a GPU?

seanmacavaney commented 4 years ago

Awesome, glad we got it working. I'll open a new issue summarizing these problems. The speedup depends on the particular GPU, dataset, and ranking architecture, but in most cases you can reasonably expect training to be at least 10x faster, so I'd recommend installing CUDA.
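
Once CUDA is set up, a quick sanity check that PyTorch (which OpenNIR trains with) can actually see the GPU is:

    import torch

    # reports whether the CUDA toolkit and driver are visible to PyTorch
    if torch.cuda.is_available():
        print("CUDA device:", torch.cuda.get_device_name(0))
    else:
        print("CUDA not available; training will fall back to the CPU")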