capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

TPU config options #149

Closed: Tooba-ts1700550 closed this issue 3 years ago

Tooba-ts1700550 commented 3 years ago

I am trying to use the TPU in Colab to run this code, using the Python code in the example notebook. The docs say to set the config options tpuname, tpuzone, and storage to run on a TPU. Can you please give an example of where to set these config params?

Thank you.

andrewyates commented 3 years ago
Tooba-ts1700550 commented 3 years ago

Thank you for your reply. I can't figure out how to find the TPU name and zone in Google Colab. I think I'm doing something wrong here:

!capreolus rerank.traineval with \
  rank.searcher.index.stemmer=porter benchmark.name=nf \
  rank.searcher.name=BM25RM3 \
  rank.optimize=recall_1000 reranker.name=KNRM reranker.trainer.niters=2 optimize=P_20 \
  reranker.trainer.tpuname=mytpu1 reranker.trainer.tpuzone=us-central1-f

I received this error: profane.exceptions.InvalidConfigError: received unknown config key: tpuname

andrewyates commented 3 years ago

Regarding the InvalidConfigError, this happens because you're using the PyTorch implementation of KNRM rather than the TensorFlow one. You can change this by setting reranker.name=TFKNRM. The TPU docs list a few other rerankers with TF support. Just so you know, in our tests KNRM was actually faster on GPU than on TPU: it's a very small model, so the TPU communication overhead overwhelms any performance gain.
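For reference, your earlier command with the TensorFlow reranker swapped in would look something like this (all other options unchanged):

!capreolus rerank.traineval with \
  rank.searcher.index.stemmer=porter benchmark.name=nf \
  rank.searcher.name=BM25RM3 \
  rank.optimize=recall_1000 reranker.name=TFKNRM reranker.trainer.niters=2 optimize=P_20 \
  reranker.trainer.tpuname=mytpu1 reranker.trainer.tpuzone=us-central1-f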

I'm not very familiar with Colab TPUs, but based on this it looks like Colab provides a TPU address rather than a name.

If you want to try making that change yourself, this is where to do it: https://github.com/capreolus-ir/capreolus/blob/master/capreolus/trainer/tensorflow.py#L75

I think the updated code would look something like this:

if self.config["tpuname"] == "COLAB" and self.config["tpuzone"] == "COLAB":
    logger.debug("connecting to Colab TPU at %s", os.environ["COLAB_TPU_ADDR"])
    self.tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
else:
    self.tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=self.config["tpuname"], zone=self.config["tpuzone"])
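If you want to sanity-check the Colab TPU connection outside of Capreolus first, a minimal standalone version of the same resolver logic looks something like this (standard TF 2.x TPU initialization; this is just for verification and isn't part of the patch):

import os
import tensorflow as tf

# Resolve the Colab TPU from its gRPC address, then connect and initialize it
tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="grpc://" + os.environ["COLAB_TPU_ADDR"])
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
print("TPU devices:", tf.config.list_logical_devices("TPU"))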

Alternatively, I could make the change myself, but it will be a week or so before I have time to make and test it.

Note that you also need to specify a GCS storage bucket, such as reranker.trainer.storage=gs://your-bucket/abc/. You'll need to give Colab write access to the GCS bucket that you create; I don't know the best way to do this offhand. The Stack Overflow page I linked mentions one way to find the service account being used by Colab, so one option is to find that service account and then manually grant it access to the bucket.
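For example, once you've found the service account, granting it access with gsutil would look something like this (the account email and bucket name below are placeholders):

gsutil iam ch serviceAccount:colab-service-account@example.iam.gserviceaccount.com:roles/storage.objectAdmin gs://your-bucket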

Tooba-ts1700550 commented 3 years ago

Thank you very much for your detailed reply. I tried to make the fix you suggested. It recognizes the Colab TPU, but I get the following error for storage: ValueError: invalid 'key=value' pair: reranker.trainer.storage=

This is my line of code: reranker.trainer.tpuname="COLAB" reranker.trainer.tpuzone="COLAB" reranker.trainer.storage= "gs://capreolus-bucket/cap-results-parade/" \

I have the GCS bucket already, and the path is correct to my knowledge.

andrewyates commented 3 years ago

I think this is due to the extra space between reranker.trainer.storage= and "gs://.... This should work: reranker.trainer.tpuname="COLAB" reranker.trainer.tpuzone="COLAB" reranker.trainer.storage="gs://capreolus-bucket/cap-results-parade/"
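Putting the whole thread together, the full Colab command would look something like this (with your bucket path and TFKNRM as discussed above):

!capreolus rerank.traineval with \
  rank.searcher.index.stemmer=porter benchmark.name=nf \
  rank.searcher.name=BM25RM3 \
  rank.optimize=recall_1000 reranker.name=TFKNRM reranker.trainer.niters=2 optimize=P_20 \
  reranker.trainer.tpuname="COLAB" reranker.trainer.tpuzone="COLAB" \
  reranker.trainer.storage="gs://capreolus-bucket/cap-results-parade/"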