@nimasadri11 This error seems to indicate that the number of generated prediction entries (qid, docid pairs) is less than expected. I wonder if the .tfrecord preparation on the dev set was somehow interrupted?
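One quick way to confirm (my own suggestion, not from the thread) is to count the pairs actually written. The sketch below assumes the predictions end up in a TREC-style run file ("qid Q0 docid rank score tag"); the file name is hypothetical.

```python
# Count the (qid, docid) pairs actually written to the dev predictions.
# Assumes a TREC-style run file ("qid Q0 docid rank score tag"); the
# path is hypothetical -- point it at your real output file.
from collections import defaultdict

run_path = "dev.run"  # hypothetical path

docs_per_qid = defaultdict(set)
with open(run_path) as f:
    for line in f:
        qid, _, docid = line.split()[:3]
        docs_per_qid[qid].add(docid)

total = sum(len(docs) for docs in docs_per_qid.values())
print(f"{len(docs_per_qid)} qids, {total} (qid, docid) pairs")
# Every qid should have one entry per candidate document (1000 by default,
# or 100 with threshold=100); a shortfall points to interrupted preparation.
```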
@crystina-z Gotcha, thank you! I'm testing again, and hopefully I'll close this issue in the next 36 hours if I don't hit the same issue again :)
@crystina-z I was training this and it got stuck on .tfrecord number 42. The process was still running but wasn't generating any logs. Also, isn't MSMARCO supposed to take at most 20 hours with these resources and parameters? Mine took over 36 hours and didn't finish.
```bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:v100l:4
#SBATCH --ntasks-per-node=1
#SBATCH --mem=48GB
#SBATCH --cpus-per-task=8

export CUDA_VISIBLE_DEVICES=0,1,2,3
```
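As a quick sanity check (my addition, assuming TensorFlow is installed, which the .tfrecord pipeline suggests), you can confirm the exported variable takes effect and all four GPUs are visible before launching training:

```python
# Confirm the exported CUDA_VISIBLE_DEVICES setting takes effect.
# Assumes TensorFlow is installed (the .tfrecord pipeline suggests it).
import os
import tensorflow as tf

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))
# Expect four entries; zero means training is silently running on CPU,
# which would explain wall-clock times far beyond the ~20 h estimate.
```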
configs from config_msmarco.txt, and:
```bash
lr=1e-3
bertlr=2e-5
itersize=30000
#itersize=30
warmupsteps=3000
#warmupsteps=3
decaystep=$itersize  # either the same as $itersize, or 0
decaytype=linear
```
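For reference, here is a minimal sketch of the schedule these values describe as I read them: linear warmup to the peak rate over warmupsteps, then linear decay over decaystep. This is only an illustration of the config semantics, not Capreolus's actual scheduler code.

```python
def lr_at_step(step, peak_lr=2e-5, warmupsteps=3000, decaystep=30000):
    """Linear warmup then linear decay, as the settings above suggest.

    An illustrative reading of the config, not Capreolus's scheduler.
    """
    if step < warmupsteps:
        # ramp linearly from ~0 up to the peak rate
        return peak_lr * (step + 1) / warmupsteps
    # then decay linearly toward 0 over decaystep steps
    remaining = max(0.0, 1.0 - (step - warmupsteps) / decaystep)
    return peak_lr * remaining

# e.g. with bertlr=2e-5: peak at step 3000, decayed to 0 by step 33000
```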
@nimasadri11 If the .tfrecord files are numbered 1 to 300+, they should be the .tfrecords for the dev set, and training had not actually started yet. As for why it got stuck: is there any possibility that the disk is full, so nothing more could be written to the log file? This can happen if CAPREOLUS_CACHE and CAPREOLUS_RESULTS are unset, since all files are then stored under ~/.capreolus.
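A small diagnostic sketch along these lines (my own addition): it checks which directory Capreolus is writing to, how much space is free there, and whether the .tfrecord files are still growing. The ~/.capreolus default comes from the comment above; the CAPREOLUS_CACHE fallback and the glob pattern are assumptions about the layout.

```python
# Check where Capreolus is writing, how full that disk is, and whether
# the .tfrecord files are still growing. The ~/.capreolus default comes
# from the comment above; the CAPREOLUS_CACHE fallback and the glob are
# assumptions about the layout -- adjust to your setup.
import os
import shutil
import time
from pathlib import Path

cache = Path(os.environ.get("CAPREOLUS_CACHE", Path.home() / ".capreolus"))
total, used, free = shutil.disk_usage(cache)
print(f"cache dir: {cache}  free: {free / 2**30:.1f} GiB")

recent = sorted(cache.rglob("*.tfrecord"), key=lambda p: p.stat().st_mtime)[-5:]
for p in recent:
    age = time.time() - p.stat().st_mtime
    print(f"{p.name}: {p.stat().st_size / 2**20:.1f} MiB, modified {age:.0f}s ago")
# Near-zero free space, or a newest .tfrecord that stopped being modified
# long ago, points at the data-preparation/writing side of the stall.
```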
Though I have found that disk access on CC is somewhat slow. If you want to quickly check whether any setting is wrong, you can rerank only the top 100 rather than the top 1k documents by adding threshold=100. That greatly reduces data preparation and inference time. For an additional speedup, you could try turning mixed precision on via capreolus.reranker.trainer.amp=True.
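To put rough numbers on that (my own back-of-the-envelope, not from the thread): the MS MARCO passage dev-small set has 6,980 queries, and reranking cost scales linearly with the number of candidates per query. The per-pair latency below is a placeholder, not a measurement.

```python
# Back-of-the-envelope reranking cost as a function of the threshold.
# 6,980 queries is the MS MARCO passage dev-small set; the per-pair
# scoring time is a placeholder, not a measured number.
num_dev_queries = 6980
secs_per_pair = 0.005  # hypothetical

for threshold in (1000, 100):
    pairs = num_dev_queries * threshold
    print(f"threshold={threshold}: {pairs:,} pairs, ~{pairs * secs_per_pair / 3600:.1f} h")
# Cost scales linearly with the threshold, so top-100 is ~10x cheaper than
# top-1k -- enough for a quick correctness check before a full run.
```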
Awesome! I will give this a try and close this issue. Thank you @crystina-z
I am trying to replicate the MSMARCO results using the feature/msmarco_psg branch. However, I get the error below. It seems to be related to #52. Has anyone else encountered this?