Open xiaokening opened 1 year ago
Hi xiaokening, the issue is caused by the pre-tensorized prob.X_text
containing instance indices larger than the partitioned chunk size (30000). This should not happen if prob.X_text
is not tensorized (i.e., it is a list of str).
If you want to truncate prediction manually, one simple workaround is to turn off train_params.pre_tokenize
so that every chunk of data is tensorized independently.
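To illustrate the mismatch described above, here is a minimal toy sketch in plain Python. It is not PECOS code: the names `chunk_bounds`, `lookup_with_global_ids`, and `lookup_with_local_ids` are hypothetical, a list of strings stands in for the rows of `csr_codes_next`, and a chunk size of 4 stands in for 30000. It assumes that pre-tensorized data carries instance numbers assigned over the whole dataset, while each chunk's matrix only has as many rows as the chunk.

```python
from typing import List, Sequence

CHUNK_SIZE = 4  # toy stand-in for the 30000-row chunks seen in the log


def chunk_bounds(n_rows: int, chunk_size: int) -> List[range]:
    """Split row indices 0..n_rows-1 into consecutive chunks."""
    return [
        range(start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


def lookup_with_global_ids(chunk_rows: Sequence[str], instance_ids: Sequence[int]) -> List[str]:
    """Index a chunk with dataset-wide instance numbers (the failing pattern)."""
    return [chunk_rows[i] for i in instance_ids]


def lookup_with_local_ids(
    chunk_rows: Sequence[str], instance_ids: Sequence[int], chunk_start: int
) -> List[str]:
    """Re-base instance numbers to the chunk before indexing."""
    return [chunk_rows[i - chunk_start] for i in instance_ids]


# Hypothetical data: one "row" per training instance.
all_rows = [f"labels-for-instance-{i}" for i in range(10)]

results: List[str] = []
failed_chunks: List[int] = []
for bounds in chunk_bounds(len(all_rows), CHUNK_SIZE):
    chunk_rows = all_rows[bounds.start:bounds.stop]
    global_ids = list(bounds)  # instance numbers assigned before chunking
    try:
        lookup_with_global_ids(chunk_rows, global_ids)
    except IndexError:
        # Every chunk after the first sees ids >= its own row count.
        failed_chunks.append(bounds.start)
    results.extend(lookup_with_local_ids(chunk_rows, global_ids, bounds.start))
```

In this toy run the first chunk succeeds and the later ones fail, which matches the log in the description, where the first 30000-row prediction completes and the second raises the IndexError. Tensorizing each chunk independently has the same effect as the re-basing step here, because instance numbers then start from zero within every chunk.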
thanks! @jiong-zhang
@jiong-zhang When I train an xtransformer model with pecos, the same training error occurs in the matcher stage. At first I thought my data volume was too large, but the problem still appears after I increased the memory. It can occur in any matcher stage (I do not manually truncate prediction).
I used the top and free commands to monitor the program while it ran, and found that the number of processes suddenly increased and then dropped again. I suspect it is a problem with the dataloader. You can refer to this link
Note: after the matcher fine-tuning completed, it got stuck at the first step of predicting the training data; see pecos.xmc.xtransformer.matcher
Can you give me some advice? Thanks
Description
When I train an xtransformer model with pecos, a training error occurs in the matcher stage. The dataset has 108457 instances, and the hierarchical label tree is [32, 1102]. In the matcher stage, training the first layer of the label tree works without problems. When training the second layer, after the matcher fine-tuning completed, it got stuck when predicting the training data; see pecos.xmc.xtransformer.matcher
I thought this was caused by my training dataset being too large, so I modified the code snippet in pecos.xmc.xtransformer.matcher.
But then another problem occurred; see the training log below.
05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in
do_train(args)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 548, in do_train
xtf = XTransformer.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 447, in train
res_dict = TransformerMatcher.train(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 1402, in train
P_trn, inst_embeddings = matcher.predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 662, in predict
cur_P, cur_embedding = self._predict(
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 812, in _predict
cur_act_labels = csr_codes_next[inputs["instance_number"].cpu()]
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
row, col = self._validate_indices(key)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 159, in _validate_indices
row = self._asindices(row, M)
File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 191, in _asindices
raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (30255) out of range
I'm not sure if this is a bug, can you give me some advice? Thanks!
Environment