amzn / pecos

PECOS - Prediction for Enormous and Correlated Spaces
https://libpecos.org/
Apache License 2.0

A problem when using the pecos model to train xtransformer #218

Open · xiaokening opened this issue 1 year ago

xiaokening commented 1 year ago

Description

When I train an XTransformer model with PECOS, a training error occurs in the matcher stage. The dataset has 108,457 instances and the hierarchical label tree is [32, 1102]. Training the first layer of the label tree works fine, but while training the second layer, after matcher fine-tuning completed, the run got stuck when predicting on the training data; see pecos.xmc.xtransformer.matcher.

I think this is caused by my training data set being too large, so I modified this call in pecos.xmc.xtransformer.matcher to pass max_pred_chunk:

P_trn, inst_embeddings = matcher.predict(
                prob.X_text,
                csr_codes=csr_codes,
                pred_params=pred_params,
                batch_size=train_params.batch_size,
                batch_gen_workers=train_params.batch_gen_workers,
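                # max_pred_chunk added by me to cap how many instances are predicted per call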
                max_pred_chunk=30000,
            )

But then another problem occurred; see the training log below.

05/08/2023 10:31:56 - INFO - pecos.xmc.xtransformer.matcher - Reload the best checkpoint from /tmp/tmp0kdzh7n5
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.31423333333333
05/08/2023 10:31:58 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.matcher - Predict with csr_codes_next((30000, 1102)) with avr_nnz=172.2335
05/08/2023 10:32:29 - INFO - pecos.xmc.xtransformer.module - Constructed XMCTextTensorizer, tokenized=True, len=30000
Traceback (most recent call last):
  File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/nlp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 564, in <module>
    do_train(args)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/train.py", line 548, in do_train
    xtf = XTransformer.train(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/model.py", line 447, in train
    res_dict = TransformerMatcher.train(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 1402, in train
    P_trn, inst_embeddings = matcher.predict(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 662, in predict
    cur_P, cur_embedding = self._predict(
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/pecos/xmc/xtransformer/matcher.py", line 812, in _predict
    cur_act_labels = csr_codes_next[inputs["instance_number"].cpu()]
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 47, in __getitem__
    row, col = self._validate_indices(key)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 159, in _validate_indices
    row = self._asindices(row, M)
  File "/opt/conda/envs/nlp/lib/python3.8/site-packages/scipy/sparse/_index.py", line 191, in _asindices
    raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (30255) out of range

I'm not sure whether this is a bug; can you give me some advice? Thanks!

Environment

jiong-zhang commented 1 year ago

Hi xiaokening, the issue is that the pre-tensorized prob.X_text contains instance indices larger than the partitioned chunk size (30000). This should not happen if prob.X_text is not tensorized (i.e., it is a list of str).

If you want to manually chunk the prediction, one simple workaround is to turn off train_params.pre_tokenize so that every chunk of data is tensorized independently.
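For example, roughly something like this (a sketch only, not tested: the exact parameter nesting may differ across PECOS versions, and prob / cluster_chain stand in for your own text problem and label tree):

from pecos.xmc.xtransformer.model import XTransformer

# Sketch: disable pre-tokenization so each prediction chunk is tensorized on
# the fly instead of indexing into a single pre-tensorized prob.X_text.
# Assumption: pre_tokenize sits on the per-layer matcher train params.
train_params = XTransformer.TrainParams.from_dict(
    {"matcher_params_chain": {"pre_tokenize": False}},
    recursive=True,
)

xtf = XTransformer.train(
    prob,                      # MLProblemWithText with raw text (list of str)
    clustering=cluster_chain,  # hierarchical label tree, e.g. [32, 1102]
    train_params=train_params,
)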

xiaokening commented 1 year ago

thanks! @jiong-zhang

xiaokening commented 10 months ago

@jiong-zhang When I train XTransformer with PECOS, the same training error occurs in the matcher stage. At first I thought my data volume was too large, but the problem still appears even after I increased the memory. It can occur at any matcher stage (I do not manually chunk the prediction).

I used the top and free commands to monitor the running program and noticed that the number of processes suddenly increased and then disappeared. I suspect it is a problem with the DataLoader; you can refer to this link.

Note: after matcher fine-tuning completed, it got stuck when predicting the training data at the very first step; see pecos.xmc.xtransformer.matcher.
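To rule out the DataLoader workers, here is a rough sketch of what I plan to try next (an assumption on my side: batch_gen_workers ends up as the number of DataLoader worker processes):

from pecos.xmc.xtransformer.model import XTransformer

# Sketch: force single-process data loading so a crashed worker process
# cannot be the cause of the hang. Assumption: batch_gen_workers is forwarded
# to the torch DataLoader as its number of workers.
train_params = XTransformer.TrainParams.from_dict(
    {"matcher_params_chain": {"batch_gen_workers": 0}},
    recursive=True,
)
# then call XTransformer.train with these train_params, as in the sketch above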

Can you give me some advice? Thanks!

Environment