OctoberChang / X-Transformer

X-Transformer: Taming Pretrained Transformers for eXtreme Multi-label Text Classification
BSD 3-Clause "New" or "Revised" License

Issue with training stage #3

Closed: simonlevine closed this issue 4 years ago

simonlevine commented 4 years ago

Hi, when trying to run the pipeline on new data, I keep getting this error:

File "xbert/transformer.py", line 561, in train labels = np.array(C_trn[inst_idx].toarray()) ... IndexError: index (17407) out of range

In other words, C_trn is being indexed at a row that is out of range. Is there a way to fix this? The pipeline runs fine on Eurlex-4k. One thing I noticed: the "Num examples" count printed during training does not equal the number of rows of my C_trn, whereas for Eurlex-4k the printed "Num examples" equals C_trn's row dimension. Thanks!

simonlevine commented 4 years ago

Edit: I've found the culprit in the code, but I still can't figure out an exact fix. C_trn and X_trn should have the same number of instances (i.e., len(X_trn) should equal the number of rows of C_trn), yet for some reason they don't when, for instance, X.trn.bert.128.pkl is generated. I'm really lost on this one; any help is much appreciated. Thanks.
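For anyone hitting the same thing, here's the kind of sanity check I used to surface the mismatch (paths are from my setup and may differ; I'm assuming the pickle holds one encoded instance per training example):

```python
import pickle
import scipy.sparse as smat

# Hypothetical paths -- adjust to your own proc_data layout.
with open("proc_data/X.trn.bert.128.pkl", "rb") as fin:
    X_trn = pickle.load(fin)  # encoded training instances

C_trn = smat.load_npz("proc_data/C.trn.npz")  # instance-to-cluster matrix

# Both should have one entry/row per training instance.
print(len(X_trn), C_trn.shape[0])
assert len(X_trn) == C_trn.shape[0], "instance counts disagree"
```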

OctoberChang commented 4 years ago

I expect X_trn, Y_trn, and C_trn to have the following shapes:

  • X_trn.shape = (n_trn, d_tfidf)
  • Y_trn.shape = (n_trn, n_label)
  • C_trn.shape = (n_trn, n_cluster)

where X_trn, Y_trn, and C_trn all have the same number of instances, n_trn.

If this is not the case, I suggest you check https://github.com/OctoberChang/X-Transformer/blob/master/xbert/preprocess.py#L262
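A quick check along these lines (file locations assumed; adjust to your proc_data directory) will tell you whether the instance counts agree:

```python
import scipy.sparse as smat

# Assumed file names -- point these at your preprocessed matrices.
X_trn = smat.load_npz("proc_data/X.trn.npz")  # (n_trn, d_tfidf)
Y_trn = smat.load_npz("proc_data/Y.trn.npz")  # (n_trn, n_label)
C_trn = smat.load_npz("proc_data/C.trn.npz")  # (n_trn, n_cluster)

# All three must share the same leading dimension n_trn.
shapes = [X_trn.shape, Y_trn.shape, C_trn.shape]
assert len({s[0] for s in shapes}) == 1, f"n_trn mismatch: {shapes}"
```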

simonlevine commented 4 years ago

> I expect X_trn, Y_trn, and C_trn to have the following shapes:
>
>   • X_trn.shape = (n_trn, d_tfidf)
>   • Y_trn.shape = (n_trn, n_label)
>   • C_trn.shape = (n_trn, n_cluster)
>
> where X_trn, Y_trn, and C_trn all have the same number of instances, n_trn.
>
> If this is not the case, I suggest you check https://github.com/OctoberChang/X-Transformer/blob/master/xbert/preprocess.py#L262

Hey there, we fixed it! It turned out that our preprocessed data needed to be filtered for newline characters within each line: instances containing embedded newlines were split across multiple lines of the raw text file, so the instance count no longer matched the number of rows in the label and cluster matrices. If you'd like, I'll drop a PR with an updated function for xbert.preprocess. This was a tiny fix but ended up being a major headache. Appreciate your hard work, one tartan to another :)
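For anyone curious, the fix boiled down to collapsing embedded newlines so every instance stays on exactly one line of the raw corpus file. A rough sketch of the cleaning step (the function name is ours, not part of xbert.preprocess):

```python
def collapse_newlines(raw_texts):
    """Collapse embedded newlines (and any run of whitespace) so each
    instance occupies exactly one line in the output .txt file."""
    return [" ".join(text.split()) for text in raw_texts]

# An instance with an embedded newline would otherwise be written as two
# lines, inflating the text line count past the row count of Y_trn/C_trn.
texts = ["first doc\nwith a stray newline", "second doc"]
print(collapse_newlines(texts))
# -> ['first doc with a stray newline', 'second doc']
```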

OctoberChang commented 4 years ago

Glad you figured it out. Yes, please drop a PR; it will be very useful for other interested users. Thanks.