Closed simonlevine closed 4 years ago
Edit: I've found the culprit in the code but still can't figure out an exact fix. C_trn and X_trn should have the same number of instances (i.e., len(X_trn) should equal the number of rows of C_trn). However, that is not the case when, for instance, X.trn.bert.128.pkl is generated. I really am lost on this one! Any help is much appreciated. Thanks.
I expect X_trn, Y_trn, and C_trn to have the following shapes:
- X_trn.shape = (n_trn, d_tfidf)
- Y_trn.shape = (n_trn, n_label)
- C_trn.shape = (n_trn, n_cluster)

X_trn, Y_trn, and C_trn should all have the same number of instances, n_trn.
If this is not the case, I suggest you check https://github.com/OctoberChang/X-Transformer/blob/master/xbert/preprocess.py#L262
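A quick way to confirm the expectation above before training is to compare the instance counts directly. This is only a minimal sketch: `n_rows` and `check_instance_counts` are hypothetical helpers, not part of xbert, and it assumes each object either exposes a `.shape` attribute (as NumPy arrays and SciPy sparse matrices do) or supports `len()`.

```python
def n_rows(obj):
    """Return the number of instances in a matrix-like or list-like object."""
    shape = getattr(obj, "shape", None)
    # Matrices report their row count in shape[0]; plain sequences use len().
    return shape[0] if shape is not None else len(obj)


def check_instance_counts(**arrays):
    """Raise ValueError if the named objects disagree on their instance count."""
    counts = {name: n_rows(a) for name, a in arrays.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"instance counts differ: {counts}")
    return counts
```

Calling `check_instance_counts(X_trn=X_trn, Y_trn=Y_trn, C_trn=C_trn)` right after loading the files should point to whichever one has drifted from n_trn.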
Hey there, we fixed it! It turned out that our preprocessed data needed to be filtered for newline characters within each line. If you'd like, I will drop a PR with an updated function for xbert.preprocess. This was a tiny fix but ended up being a major headache. Appreciate your hard work, one tartan to another :)
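For anyone hitting the same mismatch, the kind of filtering described above might look roughly like the following. It is a sketch under assumptions, not the code from the eventual PR: `flatten_record` is a hypothetical helper, and it assumes the raw text file is expected to hold exactly one instance per line.

```python
def flatten_record(text):
    """Collapse any line breaks inside one instance's text into single spaces."""
    # str.splitlines() splits on \n, \r, \r\n and the other line boundaries
    # Python recognizes, so stray breaks of any kind are removed.
    parts = (part.strip() for part in text.splitlines())
    return " ".join(part for part in parts if part)


raw_texts = ["first document\nwith a stray break", "second document"]
clean_texts = [flatten_record(t) for t in raw_texts]

# Written one per line, the file now has exactly len(raw_texts) lines.
serialized = "\n".join(clean_texts)
```

Without this step, a document containing an embedded newline is counted as two instances when the file is read line by line, which would explain a text-derived count that no longer matches the rows of C_trn.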
Glad you figured it out. Yes, please drop a PR; it will be very useful for other interested users. Thanks.
Hi, when trying to run the pipeline on new data, I keep getting this error:
In other words, the C_trn array is being indexed at a position beyond its last row. Is there a way to fix this? I've tried with Eurlex-4k and it runs fine. I did notice the following: the "Num examples" value printed during the training run does not equal the number of rows in my C_trn array, whereas with Eurlex-4k it equals the C_trn row dimension. Thanks!