BrikerMan / Kashgari

Kashgari is a production-level NLP transfer-learning framework built on top of tf.keras for text labeling and text classification; it includes Word2Vec, BERT, and GPT2 language embeddings.
http://kashgari.readthedocs.io/
Apache License 2.0

[Question] IndexError: index 100 is out of bounds for axis 0 with size 100 when using StackEmbedding #169

Closed: ishita-gupta98 closed this issue 5 years ago

ishita-gupta98 commented 5 years ago

(I asked a similar question before in https://github.com/BrikerMan/Kashgari/issues/158; it may provide some context.) My understanding is that when using StackedEmbedding we have to ensure every layer has the same sequence length so that the layers can be concatenated.

Problem: I am running the code with the same sequence length for all the features and the embedding. The embedding works during training, but I am still getting an index-out-of-bounds error. My training results also overfit badly even though I am using a 90K dataset, so I suspect the index error may be the reason.

After modifying, my code looks like this (these are just the parts where the sequence length is involved):

import kashgari
from kashgari.embeddings import BERTEmbedding, NumericFeaturesEmbedding

# SEQUENCE LENGTH
SEQUENCE_LEN = 100

# BERT MODEL
bert_model_path = './bert/uncased_L-12_H-768_A-12'
bert_embedding = BERTEmbedding(bert_model_path,
                               task=kashgari.LABELING,
                               sequence_length=SEQUENCE_LEN)
tokenizer = bert_embedding.tokenizer

feature1_embedding = NumericFeaturesEmbedding(feature_count=1000,
                                              feature_name='feature1',
                                              sequence_length=SEQUENCE_LEN)

feature2_embedding = NumericFeaturesEmbedding(feature_count=1000,
                                              feature_name='feature2',
                                              sequence_length=SEQUENCE_LEN)

feature3_embedding = NumericFeaturesEmbedding(feature_count=1000,
                                              feature_name='feature3',
                                              sequence_length=SEQUENCE_LEN)

tokenized_text_list_train = sentences_tokenized_train

feature1_list_train = [list(train_df['feature1'])]*SEQUENCE_LEN
feature2_list_train = [list(train_df['feature2'])]*SEQUENCE_LEN
feature3_list_train = [list(train_df['feature3'])]*SEQUENCE_LEN
label_list_train = [list(train_df['label'])]*SEQUENCE_LEN

Since I have made the embedding and the features into lists of the same length, I should not be getting an index-out-of-bounds error. But when I run print(stack_embedding.embed(train_x)) I get the following error: IndexError: index 100 is out of bounds for axis 0 with size 100
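
For completeness, the stacking step is not shown above; it follows the pattern from the numeric-features documentation, roughly like this (a sketch using the variable names above, reconstructing the omitted part):

from kashgari.embeddings import StackedEmbedding

# stack the text embedding with the three numeric feature embeddings
stack_embedding = StackedEmbedding([bert_embedding,
                                    feature1_embedding,
                                    feature2_embedding,
                                    feature3_embedding])

# inputs are passed as a tuple, one entry per stacked embedding
train_x = (tokenized_text_list_train,
           feature1_list_train,
           feature2_list_train,
           feature3_list_train)
stack_embedding.analyze_corpus(train_x, label_list_train)
print(stack_embedding.embed(train_x))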

And one more small question (unrelated to this issue): the documentation mentions leaving 0 for padding. Does that mean the feature values in the dataset shouldn't be 0?
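
If it does, a sketch of the workaround I would try (assuming 0 really must stay reserved for padding, and that the feature columns are numeric):

# hypothetical workaround: shift every raw feature value up by one
# so that no real value collides with the padding index 0
train_df['feature1'] = train_df['feature1'] + 1
train_df['feature2'] = train_df['feature2'] + 1
train_df['feature3'] = train_df['feature3'] + 1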

Thank you for helping out :)

ishita-gupta98 commented 5 years ago

Update: I have more or less figured out where the error is coming from.

Since my sentences are tokenised by BERT into lists, my text list ends up as a list nested inside another list of some kind. I am not sure, but either way, if I replace the tokenised text I am feeding in with a simple list like the one in your example, ['NLP', 'Projects', 'Project', 'Name', ':'], the error goes away.

I am still confused about what this means, though. Is StackedEmbedding not compatible with BERTEmbedding sentence pairs?

BrikerMan commented 5 years ago

You need to feed the embedding layer a nested, array-like structure such as [['token1', 'token2']], whether or not you are using sentence pairs. Are you sure your data structure is right?
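
For example (a minimal sketch; the tokens, feature values, and labels here are made up):

# every input is a list of samples; every sample is a list of tokens
x_text     = [['token1', 'token2'], ['token3', 'token4', 'token5']]
# numeric features mirror the text: one value per token, per sample
x_feature1 = [[1, 2], [2, 1, 1]]
# labels for a LABELING task have the same nesting
y          = [['O', 'B'], ['O', 'O', 'B']]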

ishita-gupta98 commented 5 years ago

Thanks for the tip; I rechecked my data and corrected the input. Before this I was using the BERT tokeniser code from https://kashgari.bmio.net/embeddings/bert-embedding/#example-usage-text-classification . I think what was happening is that the code produced an array like [['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]'], ['[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']], which was giving an "unhashable type: 'list'" error. So I changed my input to the format below and the error stopped.

Though just to confirm (as I don't want to get my embedding wrong): should tokenized_text_list be of the form [['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]', '[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']]?

Update: using BERTEmbedding with analyze_corpus in StackedEmbedding gives an "unhashable type: 'list'" error. This happens when my dataset's x_train = ([['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '.', '[SEP]'], ['[CLS]', 'why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?', '[SEP]']], [[1, 2]], [[2, 1]], [[1, 2]])
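
For comparison, a shape-consistent version of that tuple would look like the sketch below (assuming the per-token layout from the numeric-features docs; the feature values are made up). The text input has two sentences of 9 and 10 tokens, so every feature input needs two inner lists of those same lengths:

# sketch of a shape-consistent x_train: one inner list per sentence,
# one feature value per token (values here are illustrative only)
x_train = (
    [['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '.', '[SEP]'],
     ['[CLS]', 'why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?', '[SEP]']],
    [[1, 2, 2, 2, 2, 2, 2, 2, 1],
     [1, 2, 2, 2, 2, 2, 2, 2, 2, 1]],   # feature1
    [[2, 1, 1, 1, 1, 1, 1, 1, 2],
     [2, 1, 1, 1, 1, 1, 1, 1, 1, 2]],   # feature2
    [[1, 2, 2, 2, 2, 2, 2, 2, 1],
     [1, 2, 2, 2, 2, 2, 2, 2, 2, 1]],   # feature3
)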

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.