update: I have mostly figured out where the error is coming from.
Since I am feeding in sentences that were tokenised by BERT as lists, my text list ends up as a list nested inside another list of some kind. I am not sure, but either way, if I replace the tokenised text with a simple token list like the one in your example, ['NLP', 'Projects', 'Project', 'Name', ':'], the error goes away.
I am still confused about what this means, though. Is StackedEmbedding not compatible with BERTEmbedding sentence pairs?
You need to input a nested list like [['token1', 'token2']] to the embedding layer, even when using sentence pairs. Are you sure your data structure is right?
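For example (the model folder and tokens here are just placeholders):

```python
import kashgari
from kashgari.embeddings import BERTEmbedding

# x is a list of samples, each sample a flat list of tokens (a nested list)
x = [['NLP', 'Projects', 'Project', 'Name', ':'],
     ['token1', 'token2']]
# for a labeling task, y mirrors x with one tag per token
y = [['B', 'I', 'B', 'I', 'O'],
     ['O', 'O']]

embedding = BERTEmbedding('<bert_model_folder>',   # placeholder for a local BERT checkpoint
                          task=kashgari.LABELING,
                          sequence_length=100)
embedding.analyze_corpus(x, y)
print(embedding.embed(x))
```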
Thanks for the tip, I rechecked my data and corrected the input. Before this I was using the BERT tokeniser code from https://kashgari.bmio.net/embeddings/bert-embedding/#example-usage-text-classification . I think what was happening is that with that code we got an array like
[['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]'], ['[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']], which raised the "list is unhashable"
error. So I changed my input to the format below and the error went away.
Though just to confirm (as I don't want to get my embedding wrong): tokenized_text_list should be of the form [['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]', '[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']] ?
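To be concrete about the two shapes I mean (the tokens here are just placeholders):

```python
# Shape produced by the docs-style tokenizer code, which raised the unhashable-list error:
form_a = [['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]'],
          ['[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']]

# Shape I switched to, which runs without the error -- is this the intended one?
form_b = [['[CLS]', 'token1', '[SEP]', 'token2', '[SEP]',
           '[CLS]', 'token3', '[SEP]', 'token4', '[SEP]']]
```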
update: using BERTEmbedding with analyze_corpus inside a StackedEmbedding gives a "list is unhashable"
error. This happens when my dataset's x_train = ([['[CLS]', 'jim', 'henson', 'was', 'a', 'puppet', '##eer', '.', '[SEP]'], ['[CLS]', 'why', 'did', 'the', 'chicken', 'cross', 'the', 'road', '?', '[SEP]']], [[1, 2]], [[2, 1]], [[1, 2]])
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
(I asked a similar question before in https://github.com/BrikerMan/Kashgari/issues/158 ; it may provide some context.) My understanding is that when using StackedEmbedding we have to ensure every layer has the same sequence length so that the layers can be concatenated.
Problem: I am now running the code with the same sequence length for all the features and the embedding. The embedding runs during training, but I am still getting an index-out-of-bounds (axis) error. Also, my training results are heavily overfitted even though I am using a 90K-example dataset, so I suspect the index error may be the reason?
After modifying, my code looks like this (these are just the parts where the sequence length is involved):
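Roughly, with the BERT model folder and feature names below standing in as placeholders for my real ones:

```python
import kashgari
from kashgari.embeddings import BERTEmbedding, NumericFeaturesEmbedding, StackedEmbedding

SEQUENCE_LEN = 100

# The text embedding and every numeric-feature embedding share the same sequence length
bert_embedding = BERTEmbedding('<bert_model_folder>',    # placeholder for my local BERT checkpoint
                               task=kashgari.LABELING,
                               sequence_length=SEQUENCE_LEN)

feature_embeddings = [
    NumericFeaturesEmbedding(feature_count=2,             # placeholder count for a binary feature
                             feature_name=name,           # placeholder names
                             sequence_length=SEQUENCE_LEN)
    for name in ('feature_1', 'feature_2', 'feature_3')
]

stack_embedding = StackedEmbedding([bert_embedding] + feature_embeddings)

# tiny placeholder data, just to show the structure of the x tuple:
# (token lists, feature_1 lists, feature_2 lists, feature_3 lists)
texts     = [['jim', 'henson', 'was', 'a', 'puppet', '##eer', '.']]
feature_1 = [[1, 2, 1, 1, 2, 1, 1]]
feature_2 = [[2, 1, 2, 2, 1, 2, 2]]
feature_3 = [[1, 1, 2, 1, 1, 2, 1]]
train_x = (texts, feature_1, feature_2, feature_3)
train_y = [['O', 'O', 'O', 'O', 'B', 'I', 'O']]

stack_embedding.analyze_corpus(train_x, train_y)
```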
Since the embedding and all the feature lists now use the same sequence length, I should not be getting an index-out-of-bounds error. But when I run
print(stack_embedding.embed(train_x))
I am getting the following error:
IndexError: index 100 is out of bounds for axis 0 with size 100
And one more small question (unrelated to this issue): the documentation mentions leaving 0 for padding. Does that mean the feature values in the dataset shouldn't be 0?
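For example, with a binary feature, I mean the difference between these two encodings (made-up values, just to illustrate the question):

```python
# Encoded as 0/1 -- would the 0s here clash with the padding value?
feature_as_0_1 = [[0, 1, 1, 0, 1]]

# Encoded as 1/2 so that 0 stays reserved for padding
feature_as_1_2 = [[1, 2, 2, 1, 2]]
```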
Thank you for helping out :)