codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

IndexError: list index out of range #52

Open ghost opened 5 years ago

No description provided.

marcwww commented 5 years ago

@Marsxia Check whether the '\t' separator is actually present in your 'corpus.small' file. The examples in the README file are not actually ready to use.
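A minimal sketch of such a check (the path is taken from the README demo; this snippet is not part of the original comment):

# Sketch: verify each corpus line splits into exactly two sentences on TAB.
with open("data/corpus.small", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            print(f"line {i} is malformed: {line!r}")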

JasonLiu-THU commented 5 years ago

@Marsxia Check whether the '\t' separator is actually present in your 'corpus.small' file. The examples in the README file are not actually ready to use.

I do have the '\t' in my file, but I still met this problem.

riktimmondal commented 5 years ago

I am getting the same error but haven't been able to resolve it.

vdpappu commented 5 years ago

@Marsxia @riktimmondal I faced this problem too. Cleaning up the text while generating the corpus file fixed the issue for me. I cannot point to the specifics, but adapting the code below to your case might help:

import re

def cleanText(text):
    text = text.replace('\\n', '')  # drop literal "\n" sequences
    text = text.replace('\\', '')   # drop stray backslashes
    #text = text.replace('\t', '')
    #text = re.sub(r'\[(.*?)\]', '', text)  # removes [this one]
    # replace URLs with a placeholder token
    text = re.sub(r'(http:\/\/www\.|https:\/\/www\.|http:\/\/|https:\/\/)?[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(:[0-9]{1,5})?(\/.*)?\s',
                  ' __url__ ', text)
    #text = re.sub('\'', '', text)
    #text = re.sub(r'\d+', ' __number__ ', text)  # replaces numbers
    #text = re.sub(r'\W', ' ', text)
    text = re.sub(' +', ' ', text)  # collapse runs of spaces
    text = text.replace('\t', '')   # strip tabs; the separator is added back later
    text = text.replace('\n', '')
    return text

# file_path, file_list and split_text are defined elsewhere in my script:
# a source directory, its file names, and a marker sentence after which
# the rest of a document is discarded.
file_write = []
for file_ in file_list:
    curr_file = file_path + file_
    with open(curr_file, "r") as f_:
        curr_text = f_.readlines()[0]  # each file holds one long line of text
    curr_text = cleanText(curr_text)
    curr_text = curr_text[2:]
    curr_text_list = curr_text.split('.')
    if split_text in curr_text_list:
        curr_text_list_trim = curr_text_list[:curr_text_list.index(split_text)]
    else:
        curr_text_list_trim = curr_text_list
    # keep only documents with more than 5 sentences and sentences longer
    # than 10 characters; a blank line separates documents
    if len(curr_text_list_trim) > 5:
        for ele in curr_text_list_trim:
            if len(ele) > 10:
                file_write.append(ele.strip() + '.')
        file_write.append("")

# drop the last two entries (the trailing blank separator and the sentence before it)
file_write = file_write[:-2]
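As a final step (a guess, not shown in the comment above), the cleaned sentences can be written out in the tab-separated format the README demo uses, pairing consecutive sentences:

# Hypothetical: write "sentence \t sentence" pairs, one pair per line.
sentences = [s for s in file_write if s]  # skip the blank separator lines
with open("data/corpus.small", "w", encoding="utf-8") as out:
    for t1, t2 in zip(sentences[::2], sentences[1::2]):
        out.write(t1 + "\t" + t2 + "\n")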

junchen14 commented 5 years ago

I also met this issue but cannot solve it. Could anybody help?

aluminumbox commented 5 years ago

I also met this issue but cannot solve it. Could anybody help?

It is very easy to fix this problem. Just debug the code at line 23 in dataset.py.
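For anyone wondering why that line fails, here is a minimal, self-contained reproduction (the corpus strings are the README demo lines; the split and indexing mirror what later comments quote from dataset.py):

good = "Welcome to the\tthe jungle\n"    # a real TAB separator
bad = "Welcome to the \\t the jungle\n"  # the two characters "\" and "t", as pasted from the README

for line in (good, bad):
    parts = line[:-1].split("\t")  # what line 23 of dataset.py does
    try:
        t1, t2 = parts[0], parts[1]  # how the dataset consumes each pair
        print("OK:", (t1, t2))
    except IndexError:
        print("IndexError: list index out of range ->", parts)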

MohamedLotfyElrefai commented 5 years ago

I tried using these 2 lines (with duplicates of them) in the dataset:

Welcome to the \t the jungle\n
I can stay \t here all night\n

and I face the same error: [screenshot of the error]

iiiHunter commented 4 years ago

Change the code at line 23 in dataset.py, from split("\t") to split("\\t").

songyingxin commented 4 years ago

If you use the demo corpus from the README, change the code at line 23 in dataset.py from split("\t") to split("\\t").
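The difference is easy to check in isolation (a minimal demo; the string is the README demo line with a literal backslash-t, as it ends up when copied verbatim):

line = r"Welcome to the \t the jungle"  # contains a literal backslash and "t", not a TAB
print(line.split("\t"))   # ['Welcome to the \\t the jungle'] -- one element, IndexError later
print(line.split("\\t"))  # ['Welcome to the ', ' the jungle'] -- two sentences, as intended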

qiaomeng commented 4 years ago

I also met this issue but cannot solve it. Could anybody help?

It is very easy to fix this problem. Just debug the code at line 23 in dataset.py.

Thanks a lot! After a night of debugging, I fixed this problem. First, change line 23 of dataset.py to:

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

Then download this file: https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing, put it into $YOUPROJECT/data/, and run:

bert-vocab -c data/corpus.small -o data/vocab.small
bert -c data/corpus.small -v data/vocab.small -o output/bert.model

The program then runs normally.

limengqigithub commented 3 years ago

This is how I solved this problem. My corpus is like this:

Welcome to the\tthe jungle
I can stay\there all night

Then change the code at line 23 in dataset.py (the .py file named in the error traceback) from

self.lines = [line[:-1].split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

to

self.lines = [line[:-1].replace("\n", "").split("\t") for line in tqdm.tqdm(f, desc="Loading Dataset", total=corpus_lines)]

Emir-Liu commented 3 years ago

I also met the same problem, but I found the above solutions useless. Then I downloaded corpus.small from the link above (https://drive.google.com/file/d/1gdNG92VABX8eWc7JWnU7Y-1wa5cu5-0L/view?usp=sharing) and all my problems were solved. I suspect the problem was caused by my editor, which is ridiculous: when I use vim, it is set to auto-convert \t into four spaces, and that was the cause. I only found this by opening corpus.small in the Ubuntu text editor.

PS: after solving the problem, I tried the following data again, and there was no problem:

Welcome to the \t the jungle\n
I can stay \t here all night\n
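A quick way to spot this tab-versus-spaces trap without opening an editor (a sketch, not from the thread, assuming the README path data/corpus.small):

# Sketch: flag corpus lines whose TAB was silently turned into spaces.
with open("data/corpus.small", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        if "\t" not in line:
            print(f"line {i} has no TAB (spaces instead?): {line!r}")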

LeoRainly commented 3 months ago

@qiaomeng Thanks for sharing! What dataset is corpus.small, and how did you find it? :)