codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation

Apache License 2.0

6.11k stars 1.29k forks

the format of input #41

Open eveliao opened 5 years ago

eveliao commented 5 years ago

You mentioned that

NOTICE : Your corpus should be prepared with two sentences in one line with tab(\t) separator

and gave an example:

Welcome to the \t the jungle\n I can stay \t here all night\n

However, the example is actually ONE sentence in one line. Should it be:

Welcome to the jungle \t I can stay here all night\n

(suppose these two sentences are continuous in the broader context)
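To make the two readings concrete, here is a minimal sketch of how one corpus line could be split on the tab separator. The helper name `parse_corpus_line` is hypothetical and is not taken from the repository; it only assumes the "two segments per line, tab-separated" format quoted above.

```python
from typing import Tuple


def parse_corpus_line(line: str) -> Tuple[str, str]:
    """Split one corpus line into its two tab-separated segments.

    Hypothetical helper for illustration; not the repository's loader.
    """
    left, sep, right = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError(f"expected a tab separator in line: {line!r}")
    # Strip the optional spaces that surround the tab.
    return left.strip(), right.strip()


# Reading 1: two pieces of a single sentence (the README example).
half_sentences = parse_corpus_line("Welcome to the \t the jungle\n")

# Reading 2: two consecutive full sentences (the proposal in this issue).
full_sentences = parse_corpus_line(
    "Welcome to the jungle \t I can stay here all night\n"
)
```

Both lines parse the same way; the only difference is whether each segment is a full sentence or a fragment.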

codertimo commented 5 years ago

I mean it could be two pieces of one sentence, not necessarily two real sentences. It doesn't matter whether a line holds two sentences or one sentence split in two. This example comes from the original paper. So you can choose whichever you want 👍

JustinLin610 commented 5 years ago

I am interested in next-sentence prediction. If the input data are all continuous sentences, how can the model randomly select 50% continuous pairs and 50% discontinuous pairs?
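One common way to get such a split, sketched here under assumptions and not presented as this repository's exact code: keep the true second segment with probability 0.5, and otherwise swap in the second segment of a different, randomly chosen line. The function name `sample_nsp_pair` and the label convention (1 = continuous, 0 = random) are hypothetical choices for this example.

```python
import random
from typing import List, Tuple


def sample_nsp_pair(
    pairs: List[Tuple[str, str]], index: int, rng: random.Random
) -> Tuple[str, str, int]:
    """Return (first, second, label) for next-sentence prediction.

    label is 1 when `second` really follows `first`, 0 when it was
    swapped for the second segment of another line.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two lines to draw a negative example")
    first, second = pairs[index]
    if rng.random() < 0.5:
        return first, second, 1
    # Draw a different line uniformly: sample from len-1 slots, skip `index`.
    other = rng.randrange(len(pairs) - 1)
    if other >= index:
        other += 1
    return first, pairs[other][1], 0


pairs = [
    ("Welcome to the jungle", "I can stay here all night"),
    ("The weather is nice", "Let us go for a walk"),
    ("She opened the book", "The first page was blank"),
]
rng = random.Random(0)
first, second, label = sample_nsp_pair(pairs, 0, rng)
```

Because the negative is drawn at sampling time, the corpus itself can consist entirely of continuous pairs.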

andy-yangz commented 5 years ago

I also think it would be better to mention that spaces are needed around '\t'; otherwise, without the spaces, we end up with more vocab entries.
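A small sketch of why the spaces can matter. It assumes a vocabulary builder that deletes the tab character before splitting on whitespace; `naive_vocab` is a hypothetical stand-in for such a builder, and whether a given pipeline behaves this way depends on how it handles '\t'.

```python
from typing import Iterable, Set


def naive_vocab(lines: Iterable[str]) -> Set[str]:
    """Collect whitespace-separated tokens after deleting newlines and tabs.

    Assumption for illustration: the tab is removed, not replaced by a space.
    """
    vocab: Set[str] = set()
    for line in lines:
        vocab.update(line.replace("\n", "").replace("\t", "").split())
    return vocab


# With spaces around the tab, the neighbouring words stay separate.
spaced = naive_vocab(["Welcome to the \t the jungle\n"])

# Without spaces, the words on either side of the tab fuse into one token.
unspaced = naive_vocab(["Welcome to the\tthe jungle\n"])
```

Under this assumption the unspaced line yields the fused token `thethe`, which is the kind of extra vocabulary entry the comment above is worried about. A builder that splits on the tab first would not have this problem.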

PandasPan commented 3 years ago

Yeah, this is clear then.
