Closed larsbun closed 1 year ago
Hi @larsbun, I suspect a specific sentence in your dataset is causing the problem. You can use the following code to find all such sentences:
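corpus = ...      # your Corpus (elided in the original post)
embeddings = ...  # the embeddings that fail during training (elided)

invalid_sentences = []
for sentence in corpus.get_all_sentences():
    try:
        embeddings.embed(sentence)
    except Exception:  # collect every sentence that cannot be embedded
        invalid_sentences.append(sentence)

print("There are", len(invalid_sentences), "invalid sentences")
if invalid_sentences:  # avoid an IndexError when nothing failed
    print(invalid_sentences[0])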
Could you please run this code twice to check whether the result is consistent and, if it is, share an example that fails?
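The "run it twice" check can also be automated. The following is only a minimal sketch in plain Python, not part of flair: `failing_indices` and `is_consistent` are hypothetical helper names, and `fn` stands for any callable that raises on bad input (for the case above, presumably `embeddings.embed`).

```python
def failing_indices(items, fn):
    """Return the indices of the items for which fn raises an exception."""
    failed = []
    for i, item in enumerate(items):
        try:
            fn(item)
        except Exception:
            failed.append(i)
    return failed


def is_consistent(items, fn, runs=2):
    """Repeat the check `runs` times.

    Returns a tuple (consistent, failures_from_first_run).
    """
    results = [failing_indices(items, fn) for _ in range(runs)]
    return all(r == results[0] for r in results), results[0]


if __name__ == "__main__":
    # Toy stand-in for the real data: int() raises ValueError on "x".
    # With flair one would pass the sentences and embeddings.embed instead
    # (assumption based on the snippet above).
    print(is_consistent(["1", "x", "3"], int))
```

If the same indices fail on every run, the problem most likely lies in the data rather than in something non-deterministic in the setup.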
Hi,
Thanks for the pointer. The data was indeed faulty, with higher-order UTF-8 characters causing it to stop:
'''
There are 1 invalid sentences
Sentence[1]: "" → [""/c]
'''
This didn't occur to me as the reason, since FlairEmbeddings handled the data without a problem. Anyway, now it's out there and searchable if others experience the same.
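For anyone who lands here with the same symptom, here is a minimal sketch of how raw text could be scanned for such characters before training. It assumes that "higher-order" means code points above U+FFFF (outside the Basic Multilingual Plane, e.g. emoji); that is an assumption, not something confirmed in this thread, so adjust `threshold` as needed. `find_high_codepoints` and `flag_lines` are hypothetical helpers, not part of flair.

```python
def find_high_codepoints(text, threshold=0xFFFF):
    """Return (index, character, 'U+XXXX') for each character above threshold."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > threshold
    ]


def flag_lines(lines, threshold=0xFFFF):
    """Map 1-based line numbers to the suspicious characters found on them."""
    flagged = {}
    for lineno, line in enumerate(lines, start=1):
        hits = find_high_codepoints(line, threshold)
        if hits:
            flagged[lineno] = hits
    return flagged


if __name__ == "__main__":
    # Illustrative data only; the second line contains an emoji (U+1F600).
    sample = ["plain ascii", "emoji \U0001F600 here", "na\u00efve caf\u00e9"]
    for lineno, hits in flag_lines(sample).items():
        print(lineno, hits)
```

Lines flagged this way can then be inspected, cleaned, or dropped before the corpus is built.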
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Question
With a setup such as this:
and a tagger like this:
trained like this:
My script fails like this:
i.e., after the first epoch has completed. I can see that there is a mismatch between the embeddings that are provided and the embeddings that are expected, but it is not clear to me how to fix it. Is there something I misunderstand conceptually about the embeddings? If I use FlairEmbeddings instead, the same setup works without fault.