explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Spacy Tokenizer Boundary Issue. #69

Closed: ZohaibRamzan closed this issue 3 years ago

ZohaibRamzan commented 3 years ago

I am using the spaCy tokenizer within a stanza pipeline. In some sentences, the spaCy tokenizer does not tokenize the sentence-ending period '.' as a separate token, which in my case is needed. Here is my code:

```python
import stanza
from unidecode import unidecode

nlp = stanza.Pipeline('en', processors={'tokenize': 'spacy'})

sentence = 'To 10-30mm2 section of stained material in a 2ml microfuge tube, add 600µl Lysis Buffer and 10µl Proteinase K.'
sentence = sentence.rstrip()
doc = nlp(unidecode(sentence))  # initialize the stanza pipeline for every new sentence
token = [word.text for sent in doc.sentences for word in sent.words]
```

The result is:

```
["To", "10", "-", "30mm2", "section", "of", "stained", "material", "in", "a", "2ml", "microfuge", "tube", ",", "add", "600ul", "Lysis", "Buffer", "and", "10ul", "Proteinase", "K."]
```

I want the last two tokens to be 'K' and '.'. Can I do that?
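A possible workaround, not suggested anywhere in this thread, is to customize spaCy's tokenizer directly. The sketch below operates on a plain spaCy pipeline and assumes spaCy v3, where the default English suffix rules only split a trailing "." after a lowercase letter or after two consecutive uppercase letters, which is why "K." stays fused; adding a suffix pattern for a single uppercase letter splits the period off. Wiring such a customized tokenizer into stanza's `{'tokenize': 'spacy'}` processor is a separate question not covered here.

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")

# Assumption: the stock English suffix rules do not split "." after a single
# uppercase letter, so "K." is kept as one token. Add a pattern that does.
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[A-Z])\."]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("add 600ul Lysis Buffer and 10ul Proteinase K.")
print([t.text for t in doc])
# expected under the assumptions above: [..., 'Proteinase', 'K', '.']
```

Note that tokenizer special cases take precedence over suffix rules, so if a given spaCy build listed "K." as an exception, that entry would also need to be removed from `nlp.tokenizer.rules`.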

polm commented 3 years ago

Don't open the same issue in two places.

Closing as dupe of https://github.com/explosion/spaCy/issues/7592.