TakeLab / spacy-udpipe

spaCy + UDPipe
MIT License

Extract morphological features for Hindi #27

Closed Ksartik closed 3 years ago

Ksartik commented 3 years ago

I want to extract the morphological features of a token (the equivalent of the FEATS column in the CoNLL-U format). Output like what UDPipe for R produces would be sufficient, but I can't find a straightforward way to get it here.

I am using a Hindi model and parsed a dummy text. UDPipe for R shows detailed features for each token, but with spacy-udpipe I can't figure out how to extract them. I checked .morph, which has many of the attributes I want, but all of them are empty ('').

I also tried the solution hinted at in spaCy. I used nlp.vocab.morphology.tag_map, but this dictionary only has 19 POS tags (all of them UPOS) and no feature information (each entry is a dictionary with 74 as its only key).

Interestingly, I checked Dutch and its tag_map seems much denser. But since UDPipe for R shows features for Hindi as well, they should also be available in Python.

Is it possible to extract these features? It would be a great help.

asajatovic commented 3 years ago

I believe it is possible! It would be very helpful if you could provide a concrete example. 😄

asajatovic commented 3 years ago

@Ksartik Just an update - spaCy v3.0 will fully integrate morphological attributes so expect to see them here soonish! 🎉
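
For reference, the FEATS column discussed in this thread is a `|`-separated list of `Name=Value` pairs in CoNLL-U, with `_` standing for "no features" and commas separating multiple values of one feature. Below is a minimal sketch, using only the Python standard library, of turning such a string into a dictionary. The helper name `parse_feats` and the sample string are illustrative assumptions. They are not part of spacy-udpipe or spaCy, and the sample is not actual model output for Hindi.

```python
from typing import Dict, List


def parse_feats(feats: str) -> Dict[str, List[str]]:
    """Parse a CoNLL-U FEATS string into {feature name: [values]}.

    Hypothetical helper for illustration; not a spacy-udpipe API.
    """
    feats = feats.strip()
    # In CoNLL-U, "_" marks an empty FEATS column.
    if not feats or feats == "_":
        return {}
    parsed: Dict[str, List[str]] = {}
    for pair in feats.split("|"):
        # Each pair looks like "Case=Nom"; a feature may carry several
        # comma-separated values, e.g. "Gender=Fem,Masc".
        name, _, value = pair.partition("=")
        parsed[name] = value.split(",")
    return parsed


# Made-up sample string in the FEATS format (an assumption, not model output).
sample = "Case=Nom|Gender=Masc|Number=Sing|Person=3"
for name, values in parse_feats(sample).items():
    print(name, values)
```

With spaCy v3, `str(token.morph)` should produce a string in this same format, and `token.morph.to_dict()` should return the features directly, so a helper like this would mainly be useful for raw CoNLL-U output.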

Ksartik commented 3 years ago

This is exactly what I wanted. Thanks. I am closing this issue then.