bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

handle both "sent_id" and "sentence_id" for udpipe_read_conllu #122

Closed HedvigS closed 8 months ago

HedvigS commented 8 months ago

Some CoNLL-U files use sent_id instead of the more common sentence_id.

Could udpipe::udpipe_read_conllu() please be configured to handle both cases?

Love the package otherwise, great stuff!

HedvigS commented 8 months ago

For myself, I've now written a little script that checks each conllu file: if it contains "sent_id", every occurrence is replaced with "sentence_id". Hacky workaround, but will work (fingers crossed) :)
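A minimal sketch of the kind of workaround script described above, written in Python purely for illustration (the comment does not say which language was actually used). The function names `normalize_sent_id` and `normalize_file` are hypothetical. The sketch assumes the key appears in comment lines of the form `# sent_id = ...`, and it only rewrites those lines, which is narrower than replacing every occurrence in the file.

```python
import re
from pathlib import Path

# Assumed layout: a comment line such as "# sent_id = 1".
# Only the key at the start of a comment line is matched, so token
# rows that happen to contain the string "sent_id" are left alone.
_SENT_ID = re.compile(r"^(#[ \t]*)sent_id([ \t]*=)", flags=re.MULTILINE)


def normalize_sent_id(text: str) -> str:
    """Rename the 'sent_id' comment key to 'sentence_id'."""
    return _SENT_ID.sub(r"\1sentence_id\2", text)


def normalize_file(src: Path, dst: Path) -> bool:
    """Write a normalized copy of src to dst; return True if anything changed."""
    original = src.read_text(encoding="utf-8")
    updated = normalize_sent_id(original)
    dst.write_text(updated, encoding="utf-8")
    return updated != original
```

Writing to a separate destination file rather than editing in place keeps the original corpus untouched, which makes it easier to check whether the renaming has any effect on reading the file at all.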

jwijffels commented 8 months ago

https://github.com/bnosac/udpipe/blob/master/R/udpipe_parse.R#L259 and https://github.com/bnosac/udpipe/blob/master/R/udpipe_parse.R#L265 do check for sent_id

HedvigS commented 8 months ago

Hm... right. When I tried reading in a few files that used "sent_id", udpipe_read_conllu() didn't work. I changed it to "sentence_id" and then it did work. That's a bit odd then.

HedvigS commented 8 months ago

I'll do some more bug testing.

jwijffels commented 8 months ago

Can you provide a reproducible example?

HedvigS commented 8 months ago

> Can you provide a reproducible example?

I'm sorry to have bothered you. I just tried it again and the problem is something else. I was mistaken in thinking it was about "sent_id" vs "sentence_id". Sorry, and thank you for your patience and for this wonderful package!