Using other models than UDPipe

fahadshery commented 6 years ago

Hi again, I am back :)

I have finally created my own model using the example provided in the docs. This worked perfectly, with no issues at all. However, when I used your RDRPOSTagger package for tokenisation, I ran into weird behavior. The complete reproducible example is here.

Here are the issues/info:

x <- crf_cbind_attributes(x1, terms = c("upos", "lemma"), by = "doc_id") creates 68 columns in total when run on the dataframe tokenised and POS-tagged with UDPipe, whereas the same call on the dataframe tokenised and POS-tagged with RDRPOSTagger creates only 36 columns.
The chunk_entity column is not prefixed with I, B or O, as it is for the UDPipe dataframe.
It introduces duplicates for the RDRPOSTagger-tokenised dataframe, but not when the UDPipe model is used.
You have to swap the arguments when calling merge(crfsuite_annotation_verbatim_to_annotate, rdr_tagging). In the docs, the annotated object is passed first and then the y object, but that throws an error: Error in merge.chunkrange(crfsuite_annotation_verbatim_to_annotate, rdr_tagging) : all(c(by.y, "start", "end") %in% colnames(y)) is not TRUE. However, the UDPipe-tokenised dataframe doesn't have start and end columns either. I fixed it by swapping the arguments: x <- merge(rdr_tagging, crfsuite_annotation_verbatim_to_annotate)

I wrote a complete n00b getting started guide here.

thanks

jwijffels commented 6 years ago

Great showoff! I briefly skimmed your starting guide and your question. I think there is one underlying problem, and everything you describe points to it.

Namely: the merge step x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens) assumes that verbatim_tokens has the fields start and end. To get these, use verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE), or just use udpipe(verbatims, udmodel).

Your current code does not do this. As a consequence, your training dataset is wrong.
With RDRPOSTagger, you never get start/end fields in your tokenised dataset, so that will not work.
If you are using RDRPOSTagger for commercial reasons, why not consider udpipe(verbatims, "english", udpipe_model_repo = "bnosac/udpipe.models.ud")? That tokenises with a model that is fine for commercial use, downloaded from here: https://github.com/bnosac/udpipe.models.ud
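
To make the start/end requirement concrete, here is a minimal base R sketch of the kind of check the error message above describes. The helper missing_merge_columns() is hypothetical (it is not part of crfsuite or udpipe), the default by = "doc_id" is an assumption, and the two toy dataframes are made up for illustration.

```r
# Hypothetical helper, not part of crfsuite or udpipe: returns the columns
# that the merge step needs but that are absent from the token dataframe.
# Assumption: the join key is "doc_id".
missing_merge_columns <- function(tokens, by = "doc_id") {
  required <- c(by, "start", "end")
  setdiff(required, colnames(tokens))
}

# Toy token table that carries character offsets (values are illustrative).
with_offsets <- data.frame(
  doc_id = c("doc1", "doc1"),
  token  = c("Hello", "world"),
  start  = c(1L, 7L),
  end    = c(5L, 11L),
  stringsAsFactors = FALSE
)

# Toy token table without offsets, as described for RDRPOSTagger output.
without_offsets <- with_offsets[, c("doc_id", "token")]

missing_merge_columns(with_offsets)     # nothing missing
missing_merge_columns(without_offsets)  # "start" and "end" are missing
```

Running a check like this on verbatim_tokens before calling merge() shows straight away whether the tokeniser output is usable for building the training data.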

Keep up the spirit!

fahadshery commented 6 years ago

This is awesome! I will test your points and report back

fahadshery commented 6 years ago

All worked well, thank you. I am now using udpipe(verbatims, "english", udpipe_model_repo = "bnosac/udpipe.models.ud") for commercial reasons, and it is working well.
