bnosac / crfsuite

Labelling Sequential Data in Natural Language Processing with R - using CRFsuite

Using other models than UDPipe #8

Closed fahadshery closed 6 years ago

fahadshery commented 6 years ago

Hi again, I am back :)

I have finally created my own model using the example provided in the docs. This worked perfectly, with no issues at all. However, when I used your RDRPOSTagger package to do the tokenisation, I ran into weird behaviour. The complete reproducible example is here.

Here are the issues/info:

  1. x <- crf_cbind_attributes(x1, terms = c("upos", "lemma"), by = "doc_id") creates 68 columns in total when run on the dataframe tokenised and POS-tagged with UDPipe, whereas the same call on the dataframe tokenised and POS-tagged with RDRPOSTagger creates only 36 columns.
  2. The chunk_entity column is not prefixed with I, B or O, as it is for the UDPipe tokenised and POS-tagged dataframe.
  3. It introduces duplicates for the RDRPOSTagger tokenised dataframe, but not when the UDPipe model is used.
  4. You have to swap the arguments when calling merge(crfsuite_annotation_verbatim_to_annotate, rdr_tagging). In the docs, the annotated object is passed first and then the y object, but that throws an error: Error in merge.chunkrange(crfsuite_annotation_verbatim_to_annotate, rdr_tagging) : all(c(by.y, "start", "end") %in% colnames(y)) is not TRUE. Yet the UDPipe tokenised dataframe doesn't have start and end columns either. I fixed it by swapping the arguments: x <- merge(rdr_tagging, crfsuite_annotation_verbatim_to_annotate)

I wrote a complete n00b getting started guide here.

thanks

jwijffels commented 6 years ago

Great showoff! I briefly skimmed your starting guide and your question. I think there is one problem, and everything you indicate points to it.

Namely: the merge part x <- merge(crfsuite_annotation_verbatim_to_annotate, verbatim_tokens) really assumes that you have the fields start and end in verbatim_tokens. To get them, use verbatim_tokens <- as.data.frame(verbatim_tokens, detailed = TRUE), or just use udpipe(verbatims, udmodel).
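The precondition behind that error message can be checked before calling merge. The following is a minimal base R sketch, not part of crfsuite or udpipe: the helper missing_merge_columns and the toy token tables are hypothetical and only mirror the all(c(by.y, "start", "end") %in% colnames(y)) assertion quoted in the error above.

```r
# Hypothetical helper (not from crfsuite or udpipe): list which of the columns
# the merge expects are absent from a token data frame.
missing_merge_columns <- function(tokens, by = "doc_id") {
  required <- c(by, "start", "end")
  setdiff(required, colnames(tokens))
}

# Toy token tables with made-up values, for illustration only.
# The first has no character offsets; the second does.
tokens_without_offsets <- data.frame(
  doc_id = c("doc1", "doc1"),
  token  = c("Hello", "world"),
  stringsAsFactors = FALSE
)
tokens_with_offsets <- data.frame(
  doc_id = c("doc1", "doc1"),
  start  = c(1L, 7L),   # position of the first character of each token
  end    = c(5L, 11L),  # position of the last character of each token
  token  = c("Hello", "world"),
  stringsAsFactors = FALSE
)

missing_without <- missing_merge_columns(tokens_without_offsets)
missing_with    <- missing_merge_columns(tokens_with_offsets)

print(missing_without)       # columns still needed before merging
print(length(missing_with))  # zero means the table is ready to merge
```

If the helper reports start or end as missing, the token table probably needs to be regenerated with offsets (for example with as.data.frame(..., detailed = TRUE) or udpipe(), as suggested above) rather than swapping the merge arguments.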

Keep up the spirit!

fahadshery commented 6 years ago

This is awesome! I will test your points and report back

fahadshery commented 6 years ago

All worked well, thank you. I am now using udpipe(verbatims, "english", udpipe_model_repo = "bnosac/udpipe.models.ud") for commercial reasons, and it is working well.