POS on the whole word without segmentation.

amal-meer commented 4 years ago

Greetings. I am trying to use the POS tagger. I ran the following code:

from farasa.pos import FarasaPOSTagger

pos_tagger_interactive = FarasaPOSTagger(interactive=True)
pos_tagged_interactive = pos_tagger_interactive.tag('ذهب الطالب إلى المدرسة')
print("sample POS Tagged (interactive)", pos_tagged_interactive)
pos_tagger_interactive.terminate()

and got this result:

sample POS Tagged (interactive) S/S ذهب/V ال+/DET طالب/NOUN-MS إلى/PREP ال+/DET مدرس/NOUN-FS +ة/NSUFF E/E

The problem is that I need the tagger to tag each individual word without using the segmenter, i.e. without splitting off the prefixes and suffixes. To clarify, I need the same output as the Farasa online demo, as shown in the picture below.

Is there a way to have this output?

MagedSaeed commented 4 years ago

Hey @amal-meer

Actually, this is the output you get when you use the terminal-based Farasa toolkit. What happens on the website, if I am not mistaken, is nothing but post-processing of the output. I do not think they run the segmenter again; it is just a straightforward string split. I believe this post-processing step should be available in this library too. We will try to work on it soon. If you happen to do any work on it, please share it so that we can integrate it.
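A minimal sketch of what such a post-processing step could look like, using only string operations on the tagger output quoted earlier in this thread. It assumes that tokens have the form surface/TAG, that a prefix ends with "+", that a suffix starts with "+", and that S/S and E/E mark the sentence boundaries. The function name regroup_clitics is hypothetical; this is neither farasapy's API nor necessarily what the Farasa website does.

```python
from typing import List, Tuple

# Sentence start/end markers as they appear in the output quoted in this thread.
BOUNDARY_TOKENS = {"S/S", "E/E"}


def regroup_clitics(tagged: str) -> List[Tuple[str, List[str]]]:
    """Merge clitic-level 'surface/TAG' tokens back into whole words.

    Returns one (word, tags) pair per word, keeping every segment's tag.
    """
    words: List[Tuple[str, List[str]]] = []
    segments: List[str] = []
    tags: List[str] = []
    awaiting_stem = False  # True right after a prefix such as "ال+"

    def flush() -> None:
        # Emit the word collected so far, if any, and reset the buffers.
        if segments:
            words.append(("".join(segments), list(tags)))
            segments.clear()
            tags.clear()

    for token in tagged.split():
        if token in BOUNDARY_TOKENS or "/" not in token:
            continue
        surface, tag = token.rsplit("/", 1)
        is_prefix = len(surface) > 1 and surface.endswith("+")
        is_suffix = len(surface) > 1 and surface.startswith("+")
        # A token continues the current word if it follows a prefix
        # or if it is a suffix attached to an existing word.
        if not (awaiting_stem or (is_suffix and segments)):
            flush()
        segments.append(surface.strip("+") or surface)
        tags.append(tag)
        awaiting_stem = is_prefix

    flush()
    return words


tagged = "S/S ذهب/V ال+/DET طالب/NOUN-MS إلى/PREP ال+/DET مدرس/NOUN-FS +ة/NSUFF E/E"
for word, word_tags in regroup_clitics(tagged):
    print(f"{word}/{'+'.join(word_tags)}")
```

Joining the tags with "+" is only one possible presentation; the function returns the tags as a list so the caller can format them differently.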

Thanks for sharing such insights.

MagedSaeed commented 4 years ago

For reference, this seems to be the original paper for Farasa POS tagging.

https://www.aclweb.org/anthology/W17-1316.pdf

amal-meer commented 4 years ago

Thanks for your response and for creating this helpful library. I will try to work on it and share the code if it works.

MagedSaeed commented 4 years ago

Thanks for your support @amal-meer

Regarding the problem at hand, I wrote code to desegment the segmented tokens while keeping the original tags for each segment. I think this is a fair choice: the end user should decide what to keep and what to remove from this list of tags.

Here is a sample output:

What do you think about this output?
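To illustrate how an end user might reduce such a per-word list of tags to a single tag, here is a small sketch. It assumes each word comes with its segment tags in order, and it treats DET, PREP and NSUFF (the clitic tags visible in the output earlier in this thread) as affix tags. Both the helper name pick_main_tag and that affix set are hypothetical choices, not part of farasapy.

```python
from typing import Iterable, Sequence

# Assumption: these tags mark clitics/affixes rather than the word's main tag.
# Only tags that appear in this thread are listed; extend the set as needed.
DEFAULT_AFFIX_TAGS = frozenset({"DET", "PREP", "NSUFF"})


def pick_main_tag(tags: Sequence[str], affix_tags: Iterable[str] = DEFAULT_AFFIX_TAGS) -> str:
    """Return the first tag that is not an affix tag.

    If every tag is an affix tag (e.g. a standalone preposition),
    fall back to the first tag in the list.
    """
    if not tags:
        raise ValueError("tags must not be empty")
    affix = set(affix_tags)
    for tag in tags:
        if tag not in affix:
            return tag
    return tags[0]


words = [
    ("ذهب", ["V"]),
    ("الطالب", ["DET", "NOUN-MS"]),
    ("إلى", ["PREP"]),
    ("المدرسة", ["DET", "NOUN-FS", "NSUFF"]),
]
for word, word_tags in words:
    print(f"{word}/{pick_main_tag(word_tags)}")
```

Keeping the affix set as a parameter leaves the decision about what to keep and what to drop with the caller.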

amal-meer commented 4 years ago

I think this is the desired output for this problem, but I wonder whether Farasa has another tagger that tags the whole word, as mentioned in the paper (screenshots from the paper are below).

MagedSaeed commented 4 years ago

That is interesting! I will try to check their APIs. Thanks for mentioning that! Better to keep this open and not commit anything until we come across a consistent approach.

MagedSaeed commented 4 years ago

Hey @amal-meer

In their paper, the authors of the Farasa POS tagger mention that they developed two separate POS tagging systems. I expected both systems to be included in the package they distribute through their website. Unfortunately, there is no option to choose between running the package at the word level and at the clitic level. Although they say that both systems are open source, the word-level package does not seem to be available for download on the main website. If you have any idea how to get the second package, it would be great to include it in farasapy. Maybe emailing the Farasa authors about this would give some insight?

Another important point mentioned in the paper is that the clitic-level tagger is more accurate than the word-level one. So the system we currently have is the more accurate of the two.

amal-meer commented 4 years ago

I ran the code in a Java IDE rather than in the terminal, following the code published by Farasa on their website. I noticed that they segment the sentence and then tag it.

ArrayList<String> segOutput = farasa.segmentLine("النص المراد معالجته");
Sentence sentence = farasaPOS.tagLine(segOutput);

I changed the code to split the sentence on whitespace rather than segmenting it, and got a tag for each whole word.

This is the code of the main function:

    public static void main(String[] args) throws Exception {

        Farasa farasa = new Farasa();
        FarasaPOSTagger farasaPOS = new FarasaPOSTagger(farasa);

        String s = "ذهب الطالب إلى   المدرسة";

        // Segmentation: tag at the clitic level (prefixes and suffixes are split off)
        ArrayList<String> segOutput = farasa.segmentLine(s);
        System.out.println("Segment output: " + segOutput);

        Sentence clitictag = farasaPOS.tagLine(segOutput);

        for (Clitic w : clitictag.clitics) {
            // Check the string's content; != would only compare object references
            System.out.println(w.surface + "/" + w.guessPOS
                    + ((w.genderNumber != null && !w.genderNumber.isEmpty()) ? "-" + w.genderNumber : "") + " ");
        }

        // Splitting: pass whitespace-separated whole words to the tagger instead
        ArrayList<String> splitOutput = new ArrayList<>(Arrays.asList(s.split("\\s+")));
        System.out.println("\nSplit output: " + splitOutput);

        Sentence sentencetag = farasaPOS.tagLine(splitOutput);

        for (Clitic w : sentencetag.clitics) {
            System.out.println(w.surface + "/" + w.guessPOS
                    + ((w.genderNumber != null && !w.genderNumber.isEmpty()) ? "-" + w.genderNumber : "") + " ");
        }
    }

and this is the output

MagedSaeed commented 4 years ago

This is great work, @amal-meer. However, I want to let you know that farasapy spawns terminal processes to run the Farasa binaries. Although this is tricky, it provides some gains in speed. For my part, I tried multiple times to reproduce these results in the terminal but, unfortunately, without success. If you find a way to produce these results in the terminal, please let me know so that I can integrate it into farasapy. Otherwise, I do not think it will be easy to integrate because of that design choice. Nevertheless, you can use alternatives to achieve your goal, such as Py4J [https://www.py4j.org/index.html], which lets Python code call Java classes. Keep in mind that you may sacrifice some speed with that approach.
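For readers unfamiliar with that design, here is a generic sketch of keeping one child process alive and exchanging lines with it over pipes, so the start-up cost is paid only once. It is not farasapy's actual implementation: the child here is a stand-in Python one-liner that upper-cases its input, and the LineWorker class and its method names are hypothetical.

```python
import subprocess
import sys

# Stand-in child process: echoes each input line back in upper case.
# A real wrapper would launch the external tool's command line here instead.
CHILD_CODE = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    print(line.strip().upper(), flush=True)\n"
)


class LineWorker:
    """Keep one child process running and talk to it line by line."""

    def __init__(self) -> None:
        self._proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_CODE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes in text mode
        )

    def send(self, line: str) -> str:
        # Write one request line, then block until the child answers with one line.
        self._proc.stdin.write(line + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().rstrip("\n")

    def terminate(self) -> int:
        # Closing stdin signals end-of-input, which lets the child exit cleanly.
        self._proc.stdin.close()
        code = self._proc.wait()
        self._proc.stdout.close()
        return code


worker = LineWorker()
print(worker.send("hello"))
print(worker.send("second request, same process"))
worker.terminate()
```

The trade-off is that the wrapper can only do what the child offers on its standard input and output, which is why a feature that is reachable only through the Java classes is hard to expose this way.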

amal-meer commented 4 years ago

I've tried the terminal but couldn't get any result. I think the desegment solution is the only way to solve this problem in farasapy. I would appreciate it if you could share or commit the code.

And thanks for your fast response and for suggesting Py4J.

MagedSaeed commented 4 years ago

That is great!

I will try to integrate it soon. Please stay tuned. Moreover, I found out that there are other options I can add. I will keep you updated.

MagedSaeed commented 4 years ago

@amal-meer

I pushed the changes to the repository. You can test them on this notebook:

https://colab.research.google.com/drive/1sFDta9U3t2iXX_PXjpXTl1L-G02EWPmH?usp=sharing

Please let me know if you have any comments. I will, then, do the tests and push to PyPI.

Thanks in advance.

MagedSaeed commented 4 years ago

@amal-meer

Any updates?

amal-meer commented 4 years ago

I just tested the notebook and it works well 👍. Thanks for your effort.

MagedSaeed commented 4 years ago

@amal-meer

The changes are pushed to PyPI. Please upgrade the package and test.

If you do not have any concerns, please close the issue.

Many thanks in advance.

amal-meer commented 3 years ago

Everything works fine. One last suggestion is to add an example of the tag_segments function to the interactive Google Colab code in the How to use section of the README file.

Thanks for your time and effort.

MagedSaeed commented 3 years ago

Thanks for closing the issue, @amal-meer. I already added it to the interactive Colab session but forgot to add it to README.md. I will take care of that. Thanks for the suggestion.

MagedSaeed / farasapy

POS on the whole word without segmentation. #9