amal-meer closed this issue 3 years ago
Hey @amal-meer
Actually, this is the output you get when you use the terminal-based Farasa toolkit. What happens on the website, if I am not mistaken, is nothing but postprocessing of the output. I do not think they ran the segmenter again, just a straightforward string split. I believe this postprocessing step should be available in this library too. We will try to work on it soon. If you happen to do any work on that, please share it so that we can integrate it.
Thanks for sharing such insights.
For reference, this seems to be the original paper for Farasa POS tagging.
Thanks for your response and for creating this helpful library. I will try to work on it and share the code if it works.
Thanks for your support @amal-meer
Regarding the problem at hand, I wrote some code to desegment the segmented tokens while keeping the original tag of each segment. I think this is a fair choice: the end user should decide what to keep and what to remove from this list of tags.
Here is a sample output:
What do you think about this output?
I think this is the desired output for this problem, but I wonder if Farasa has another tagger that tags the whole word, as mentioned in the paper (screenshots from the paper are below).
That is interesting! I will try to check their APIs. Thanks for mentioning that! Better to keep this open and not commit anything until we come across a consistent approach.
Hey @amal-meer
In their paper, the authors of the Farasa POS tagger mention that they developed two separate systems for POS tagging. I expected both systems to be included in the package they distribute through their website. Unfortunately, there is no option to choose between running the package at the word level or at the clitic level. Although they mention that they open-sourced both systems, the word-level package does not seem to be available for download on the main website. If you have any idea how to get the second package, it would be great to include it in farasapy. Maybe emailing the Farasa authors about these concerns would give some insight?
Another important point mentioned in the paper is that the clitic-level system has better accuracy than the word-level one. So the system we currently have is the more accurate of the two.
I ran the code in a Java IDE rather than the terminal and followed the code that Farasa provides on their website. I noticed that they segment the sentence and then tag it:
    ArrayList<String> segOutput = farasa.segmentLine("النص المراد معالجته");
    Sentence sentence = farasaPOS.tagLine(segOutput);
I changed the code to split the sentence on whitespace rather than segmenting it, and got a tag for each whole word.
This is the code of the main function:
    // Needs java.util.ArrayList and java.util.Arrays, plus the Farasa segmenter and POS tagger classes.
    public static void main(String[] args) throws Exception {
        Farasa farasa = new Farasa();
        FarasaPOSTagger farasaPOS = new FarasaPOSTagger(farasa);
        String s = "ذهب الطالب إلى المدرسة";

        // Segmentation: tag each clitic produced by the segmenter
        ArrayList<String> segOutput = farasa.segmentLine(s);
        System.out.println("Segment output: " + segOutput);
        Sentence clitictag = farasaPOS.tagLine(segOutput);
        for (Clitic w : clitictag.clitics) {
            // Test for emptiness with isEmpty(); != on strings compares references, not contents
            boolean hasGenderNumber = w.genderNumber != null && !w.genderNumber.isEmpty();
            System.out.println(w.surface + "/" + w.guessPOS + (hasGenderNumber ? "-" + w.genderNumber : "") + " ");
        }

        // Splitting: tag whitespace-separated words without segmenting them
        ArrayList<String> splitOutput = new ArrayList<>(Arrays.asList(s.split("\\s+")));
        System.out.println("\nSplit output: " + splitOutput);
        Sentence sentencetag = farasaPOS.tagLine(splitOutput);
        for (Clitic w : sentencetag.clitics) {
            boolean hasGenderNumber = w.genderNumber != null && !w.genderNumber.isEmpty();
            System.out.println(w.surface + "/" + w.guessPOS + (hasGenderNumber ? "-" + w.genderNumber : "") + " ");
        }
    }
and this is the output
This is great work, @amal-meer. However, I want to let you know that farasapy spawns terminal processes to run the Farasa binaries. Although this is tricky, it provides some gains in speed. I tried multiple times to reproduce the same results on the terminal but, unfortunately, with no success. If you find a way to produce these results on the terminal, please let me know so that I can integrate it into farasapy. Otherwise, I do not think it will be easy to integrate because of that design choice. Nevertheless, you can use alternatives such as Py4J [https://www.py4j.org/index.html], which calls Java classes from within Python. Keep in mind that you may sacrifice some speed with that approach.
I tried the terminal too but couldn't get any result. I think the desegmentation solution is the only way to solve this problem in farasapy. I would appreciate it if you could share or commit the code.
And thanks for your fast response and for suggesting Py4J.
That is great!
I will try to integrate it soon. Please stay tuned. Moreover, I found out that there are other options I can add. I will keep you updated.
@amal-meer
I pushed the changes to the repository. You can test them on this notebook:
https://colab.research.google.com/drive/1sFDta9U3t2iXX_PXjpXTl1L-G02EWPmH?usp=sharing
Please let me know if you have any comments. I will, then, do the tests and push to PyPI.
Thanks in advance.
@amal-meer
Any updates?
I just tested the notebook and it works well 👍. Thanks for your effort.
@amal-meer
The changes are pushed to PyPI. Please upgrade the package and test.
If you do not have any concerns, please close the issue.
Many thanks in advance.
Everything works fine. One last suggestion is to add an example of the tag_segments function to the interactive Google Colab code in the "How to use" section of the README file.
Thanks for your time and effort.
Thanks for closing the issue, @amal-meer. I already added it to the interactive Colab session but forgot to add it to the README.md. I will take care of that. Thanks for the suggestion.
Greetings. I am trying to use the POS tagger. I ran the following code
and got the result
sample POS Tagged (interactive) S/S ذهب/V ال+/DET طالب/NOUN-MS إلى/PREP ال+/DET مدرس/NOUN-FS +ة/NSUFF E/E
The problem is that I need a tag for each individual word without using the segmenter, that is, without splitting off the prefixes and suffixes. To clarify, I need the same output as the Farasa online demo, as shown in the picture below.
Is there a way to have this output?
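The desegmentation approach discussed in this thread can be sketched as a small postprocessing step over the tagged line shown above. This is only an illustration, not the code that was added to farasapy: the class `Desegment`, the type `TaggedWord`, and the choice of joining tags with `+` are hypothetical. It assumes the format seen in the sample output, where a prefix ends with `+`, a suffix starts with `+`, and `S/S` and `E/E` mark the sentence boundaries.

```java
import java.util.ArrayList;
import java.util.List;

public class Desegment {
    /** One whole word plus the tags of its segments, in order (hypothetical type). */
    static final class TaggedWord {
        final String surface;
        final List<String> tags;

        TaggedWord(String surface, List<String> tags) {
            this.surface = surface;
            this.tags = tags;
        }

        @Override
        public String toString() {
            return surface + "/" + String.join("+", tags);
        }
    }

    /** Regroups "segment/TAG" tokens into whole words, keeping every segment tag. */
    static List<TaggedWord> desegment(String tagged) {
        List<TaggedWord> words = new ArrayList<>();
        StringBuilder surface = new StringBuilder();
        List<String> tags = new ArrayList<>();
        boolean prevWasPrefix = false;

        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) {
                continue; // not a segment/TAG pair
            }
            String seg = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            // Assumption: S/S and E/E are sentence boundary markers, not words.
            if ((seg.equals("S") && tag.equals("S")) || (seg.equals("E") && tag.equals("E"))) {
                continue;
            }

            boolean isSuffix = seg.length() > 1 && seg.startsWith("+");
            boolean isPrefix = seg.length() > 1 && seg.endsWith("+");

            // A segment starts a new word unless it is a suffix or follows a prefix.
            if (!isSuffix && !prevWasPrefix && surface.length() > 0) {
                words.add(new TaggedWord(surface.toString(), tags));
                surface = new StringBuilder();
                tags = new ArrayList<>();
            }

            // Drop only the attachment marker, not any other character.
            String bare = isSuffix ? seg.substring(1) : seg;
            if (isPrefix) {
                bare = bare.substring(0, bare.length() - 1);
            }
            surface.append(bare);
            tags.add(tag);
            prevWasPrefix = isPrefix;
        }
        if (surface.length() > 0) {
            words.add(new TaggedWord(surface.toString(), tags));
        }
        return words;
    }

    public static void main(String[] args) {
        String tagged = "S/S ذهب/V ال+/DET طالب/NOUN-MS إلى/PREP ال+/DET مدرس/NOUN-FS +ة/NSUFF E/E";
        for (TaggedWord w : desegment(tagged)) {
            System.out.println(w);
        }
    }
}
```

For the sample line, this groups the nine tokens into four words, with the last one carrying the tags DET, NOUN-FS, and NSUFF. Keeping all segment tags leaves the decision of which tag represents the word to the caller, which matches the choice described earlier in the thread.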