Closed duaaalkhafaje closed 2 years ago
I am aware of an index for Dutch (it is linked in the Wiki).
It is a little bit hard to help you without knowing the exact issue that you encounter :wink: Please have a look at the description in the wiki. Feel free to ask questions when you struggle with one of the steps.
To be more precise: it is easier to help you if you can say at which step you encounter problems. The steps are (headlines from the reference article):
Did you already preprocess your documents? What exactly did you try? What error do you encounter?
Hi @duaaalkhafaje do you still need help? Can you please provide more information about your problem (as described above)?
Hi, Mr. @MichaelRoeder, I'm really sorry for my late reply; I've been very busy these days. Thanks for your interest and for your quick reply. I am really confused and don't know where to start, so I was looking for another way to evaluate my work because I don't have enough time now.
The truth is that I did not know how to start adding my data and working on it with Java. I know how to deal with Java code, but I do not know why my mind stopped here. Can you give me examples of data in the Arabic language that I could benefit from?
I see. I thought it might be something urgent. However, if you are busy, you can of course take your time to answer :wink: If you are more used to programming in Python, you can also use Gensim. They provide an implementation in Python. However, I have never tried it myself and cannot say whether it is good or not. If you prefer Java, we can try to make it work.
My knowledge about the Arabic language is very limited. However, I think we can clarify some of the questions you may have :wink:
In the following, I will assume that
Please let me know if one of the assumptions is not correct or not clear to you.
There are two approaches: you can either 1) use the corpus on which you have generated your topics or 2) use a large, more general corpus. Our results showed that the second approach leads to better results. A classic corpus that you could use is the Arabic Wikipedia (e.g., https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9)
Option 1 has the advantage that your corpus is most probably already pre-processed (as stated above) while option 2 would raise the demand to pre-process the reference corpus (e.g., Wikipedia) in the same way as your original corpus has been pre-processed.
Let me know if you have further questions with respect to this point.
If you already have a clear picture of which coherence calculation you would like to use, that can reduce the work you have to do. However, you can simply create a position-storing index, as it works with all of the important coherence calculations.
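To illustrate why storing positions keeps all options open, here is a minimal sketch in plain Java. It assumes that some coherence calculations count co-occurrences per whole document while others count them inside a sliding window of neighbouring tokens, which is only possible when token positions are known. The class `CooccurrenceSketch` and its methods are hypothetical names for this illustration; this is not Palmetto's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: hypothetical names, not part of Palmetto.
final class CooccurrenceSketch {

    // Document-level counting: number of documents containing both words anywhere.
    // Token positions are not needed for this.
    static int countDocuments(List<String> corpusLines, String w1, String w2) {
        int count = 0;
        for (String line : corpusLines) {
            List<String> tokens = Arrays.asList(line.split("\\s+"));
            if (tokens.contains(w1) && tokens.contains(w2)) {
                count++;
            }
        }
        return count;
    }

    // Window-level counting: number of windows of windowSize consecutive tokens
    // that contain both words. This needs to know where each token occurs.
    static int countWindows(List<String> corpusLines, String w1, String w2, int windowSize) {
        int count = 0;
        for (String line : corpusLines) {
            String[] tokens = line.split("\\s+");
            // A document shorter than the window is treated as a single window.
            int lastStart = Math.max(0, tokens.length - windowSize);
            for (int start = 0; start <= lastStart; start++) {
                int end = Math.min(tokens.length, start + windowSize);
                boolean has1 = false;
                boolean has2 = false;
                for (int i = start; i < end; i++) {
                    has1 |= tokens[i].equals(w1);
                    has2 |= tokens[i].equals(w2);
                }
                if (has1 && has2) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "this is my first example document",
                "this is a second document",
                "and a third document follows");
        System.out.println(countDocuments(corpus, "this", "document"));
        System.out.println(countWindows(corpus, "this", "document", 3));
    }
}
```

In this toy corpus the two words share two documents but never fall into the same three-token window, so the two counting schemes give different numbers for the same word pair.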
The easiest way is to transform the corpus in the following form:
this is my first example document note that the terms are simply separated by white spaces
this is a second document
and a third document follows
Note that each of the documents has its own line. However, some of the document structure has been removed (e.g., paragraphs and even sentences no longer play a role at this stage).
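The transformation described above could be sketched in plain Java as follows. This is only an illustration under some assumptions: the class `CorpusLineFormatter` is a hypothetical name, tokens are taken to be runs of Unicode letters, combining marks, and digits (so Arabic words with diacritics stay in one piece), and lowercasing is applied although it only affects scripts that have letter case. Your own pre-processing (stop word removal, stemming, etc.) may look different and should be applied in the same way to the reference corpus.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustration only: hypothetical helper, not part of Palmetto.
final class CorpusLineFormatter {

    // Turns one raw document into a single line of tokens separated by spaces.
    // Paragraph breaks, sentence boundaries, and punctuation are dropped.
    static String toLine(String rawDocument) {
        // Split on every run of characters that are not letters, marks, or digits.
        String[] parts = rawDocument.split("[^\\p{L}\\p{M}\\p{N}]+");
        List<String> tokens = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                // Assumption: lowercasing is wanted; it leaves Arabic text unchanged.
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return String.join(" ", tokens);
    }

    // Converts a list of raw documents into one line per document,
    // skipping documents that contain no tokens at all.
    static List<String> toLines(List<String> rawDocuments) {
        List<String> lines = new ArrayList<>();
        for (String raw : rawDocuments) {
            String line = toLine(raw);
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList(
                "This is my first example document.\n\nNote the paragraph break!",
                "This is a second document.");
        for (String line : toLines(raw)) {
            System.out.println(line);
        }
    }
}
```

Writing the resulting lines to a UTF-8 text file gives a corpus file in the one-document-per-line form shown above.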
After that, you can simply use the code from the wiki page to 1) create the Lucene index and 2) create the histogram file. Note that this step is not fully language independent but it should work for Arabic as it uses a simple pattern to separate the single words in the text from each other based on white spaces (and punctuation).
Mr. @MichaelRoeder, thank you very much for the detailed explanation. Now it is clearer to me, and what you said is true. I have started working on it.
But I have some beginner's questions 😅
org.aksw.palmetto.corpus.lucene.creation? Or what?
org.aksw.palmetto.corpus.lucene.creation.IndexableDocument objects, each row represents a document from the corpus, and I pass it the (text, number of tokens) parameters. Right?
I really appreciate your effort in explaining, thank you very much.
iterator() method. However, if your corpus is too big to fit into memory at once, you need a different solution. Let me know whether that is the case or not.
Hi Mr. @MichaelRoeder, I know it's late, but thank you for your detailed answer.
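For the case mentioned above, where the corpus is too big to fit into memory at once, one possible direction is an iterator that reads the corpus file line by line and only ever holds one document. The following is a minimal sketch in plain Java, assuming the one-document-per-line format. `CorpusDocument` is a hypothetical stand-in for the (text, number of tokens) pair discussed above; it is not Palmetto's IndexableDocument class, and `StreamingCorpus` is likewise a made-up name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a (text, number of tokens) pair; not a Palmetto class.
final class CorpusDocument {
    final String text;
    final int numberOfTokens;

    CorpusDocument(String text, int numberOfTokens) {
        this.text = text;
        this.numberOfTokens = numberOfTokens;
    }
}

// Reads one document per line on demand, so only a single line is held in memory.
final class StreamingCorpus implements Iterator<CorpusDocument> {
    private final BufferedReader reader;
    private String nextLine;

    StreamingCorpus(Reader source) {
        this.reader = new BufferedReader(source);
        advance();
    }

    // Moves to the next non-empty line, or sets nextLine to null at end of input.
    private void advance() {
        try {
            do {
                nextLine = reader.readLine();
            } while (nextLine != null && nextLine.trim().isEmpty());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public CorpusDocument next() {
        if (nextLine == null) {
            throw new NoSuchElementException();
        }
        String text = nextLine.trim();
        // Tokens are assumed to be separated by white space.
        int numberOfTokens = text.split("\\s+").length;
        advance();
        return new CorpusDocument(text, numberOfTokens);
    }

    public static void main(String[] args) {
        Reader source = new StringReader("this is a second document\nand a third document follows\n");
        StreamingCorpus corpus = new StreamingCorpus(source);
        while (corpus.hasNext()) {
            CorpusDocument doc = corpus.next();
            System.out.println(doc.numberOfTokens + " tokens: " + doc.text);
        }
    }
}
```

For a real corpus file, the `StringReader` could be replaced by a reader obtained from `Files.newBufferedReader(path, StandardCharsets.UTF_8)`, which matters for Arabic text.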
Hello, sorry for the inconvenience again. But I wonder if anyone was able to use Palmetto with another language?! To be honest, I didn't know how to create a new index, I need some help :(