dice-group / Palmetto

Palmetto is a quality measuring tool for topics
GNU Affero General Public License v3.0

Use Palmetto with a different language #63

Closed · duaaalkhafaje closed this issue 2 years ago

duaaalkhafaje commented 2 years ago

Hello, sorry for the inconvenience again, but I wonder whether anyone has been able to use Palmetto with another language. To be honest, I don't know how to create a new index, and I need some help :(

MichaelRoeder commented 2 years ago

I am aware of an index for Dutch (it is linked in the wiki).

It is a little bit hard to help you without knowing the exact issue that you encounter :wink: Please have a look at the description in the wiki, and feel free to ask questions if you struggle with one of the steps.

MichaelRoeder commented 2 years ago

To be more precise: it is easier to help you if you can say at which step you encounter problems (the steps are the headlines from the reference article).

Have you already pre-processed your documents? What exactly did you try? What error do you encounter?

MichaelRoeder commented 2 years ago

Hi @duaaalkhafaje, do you still need help? Could you please provide more information about your problem (as described above)?

duaaalkhafaje commented 2 years ago

Hi, Mr. @MichaelRoeder. I'm really sorry for my late reply; I've been very busy these days. Thanks for your interest and for your quick reply. My mind is really confused and I don't know where to start, so I was looking for another way to evaluate my work, because I don't have enough time right now.

duaaalkhafaje commented 2 years ago

The truth is that I did not know how to start adding my data and working on it with Java. I know how to work with Java code, but I do not know why my mind stopped here. Can you give me examples of data that I can use for the Arabic language?

MichaelRoeder commented 2 years ago

I see. I thought it might be something urgent. However, if you are busy, you can of course take your time to answer :wink: If you are more used to programming in Python, you can also use Gensim, which provides a coherence implementation in Python. However, I have never tried it myself and cannot say whether it is good or not. If you prefer Java, we can try to make it work.

My knowledge about the Arabic language is very limited. However, I think we can clarify some of the questions you may have :wink:

In the following, I will assume that

  1. you already have some corpus (i.e., a set of documents) that
  2. you used for generating topics / topic models and that
  3. these documents have been pre-processed in some way.

Please let me know if one of the assumptions is not correct or not clear to you.

1. Which corpus do you want to use for the coherence calculation?

There are two approaches: 1) you can use the corpus on which you generated your topics, or 2) you can use a large, more general corpus. Our results showed that the second approach leads to better results. A classic corpus that you could use is the Arabic Wikipedia (e.g., https://ar.wikipedia.org/wiki/%D8%A7%D9%84%D8%B5%D9%81%D8%AD%D8%A9_%D8%A7%D9%84%D8%B1%D8%A6%D9%8A%D8%B3%D9%8A%D8%A9).

Option 1 has the advantage that your corpus is most probably already pre-processed (as stated above), while option 2 requires you to pre-process the reference corpus (e.g., Wikipedia) in the same way as your original corpus was pre-processed.

Let me know if you have further questions with respect to this point.

2. Which coherence do you want to use?

If you already have a clear picture of which coherence calculation you would like to use, it can help to reduce the work that you have to do. However, you can simply create a position-storing index, as it works with all of the important coherence calculations.

3. How to transform the reference corpus?

The easiest way is to transform the corpus into the following form:

this is my first example document note that the terms are simply separated by white spaces
this is a second document
and a third document follows

Note that each of the documents has its own line. However, some structure of the documents has been removed (e.g., paragraphs and even sentences no longer play a role at this stage).
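
As a small illustration, the transformation could look like the following. This is only a sketch: I assume that your pre-processed documents are stored as one plain-text file per document in a folder called docs; the names docs and corpus-ar.txt are just placeholders.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CorpusToLines {
    public static void main(String[] args) throws IOException {
        // Read every pre-processed document from "docs/" and write it as a single
        // line (tokens separated by white spaces) into "corpus-ar.txt".
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("corpus-ar.txt"), StandardCharsets.UTF_8);
             DirectoryStream<Path> docs = Files.newDirectoryStream(Paths.get("docs"))) {
            for (Path doc : docs) {
                String text = new String(Files.readAllBytes(doc), StandardCharsets.UTF_8);
                // Collapse all line breaks and repeated white spaces so that the whole
                // document ends up as a single line of tokens.
                out.write(text.replaceAll("\\s+", " ").trim());
                out.newLine();
            }
        }
    }
}
```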

After that, you can simply use the code from the wiki page to 1) create the Lucene index and 2) create the histogram file (a rough sketch follows below). Note that this step is not fully language-independent, but it should work for Arabic, as it uses a simple pattern that separates the single words in the text from each other based on white spaces (and punctuation).
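
For orientation, the wiki code for this step has roughly the following shape. This is only a sketch from memory: the class names PositionStoringLuceneIndexCreator and LuceneIndexHistogramCreator, their constructor arguments, and the field names "text" and "length" should all be double-checked against the actual wiki snippet before you rely on them.

```java
import java.io.File;
import java.util.Iterator;

import org.aksw.palmetto.corpus.lucene.creation.IndexableDocument;
import org.aksw.palmetto.corpus.lucene.creation.LuceneIndexHistogramCreator;
import org.aksw.palmetto.corpus.lucene.creation.PositionStoringLuceneIndexCreator;

public class ArabicIndexCreation {
    public static void main(String[] args) throws Exception {
        // An existing, empty directory in which the new index will be created.
        File indexDir = new File("arabic-wiki-index");
        // An iterator over the reference documents; see the notes on this further below.
        Iterator<IndexableDocument> docIterator = null; // TODO: provide your documents here

        // 1) Create the position-storing Lucene index.
        //    (The field names "text" and "length" are the defaults as far as I remember;
        //     please take them from the wiki snippet / the Palmetto constants.)
        PositionStoringLuceneIndexCreator creator =
                new PositionStoringLuceneIndexCreator("text", "length");
        creator.createIndex(indexDir, docIterator);

        // 2) Create the histogram file for the new index.
        LuceneIndexHistogramCreator hCreator = new LuceneIndexHistogramCreator("length");
        hCreator.createLuceneIndexHistogram(indexDir.getAbsolutePath());
    }
}
```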

duaaalkhafaje commented 2 years ago

Mr. @MichaelRoeder, thank you very much for the detailed explanation. It is clearer for me now, and what you said is true. I have started working on it.

But I have some beginner's questions 😅

  1. Should I create the index in a new class within the org.aksw.palmetto.corpus.lucene.creation package, or somewhere else?
  2. When I create a list of org.aksw.palmetto.corpus.lucene.creation.IndexableDocument objects, does each entry represent one document from the corpus, and do I pass it the (text, number of tokens) parameters? Is that right?
  3. For File indexDir = // --> should I give a directory in which the new index will be saved?
  4. For Iterator docIterator = // --> can you explain what "an iterator that can iterate over the reference documents" means?

I really appreciate your effort in explaining, thank you very much.

MichaelRoeder commented 2 years ago

  1. Please use the latest version in the master branch. If you cannot build it locally, you can find version 0.1.2 at https://hobbitdata.informatik.uni-leipzig.de/homes/mroeder/palmetto
  2. Yes, you can see a simple example in the JUnit test for the index generation. There are three documents in the lines 41-43. However, please note that in practice it might not be feasible to load your complete corpus into memory. In that case, you may have to find a different solution for iterating over the documents (it is not too hard to implement; see the sketch below this list).
  3. This is the directory in which the index will be created. In the best case you choose an existing, empty directory.
  4. "Iterator" is a design pattern that says that an iterator is something that allows you to go through the elements of a collection one after the other. If you have the documents as a list (like in the test class linked above), you can easily create it by calling the iterator() method. However, if your corpus is too big to fit into memory at once you need a different solution. Let me know whether that is the case or not.

duaaalkhafaje commented 2 years ago

Hi Mr. @MichaelRoeder, I know it's late, but thank you for your detailed answer.