ayman-mezghani / Wikipedia-Graph-Processor

bachelor's semester project

Infer topics using Wikipedia categories #2

Closed mizvol closed 4 years ago

mizvol commented 4 years ago

Infer topics of clusters of Wikipedia pages. Use the Wikipedia API to access the category tree of each page in the cluster.
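
For reference, fetching the categories of a single page via the MediaWiki API looks roughly like this (a sketch; the page title is just an example, and `clshow=!hidden` excludes hidden maintenance categories):

```scala
import scala.io.Source

// Query the MediaWiki API for the (non-hidden) categories of one page.
val page = "Portsmouth_Naval_Shipyard"
val url = "https://en.wikipedia.org/w/api.php?action=query&prop=categories" +
  s"&clshow=!hidden&cllimit=max&format=json&titles=$page"
val response = Source.fromURL(url).mkString // raw JSON with a "categories" list per page
println(response)
```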

ayman-mezghani commented 4 years ago

I could extract the list of categories for each page and used those lists to create a map (category → frequency) for each cluster, but I noticed a lot of hidden categories that are useless for topic detection. I need to extract the parent categories of each extracted category in order to filter out the children of "Hidden categories". Currently thinking about the most efficient way to filter them out, querying each category at most once (see the sketch after this list):

1. Create a map `{cluster : {category : freq}}`.
2. Create a set of all categories of the graph, then query each element of the set and keep only the visible categories.
3. For each cluster (in 1), keep only the categories that exist in the set (found in 2).
4. Find a way to get the topic, maybe by taking the most frequent category.
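
A minimal sketch of steps 1–4 in plain Scala (the types and names here are hypothetical, not the project's actual code; it assumes the page→categories map and the set of visible categories are already built):

```scala
// Hypothetical inputs: clusters maps a cluster id to its page titles,
// pageCategories maps a page title to its categories.
def clusterTopics(
    clusters: Map[Int, Seq[String]],
    pageCategories: Map[String, Seq[String]],
    visibleCategories: Set[String] // step 2: categories that are not hidden
): Map[Int, String] =
  clusters.map { case (clusterId, pages) =>
    // Steps 1 + 3: category -> frequency map, hidden categories dropped
    val freq = pages
      .flatMap(pageCategories.getOrElse(_, Nil))
      .filter(visibleCategories.contains)
      .groupBy(identity)
      .map { case (cat, occurrences) => cat -> occurrences.size }
    // Step 4: most frequent visible category as the cluster's topic
    // (assumes every cluster has at least one visible category)
    clusterId -> freq.maxBy(_._2)._1
  }
```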

mizvol commented 4 years ago

Sounds good. It should give you an overall idea about the topic.

Is there a way to query the entire tree of categories at once?

Unless you can use the API to do that, you could do it using Neo4j. Write a Cypher query that would return all categories for all pages in a cluster (something like this: `MATCH (c:Category)<-[:BELONGS_TO]-(p:Page) WHERE p.title IN {node_list} RETURN c.title`). That is probably the most efficient way. And that way, we would not abuse the API.
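
Running such a query from Scala with the official Neo4j Java driver could look something like this (a sketch only; the connection details and the `node_list` parameter name are placeholders):

```scala
import org.neo4j.driver.{AuthTokens, GraphDatabase, Values}
import scala.jdk.CollectionConverters._

// Placeholder connection details; adjust to the actual Neo4j setup.
val driver  = GraphDatabase.driver("bolt://localhost:7687", AuthTokens.basic("neo4j", "password"))
val session = driver.session()

val query =
  """MATCH (c:Category)<-[:BELONGS_TO]-(p:Page)
    |WHERE p.title IN $node_list
    |RETURN DISTINCT c.title AS title""".stripMargin

// Example page from this thread; in practice, pass all pages of one cluster.
val nodeList = List("Portsmouth_Naval_Shipyard").asJava
val categories = session
  .run(query, Values.parameters("node_list", nodeList))
  .list[String](r => r.get("title").asString())
  .asScala

session.close(); driver.close()
```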

ayman-mezghani commented 4 years ago

Since our dumps include the categories, we can extract just the categories and build a directed graph or tree (without trying to find any peaks, so basically just a simple extraction) that would represent the entire tree of categories.

I believe using the dumps we already have is a better idea than using the API, since the existing wrappers do not allow distinguishing between a normal page and a category page. A rough sketch of that extraction is below.
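
A sketch under the assumption that the parsed dump is available as two Spark DataFrames, `pages` (id, title, namespace) and `links` (fromId, toId); in Wikipedia dumps, namespace 14 marks category pages, which is exactly the distinction the API wrappers lack:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Keep only edges that point into category pages (namespace 14),
// yielding the category graph as an edge list.
def categoryGraph(pages: DataFrame, links: DataFrame): DataFrame = {
  val categories = pages.filter(col("namespace") === 14).select(col("id").as("toId"))
  links.join(categories, "toId")
}
```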

ayman-mezghani commented 4 years ago

So I'll do the following:

mizvol commented 4 years ago

> something like this: `MATCH (c:Category)<-[:BELONGS_TO]-(p:Page) WHERE p.title IN {node_list} RETURN c.title`. That is probably the most efficient way. And that way, we would not abuse the API.

Yes. You could use something like the Cypher query I mentioned before.

ayman-mezghani commented 4 years ago

Note to self: use ML to infer the subject from the categories of a cluster.

ayman-mezghani commented 4 years ago

Got the Cypher query to fetch the categories that are not hidden for a certain page: `MATCH (p:Page)-[:BELONGS_TO]->(c:Category) WHERE p.title = "Portsmouth_Naval_Shipyard" AND NOT exists((c)-[:BELONGS_TO]->(:Category {title: "Hidden_categories"})) RETURN c.title`

ayman-mezghani commented 4 years ago

The program now uses Neo4j to fetch the categories. The next step is using TF-IDF and LDA to get the cluster topics.

ayman-mezghani commented 4 years ago

I implemented the TF-IDF and LDA. But it's not possible to get the actual words back from the LDA output, since they were hashed by the HashingTF transformer. I think I'm going to use CountVectorizer + IDF, then LDA. CountVectorizer exposes its vocabulary, which will be useful to translate the LDA output back into words. I have to do some research about the topic.
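
A sketch of that CountVectorizer + IDF + LDA pipeline in Spark ML (the `docs` DataFrame, its "tokens" column, and `numClusters` are assumptions, not the project's actual names):

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import org.apache.spark.sql.functions.{col, udf}

// docs: one row per cluster, "tokens" holding its category words.
val cvModel = new CountVectorizer()
  .setInputCol("tokens").setOutputCol("rawFeatures")
  .fit(docs)
val counted = cvModel.transform(docs)

val featurized = new IDF()
  .setInputCol("rawFeatures").setOutputCol("features")
  .fit(counted)
  .transform(counted)

val ldaModel = new LDA()
  .setK(numClusters).setMaxIter(50).setFeaturesCol("features")
  .fit(featurized)

// Unlike HashingTF, CountVectorizer keeps its vocabulary, so LDA's term
// indices can be mapped back to actual words.
val vocab = cvModel.vocabulary
val indicesToWords = udf((indices: Seq[Int]) => indices.map(vocab(_)))
val topics = ldaModel.describeTopics(5) // top 5 term indices + weights per topic
  .withColumn("terms", indicesToWords(col("termIndices")))
```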

ayman-mezghani commented 4 years ago

Made sure that the text is clean by removing "useless" punctuation such as `,` `'` `\` `.`. I didn't remove the `-` since it is useful to understand tokens such as "21st-century". I also used the Spark ML StopWordsRemover to do some further text cleaning.
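
Roughly, that cleaning step could look like this (a sketch; `raw` and the column names are placeholders):

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lower, regexp_replace, split}

// Strips , ' \ . but keeps '-' so tokens like "21st-century" survive,
// then tokenizes on whitespace and removes stop words.
def cleanText(raw: DataFrame): DataFrame = {
  val tokenized = raw
    .withColumn("text", regexp_replace(lower(col("text")), """[,'\\.]""", " "))
    .withColumn("tokens", split(col("text"), """\s+"""))
  new StopWordsRemover()
    .setInputCol("tokens").setOutputCol("filtered")
    .transform(tokenized)
}
```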

ayman-mezghani commented 4 years ago

Finally understood how to get the words that describe every topic and found a way to assign each page community the words describing its topic! There's still some dirt in the results: some irrelevant words that could be treated as stop words. I will try to "hunt" them down and do further cleaning of the data, then I will try to tune the LDA model to maybe get better results (since increasing the number of iterations already yielded better topic descriptions). This should be done by tomorrow; I'll commit the code to GitHub when it's done. Afterwards, I'll clean up the code by putting each step into a function, making it easier to understand.
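
One possible way to "hunt" the leftover noise words is to extend the default English stop-word list (the extra words below are placeholders, not the actual ones found):

```scala
import org.apache.spark.ml.feature.StopWordsRemover

// Placeholder project-specific noise words to filter out on top of the defaults.
val extraStopWords = Array("articles", "pages", "stubs")
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")
  .setStopWords(StopWordsRemover.loadDefaultStopWords("english") ++ extraStopWords)
```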

ayman-mezghani commented 4 years ago

I assigned the LDA topic words to each cluster and created a DataFrame to compare the output of the first method (using the most frequent category) with the new output (from LDA).

Comment on the LDA output: I made LDA extract a number of topics equal to the number of clusters. But when assigning topics, not all of them were used, and multiple clusters got the same topic (though not as often as with the first approach). We may need to feed more data to the algorithm (page summaries, maybe?).
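
A sketch of that assignment step, reusing `ldaModel` and `featurized` from the earlier pipeline sketch (the column names are Spark ML's; the rest is assumed):

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Pick each cluster's most likely topic via argmax over its topic
// distribution; several clusters can end up with the same argmax, which
// is why some topics go unused.
val dominantTopic = udf((dist: Vector) => dist.argmax)
val assigned = ldaModel.transform(featurized)
  .withColumn("topic", dominantTopic(col("topicDistribution")))
```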