dbpedia / GSoC

Google Summer of Code organization

Fusing the List Extractor and the Table Extractor #6

Closed mgns closed 4 years ago

mgns commented 6 years ago

Description

Currently, there are two separate projects for extracting triples from lists and from tables. Both aim to extract data from Wikipedia pages and to build a dictionary for mapping the elements found in those pages. The student has to study how these projects work (how they create dictionaries, how they call services, etc.) and merge them into a unified extractor. This means restructuring both projects so that they share a common dictionary, which makes it easier to integrate them into one.

The student can also add a GUI so that users with little or no knowledge of the project can add triples. The GUI should include a tool for looking up existing classes and properties in the latest DBpedia ontology. Other user-facing improvements are welcome too (e.g. more comments, a demo that shows all steps).

The student should also add support for different languages, so that the extractor can extract triples from different language editions of Wikipedia. This should include languages that do not use the Latin alphabet (such as Greek and Hebrew).

Multithreading: try to introduce threads into the extractors in order to make them faster.

Goals

There are two main goals to achieve:

  1. Merge the two projects to get a single way to analyze Wikipedia structures (lists and tables).
  2. Create a GUI to help users. Adding more comments and tips would also be helpful.

Another aspect that could be studied is how to speed up the analysis. The work can be reorganized into different threads (this is an additional goal, not an essential one).
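As a rough illustration of the threading idea, here is a minimal sketch using Python's standard `concurrent.futures` module. The function `extract_resource` is a hypothetical stand-in for a single extractor run; it is not taken from either project, and the returned strings are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_resource(resource):
    """Hypothetical stand-in for one extractor run on a single Wikipedia page."""
    # A real extractor would download and parse the page here (I/O-bound work).
    return resource, f"triples for {resource}"


def extract_all(resources, max_workers=4):
    """Run extract_resource over many pages using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input resources.
        return dict(pool.map(extract_resource, resources))


results = extract_all(["William_Gibson", "Neuromancer"])
```

In CPython, threads mainly help when the work is dominated by waiting on the network or disk, so whether this speeds up the extractors would depend on where they actually spend their time.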

Impact

DBpedia will have a single program for extracting data from Wikipedia article pages. Users will also get new facilities, such as a GUI and tips on how to work better with the application.

Warm up tasks

Study the parsers' code and explain a possible dictionary structure that can be used for both projects. Create a mockup of a GUI that organizes the user's work (e.g. how users add new rules or how they view statistics of the domain analysis).

Mentors

Luca Virgili, Krishanu Konar

Keywords

Python, RDF, Java

sachinmalepati commented 6 years ago

Hi, I am facing an issue when I run the list extractor project on my Mac. When I ran the command "python listExtractor.py s William_Gibson en", I got the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: com/machinelinking/main/JSONpediaException
    at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3139)
    at java.base/java.lang.Class.getMethodsRecursive(Class.java:3280)
    at java.base/java.lang.Class.getMethod0(Class.java:3266)
    at java.base/java.lang.Class.getMethod(Class.java:2063)
    at org.eclipse.jdt.internal.jarinjarloader.JarRsrcLoader.main(JarRsrcLoader.java:57)
Caused by: java.lang.ClassNotFoundException: com.machinelinking.main.JSONpediaException
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:466)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:563)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:496)
    ... 6 more

I think the main problem is in the execution of jsonpedia_wrapper.jar. I am running it on macOS 10.13.2 with Java 9.0.1. I tried googling the issue but didn't get any satisfactory results (I also tried setting the classpath to the jar location). Please help!

mgns commented 6 years ago

Please report this issue directly at https://github.com/dbpedia/list-extractor, as the developers will not necessarily see this comment here.

sachinmalepati commented 6 years ago

Hey everyone, I have worked on the warm-up tasks and I want to show them.

  1. A possible dictionary structure is the following:

    Class : { 
            Headers/Sections : {
                       lang1 : [ ],   #list of headers to explore
                       lang2 : [ ],
                       .. ,
                       ..
               }, 
            Ontology : {
                       lang1 : { " " : OntologyProperty,  #Ontology mappings
                                     " " : OntologyProperty, 
                                     ..
                                   },
                       lang2 : { .. }
            }
    }

    Class represents the type/domain of the resource. The list of headers to explore may not be useful for table-extractor, but it is necessary for list-extractor. Both projects should be restructured to make use of the dictionary above.

  2. For the GUI, I did a little research and found Django to be helpful, so I learned it on the fly and developed a mockup GUI that already has some functionality. This GUI is built for the table-extractor project, and I want to implement the same flow for the unified extractor.

    Step 1: The user enters the details of the resource, the language and other settings, then clicks the "explore" button.

Step 2: The GUI then shows all the headers/sections found in the tables, along with their mappings if present (basically the contents of the domain_settings.py file). By clicking "edit mappings", the user can add or edit mappings (this is not implemented yet). It also shows example mappings already present in the dictionary.

Step 3: Clicking "Extract Triples" then generates the corresponding .ttl file.
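To make the dictionary structure proposed in point 1 more concrete, here is a minimal Python sketch of how it might look as a nested dict with two lookup helpers. The class name, section headers, property names and helper functions are all illustrative placeholders, not the actual contents or API of either extractor.

```python
# All class names, headers and property names below are illustrative placeholders.
MAPPINGS = {
    "Writer": {
        "headers": {  # section headers to explore, per language (list extractor)
            "en": ["Bibliography", "Works"],
            "it": ["Opere"],
        },
        "ontology": {  # header -> ontology property, per language
            "en": {"Novels": "notableWork"},
            "it": {"Romanzi": "notableWork"},
        },
    },
}


def headers_for(domain, lang):
    """Return the section headers to explore, or an empty list if none are known."""
    return MAPPINGS.get(domain, {}).get("headers", {}).get(lang, [])


def property_for(domain, lang, header):
    """Return the mapped ontology property, or None if the header is unmapped."""
    return MAPPINGS.get(domain, {}).get("ontology", {}).get(lang, {}).get(header)
```

With this layout, the table extractor could ignore the `headers` branch and use only `ontology`, while the list extractor reads both.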

Here is the link to my work: https://github.com/sachinmalepati/table-extractor/tree/master/gui/gui_app Please provide feedback.

Thanks, Sachin Malepati.

krishh-konar commented 5 years ago

Continuing last year's project, we plan on adding the following things to this year's project. The following are the initial requirements:

@lucav48 and I will mentor this project.

shubhamtripathi-work commented 5 years ago

Did you find a solution for this? I am facing the same issue. I have a MacBook Air.

krishh-konar commented 5 years ago

This is due to a mismatch in Java versions. It is still an open issue, which needs to be resolved by isolating the underlying infrastructure. The plan for this year's GSoC is to isolate it, probably by containerising it.

In the meantime, you can install Java 8 alongside your current version and run the extractor with that.
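For anyone unsure which Java version they are running, here is a small sketch of a check that could be done before launching the extractor. Both helper functions are hypothetical and not part of the list-extractor code. The sketch relies on the fact that Java 8 and earlier report versions of the form `1.8.0_201`, while Java 9 and later put the major number first (e.g. `9.0.1`).

```python
import re
import subprocess


def java_major_version(version_string):
    """Parse a Java version such as '1.8.0_201' or '9.0.1' into its major number."""
    match = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognised Java version: {version_string!r}")
    first, second = match.group(1), match.group(2)
    # Java 8 and earlier report themselves as 1.x; Java 9+ put the major number first.
    if first == "1" and second is not None:
        return int(second)
    return int(first)


def installed_java_version(java_cmd="java"):
    """Ask the given java executable for its version string (printed to stderr)."""
    output = subprocess.run(
        [java_cmd, "-version"], capture_output=True, text=True
    ).stderr
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None
```

For example, `java_major_version("9.0.1")` gives 9, which matches the setup reported in this thread, while a Java 8 install would give 8.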

shubhamtripathi-work commented 5 years ago

Thank you for the clarification. I did check the main repository and found your comment suggesting the use of Java 8. Sorry for not letting you know.

I will add this to the documentation if it hasn't been added already. Again, thanks for the clarification.

krishh-konar commented 5 years ago

No issues. It would be more useful right now to look into the actual list and table extractor repos; that should give you a better idea of how the extractors really work. The plan for this project includes a decent amount of code restructuring, so it will help to understand the core logic behind the extractors.