AGDISTIS fails with own index

ghost commented 7 years ago

Hi everyone,

right now I am trying to get AGDISTIS to work with an index other than DBpedia. For test purposes, I have created a tiny custom index and run AGDISTIS on it. But it does not return a proper URI for the disambiguated entity; it just returns the text/label of the entity instead.

My approach so far:

Create three files: labels_en.ttl, instance_types_en.ttl and en_surface_forms.tsv. I modelled them on your DBpedia 2014 index example from the wiki. They look like this:

labels_en.ttl: <http://www.technologyreview.com/s/602283> <http://www.w3.org/2000/01/rdf-schema#label> "QuantumComputer"@en .

instance_types_en.ttl: <http://www.technologyreview.com/s/602283> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/InformationAppliance> .

en_surface_forms.tsv: http://www.technologyreview.com/s/602283 Quantum Computer Computer

Run mvn exec:java -Dexec.mainClass="org.aksw.agdistis.util.TripleIndexCreator" to create the actual index.

Modify properties file:

nodeType=http://www.technologyreview.com/s/
edgeType=http://dbpedia.org/ontology/
baseURI =http://www.technologyreview.com/s
threshholdTrigram=0.5

Run AGDISTIS (mergeT branch) with the following code (it gets the entity labels and positions from the Stanford CoreNLP MentionsAnnotator):

public void disambiguateEntities() throws InterruptedException, IOException {
    NEDAlgo_HITS agdistis = new NEDAlgo_HITS();
    Document agdistisDocument = new Document();
    ArrayList<NamedEntityInText> entityList = new ArrayList<NamedEntityInText>();

    for (final CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
        for (final CoreMap entityMention : sentence.get(CoreAnnotations.MentionsAnnotation.class)) {
            entityList.add(
                new NamedEntityInText(entityMention.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class),
                        entityMention.get(CoreAnnotations.TextAnnotation.class).length(),
                        entityMention.get(CoreAnnotations.TextAnnotation.class)));

        }
    }

    NamedEntitiesInText namedEntitiesInText = new NamedEntitiesInText(entityList);
    DocumentText documentText = new DocumentText(text);

    agdistisDocument.addText(documentText);
    agdistisDocument.addNamedEntitiesInText(namedEntitiesInText);

    agdistis.run(agdistisDocument, null);

    NamedEntitiesInText namedEntities = agdistisDocument.getNamedEntitiesInText();

    for (NamedEntityInText namedEntity : namedEntities) {
        String disambiguatedURL = namedEntity.getNamedEntityUri();
        this.results.put(namedEntity.getStartPos(), disambiguatedURL);
    }
}

Now instead of returning QuantumComputer -> http://www.technologyreview.com/s/602283 it returns QuantumComputer -> QuantumComputer.
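A minimal sketch of how this symptom could be detected automatically: it checks whether each returned value is an absolute http(s) URI or just a label that came back unchanged. The class and method names are hypothetical and not part of AGDISTIS; only java.net.URI from the standard library is used.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of AGDISTIS.
class DisambiguationResultCheck {

    // Returns true only if the value parses as an absolute http(s) URI.
    static boolean looksLikeUri(String value) {
        if (value == null) {
            return false;
        }
        try {
            URI uri = new URI(value);
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            // Labels containing spaces or other illegal characters end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Start position -> value returned by the disambiguation step.
        Map<Integer, String> results = new LinkedHashMap<>();
        results.put(0, "QuantumComputer");
        results.put(42, "http://www.technologyreview.com/s/602283");

        for (Map.Entry<Integer, String> entry : results.entrySet()) {
            String status = looksLikeUri(entry.getValue()) ? "URI" : "label only";
            System.out.println(entry.getKey() + " -> " + entry.getValue() + " (" + status + ")");
        }
    }
}
```

A check like this makes it easier to count how many mentions fall back to their label when comparing the custom index with the DBpedia index.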

Is this an issue with my custom index? If I use the standard 2016 DBpedia index, my implementation works.

I would be very happy if you could provide an explanation of how to use a custom index that is a little bit more detailed than in the GitHub wiki. :-)

Thank you in advance!

PS: Once the custom index works, how can I add new triples to the already existing index? With the addDocumentToIndex method in TripleIndexCreator.java?

RicardoUsbeck commented 7 years ago

Hi, thanks for pen-testing our algorithm. First, an ontology file seems to be missing. AGDISTIS needs to construct a graph, i.e. you must have triples of the form (URI, predicate, URI), e.g. mappingbased_objects.ttl from DBpedia. Second, I will try out your code and see why it is returning a string and not a URI. Third, you should be able to extend the index, but we might need to adapt the code so that it can write to an existing index.
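To illustrate the (URI, predicate, URI) shape, here is a minimal sketch that tells triples with a URI object apart from triples with a literal object. It is a simplified line check for illustration, not a full N-Triples parser and not AGDISTIS code; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of AGDISTIS.
class TripleKindCheck {

    // Returns true if subject, predicate and object are all written as <...> URIs.
    static boolean isResourceTriple(String line) {
        String[] parts = line.trim().split("\\s+", 3);
        if (parts.length < 3) {
            return false;
        }
        return parts[0].startsWith("<")
                && parts[1].startsWith("<")
                && parts[2].startsWith("<");
    }

    public static void main(String[] args) {
        String[] lines = {
            // Literal object: supplies a label, but no edge between two entities.
            "<http://www.technologyreview.com/s/602283> <http://www.w3.org/2000/01/rdf-schema#label> \"QuantumComputer\"@en .",
            // URI object: connects two nodes.
            "<http://www.technologyreview.com/s/602283> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/ontology/InformationAppliance> ."
        };
        for (String line : lines) {
            System.out.println(isResourceTriple(line) + "  " + line);
        }
    }
}
```

Running it over a .ttl file line by line gives a quick impression of how many graph edges the file can contribute at all.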

ghost commented 7 years ago

Thanks for your reply, @RicardoUsbeck .

Could you kindly explain which ontology file you mean? What contents does it need to have? In general, what are the minimum required files and triples to get a custom index up and running? I am completely new to the Semantic Web, so please forgive my lack of knowledge.

I created a mappingbased_properties_en.ttl file like in the 2014 dbpedia dataset with the following content:

...

<http://www.technologyreview.com/resource/602283> <http://xmlns.com/foaf/0.1/name> "QuantumComputer"@en .
<http://www.technologyreview.com/resource/602283> <http://www.technologyreview.com/ontology/field> <http://www.technologyreview.com/resource/computing> .

...

I also changed the edgeType property to edgeType=http://www.technologyreview.com/ontology/ and built a new index. Unfortunately, the overall behaviour did not change. I still receive only the label string.

Edit: I tried to completely simulate the DBpedia 2014 index by creating all the files it has (of course with fewer entries) and only exchanged the DBpedia URI inside. It still does not work. Is there perhaps a minimum number of triples required?

Regarding index expansion: I would even be happy, if you could provide me a hint where to adapt the code, so I can try it on my own.

Just to put things into perspective: My use case is to have a small index with a domain-specific ontology. I want to disambiguate on that index. And if I find new entities with Stanford CoreNLP, I want to add those new entities and their properties to the index.

ghost commented 7 years ago

I have made a new observation.

I switched back to the default DBpedia 2016 index, but forgot to change my AGDISTIS properties file. So the entries for nodeType, edgeType and baseURI still were http://example.com/resource/ etc. instead of http://dbpedia.org/resource/ and so on.

With these wrong properties, I get the same error as before - the returned URI is just the label/name string of the entity, although I am using the default DBpedia index.

But I am quite sure that my previous properties were correct because, as I said in the edited paragraph of my previous post, I just replaced the "dbpedia.org" string with "example.com" in the properties file and in the .ttl files.

Does this observation help to solve the problem?
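A small sanity check along these lines, as a sketch only: it assumes, as the observation suggests, that the configured nodeType prefix has to match the entity URIs stored in the index. How AGDISTIS uses these properties internally is not confirmed here, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of AGDISTIS.
class PrefixConsistencyCheck {

    // Parses properties from text; a real setup would read the properties file instead.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Assumption: entity URIs in the index should start with the nodeType prefix.
    static boolean matchesNodeType(Properties props, String entityUri) {
        String nodeType = props.getProperty("nodeType");
        return nodeType != null && entityUri.startsWith(nodeType.trim());
    }

    public static void main(String[] args) {
        String entity = "http://dbpedia.org/resource/Berlin";

        Properties wrong = parse("nodeType=http://example.com/resource/\n");
        Properties right = parse("nodeType=http://dbpedia.org/resource/\n");

        System.out.println("example.com prefix matches: " + matchesNodeType(wrong, entity));
        System.out.println("dbpedia.org prefix matches: " + matchesNodeType(right, entity));
    }
}
```

If a check like this fails for the URIs in labels_en.ttl, the properties and the index disagree and the configuration is worth fixing before looking any further.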

RicardoUsbeck commented 7 years ago

I will take a look at your data today or Monday. Could you please upload your example files?

ghost commented 7 years ago

Sure, you will find them here: https://drive.google.com/drive/folders/0BycW_RxvAHdzZkhmS19vTGdFT1k?usp=sharing

The provided properties file is from my second test.

Thank you very much for your time!

ghost commented 7 years ago

Hi @RicardoUsbeck ,

I really do not want to rush you, but have you already had a chance to look at the data or the problem?

DiegoMoussallem commented 7 years ago

@Phauly1, @RicardoUsbeck is dealing with other things regarding AGDISTIS, so I'm here to help you. The problem is that AGDISTIS uses a whitelist of types to avoid bad URIs and to keep named entities rather than common entities, and you forgot to include your types in the whiteList. You have two options: either comment out the respective line in CandidateUtil.java (line 152), or add your types to the whiteList file. In the upcoming release of AGDISTIS, this will be a parameter and no longer a list, so we hope to avoid this kind of problem in the future. In addition, I created a test class for your case: https://hastebin.com/onurozijen.vbs . Let me know if you need anything else; otherwise, I will close this issue.
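A minimal sketch of the whitelist idea, to show why a candidate with an unlisted type is dropped. This is not the code from CandidateUtil.java; the class, its methods and the example whitelist entries are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of a type whitelist, not the AGDISTIS implementation.
class TypeWhitelistFilter {

    private final Set<String> allowedTypes;

    TypeWhitelistFilter(Collection<String> allowedTypes) {
        this.allowedTypes = new HashSet<>(allowedTypes);
    }

    // A candidate is kept only if at least one of its rdf:type values is whitelisted.
    boolean accepts(Collection<String> candidateTypes) {
        for (String type : candidateTypes) {
            if (allowedTypes.contains(type)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String customType = "http://dbpedia.org/ontology/InformationAppliance";

        // Illustrative whitelist entries; the real whitelist file may differ.
        TypeWhitelistFilter before = new TypeWhitelistFilter(Arrays.asList(
                "http://dbpedia.org/ontology/Person",
                "http://dbpedia.org/ontology/Place"));
        TypeWhitelistFilter after = new TypeWhitelistFilter(Arrays.asList(
                "http://dbpedia.org/ontology/Person",
                "http://dbpedia.org/ontology/Place",
                customType));

        System.out.println("before adding the type: " + before.accepts(Arrays.asList(customType)));
        System.out.println("after adding the type:  " + after.accepts(Arrays.asList(customType)));
    }
}
```

With a filter of this kind, a candidate whose only type is missing from the list never reaches the ranking step, which would be consistent with the mention keeping its label instead of a URI.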

ghost commented 7 years ago

Hi @DiegoMoussallem ,

I have added the two new instance types to the whiteList.txt and your test method is working. That is awesome, thank you!

But I have a few other related questions and it would be great if you can answer them.

1. Do I have to strictly follow the DBpedia conventions when creating a new index? That is, do I need all the .ttl files such as disambiguations.ttl, redirects_transitive_en.ttl and so on, or are labels_en.ttl and instance_types_en.ttl enough? Can I even create my own .ttl files with custom properties/predicates?
2. @RicardoUsbeck said that it might be necessary to adapt the AGDISTIS code in order to extend an already existing index with new triples (at runtime). How could I do that? Otherwise, is it possible to run AGDISTIS with two separate indexes in parallel?

Thank you!

DiegoMoussallem commented 7 years ago

@Phauly1 nice that you tried it and that it worked for you!

1 - It is not exactly a DBpedia convention; it is about structured data and graphs. The redirects_transitive_en.ttl and disambiguation files are important. For instance, the disambiguation file makes it possible to deal with very ambiguous mentions such as "German" (http://dbpedia.org/page/German), so that AGDISTIS can go through all candidate entities and decide which one is correct. The redirects_transitive file also allows AGDISTIS to walk through the graph more efficiently and avoid incorrect entities. I would suggest having a look at the knowledge graph literature for a good understanding.
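As a rough illustration of what redirect data buys you, the sketch below follows redirects from an alias URI to its target, with a guard against cycles. It is not how AGDISTIS stores or reads redirects; the class name and the example.org URIs are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of redirect resolution, not the AGDISTIS implementation.
class RedirectResolver {

    private final Map<String, String> redirects = new HashMap<>();

    void addRedirect(String from, String to) {
        redirects.put(from, to);
    }

    // Follows redirects until a URI without a redirect is reached or a cycle is detected.
    String resolve(String uri) {
        Set<String> seen = new HashSet<>();
        String current = uri;
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    public static void main(String[] args) {
        RedirectResolver resolver = new RedirectResolver();
        // Hypothetical alias URIs pointing at one canonical entity.
        resolver.addRedirect("http://example.org/resource/QC", "http://example.org/resource/Quantum_computer");
        resolver.addRedirect("http://example.org/resource/Quantum_computer", "http://example.org/resource/Quantum_computing");

        System.out.println(resolver.resolve("http://example.org/resource/QC"));
    }
}
```

A transitive redirects file already contains the collapsed chains, so a single lookup would normally be enough; the loop only shows the general idea.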

2 - If you wish to run two indexes in parallel, each coming from a different knowledge base, i.e. a different graph (for instance, YAGO and DBpedia), I would suggest you comment out that line (the whitelist). Also, you would have to create another index parameter e.g index2=indexbyPhaulh1 along with an appropriated java code for it just replicating.

I hope I have answered your questions.

RicardoUsbeck commented 7 years ago

Thanks for the solution Diego.

@Phauly1 For 1) the most important thing is that it is a well-structured graph and that you have enough surface forms, i.e. rdfs:label properties, for each entity. For 2) in addition to Diego's suggestion, you could implement and test a method to include new triples at runtime. However, I am not available until March for such implementations. Feel free to do so and come back with questions.

ghost commented 7 years ago

Thank you for your answers. That helps me a lot.

For now, I will try the following:

I am going to use the standard DBpedia index as it is. Then I will create a second custom index with my own ontology and try to use both in parallel. And if there are new entities found by CoreNLP, they will be added to the custom index. In that way, AGDISTIS should disambiguate on two different indexes and only return an entity from my custom index, if it is not found in DBpedia.
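The fallback order described in this plan could be sketched as follows. The two lookups are stand-ins (plain maps) for "disambiguate against DBpedia" and "disambiguate against the custom index"; nothing here calls AGDISTIS, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical illustration of a two-index fallback, not AGDISTIS code.
class FallbackDisambiguator {

    private final List<Function<String, Optional<String>>> lookups;

    FallbackDisambiguator(List<Function<String, Optional<String>>> lookups) {
        this.lookups = lookups;
    }

    // Tries each lookup in order and returns the first URI found.
    Optional<String> disambiguate(String mention) {
        for (Function<String, Optional<String>> lookup : lookups) {
            Optional<String> uri = lookup.apply(mention);
            if (uri.isPresent()) {
                return uri;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<String, String> dbpedia = new HashMap<>();
        dbpedia.put("Berlin", "http://dbpedia.org/resource/Berlin");

        Map<String, String> custom = new HashMap<>();
        custom.put("QuantumComputer", "http://www.technologyreview.com/s/602283");

        Function<String, Optional<String>> primary = m -> Optional.ofNullable(dbpedia.get(m));
        Function<String, Optional<String>> secondary = m -> Optional.ofNullable(custom.get(m));

        FallbackDisambiguator disambiguator =
                new FallbackDisambiguator(Arrays.asList(primary, secondary));

        System.out.println(disambiguator.disambiguate("Berlin"));
        System.out.println(disambiguator.disambiguate("QuantumComputer"));
        System.out.println(disambiguator.disambiguate("Unknown"));
    }
}
```

Mentions that come back empty from both lookups are the ones that would be added to the custom index as new entities.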

(@DiegoMoussallem Could you explain in a little more detail what you mean by: "Also, you would have to create another index parameter e.g index2=indexbyPhaulh1 along with an appropriated java code for it just replicating."?)

I will use this GitHub issue for further questions and really appreciate your support so far.

DiegoMoussallem commented 7 years ago

Hi @Phauly1, I meant that you have to add another line to the agdistis.properties file pointing to the new index directory. You would also need to create another TripleIndex.java; however, if you keep the same structure as the DBpedia index, that is not necessary.
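Reading such an extra line could look roughly like the sketch below. The key index2 and the value indexbyPhaulh1 come from the suggestion earlier in this thread; the key index for the first directory, and the class and method names, are assumptions for illustration rather than confirmed AGDISTIS configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of AGDISTIS.
class IndexConfig {

    // Collects the configured index directories in a fixed order.
    static List<Path> indexDirectories(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Path> directories = new ArrayList<>();
        // "index" is an assumed key name; "index2" is the example key from this thread.
        for (String key : new String[] {"index", "index2"}) {
            String value = props.getProperty(key);
            if (value != null && !value.trim().isEmpty()) {
                directories.add(Paths.get(value.trim()));
            }
        }
        return directories;
    }

    public static void main(String[] args) {
        String config = "index=index\nindex2=indexbyPhaulh1\n";
        for (Path directory : indexDirectories(config)) {
            System.out.println("index directory: " + directory);
        }
    }
}
```

Each returned directory would then be opened by its own index reader, which is where reusing TripleIndex.java for a DBpedia-shaped index comes in.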

dice-group / AGDISTIS

AGDISTIS fails with own index #36