dice-group / LIMES

Link Discovery Framework for Metric Spaces.
https://limes.demos.dice-research.org/
GNU Affero General Public License v3.0

How to map a single dataset containing multiple sources to itself? #255

Open KonradHoeffner opened 2 years ago

KonradHoeffner commented 2 years ago

Is it possible to use LIMES with more than two sources that are all included in the same file? The sources should be mapped to each other, but of course I don't want to map a source to itself, and I also don't want duplicate pairs (A,B) and (B,A). To clarify with an example, let's say I have a class :Country with many instances, and each country has a population of individuals. All of this data is in the same file, countries.ttl. Now I want to find out which individuals live in more than one country.

:Germany a :Country;
 rdfs:label "Germany".

:Azerbaijan a :Country;
 rdfs:label "Azerbaijan".

:person123 a :Person;
 rdfs:label "Alex Müller";
 :country :Germany.

:person456 a :Person;
 rdfs:label "Alex Mueller";
 :country :Azerbaijan.

This can be done in the following manner, declaring source and target alike:

    <SOURCE>
        <ID>c1</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c1</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c1 a :Person; :country ?x.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </SOURCE>

    <TARGET>
        <ID>c2</ID>
        <ENDPOINT>countries.ttl</ENDPOINT>
        <VAR>?c2</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c2 a :Person; :country ?y.</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <TYPE>TURTLE</TYPE>
    </TARGET>

   <METRIC>trigrams(c1.label,c2.label)</METRIC>

However, this generates a false match of every person to itself and also matches each pair twice, once in each direction. I would like to add a restriction like "STR(?x) < STR(?y)", but it seems one cannot reference variables from the source in the restriction of the target. A workaround is to throw away all matches with a score of exactly 1.0, but this wastes resources and also discards correct matches whose labels happen to be exactly equal. In addition, this configuration maps people in a country to others in the same country, which is not intended.

    <ACCEPTANCE>
        <THRESHOLD>1</THRESHOLD>
        <FILE>exact.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </ACCEPTANCE>

    <REVIEW>
        <THRESHOLD>0.8</THRESHOLD>
        <FILE>close.ttl</FILE>
        <RELATION>owl:sameAs</RELATION>
    </REVIEW>

Another way is to post-process the output and remove all duplicates and self-matches, but that seems inefficient in both developer time and execution time.
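For reference, the post-processing variant can be fairly small. The sketch below assumes the links have already been parsed into (source, target, score) triples; the function name `dedupe_links` is hypothetical and not part of LIMES. It orients each pair by comparing the two URIs as strings, the same idea as the STR(?x) < STR(?y) filter but applied to the linked resources instead of their countries. It does not address the same-country problem, which would need each person's country as additional input.

```python
from typing import Iterable


def dedupe_links(
    links: Iterable[tuple[str, str, float]],
) -> list[tuple[str, str, float]]:
    """Drop self-matches and keep one direction of each symmetric pair."""
    best: dict[tuple[str, str], float] = {}
    for source, target, score in links:
        if source == target:
            continue  # self-match, e.g. (A, A)
        # Canonical orientation: lexicographically smaller URI first.
        key = (source, target) if source < target else (target, source)
        # If both directions are present, keep the higher score.
        best[key] = max(score, best.get(key, 0.0))
    return [(s, t, score) for (s, t), score in sorted(best.items())]
```

The cost is a single pass over the links plus a sort, so for a few thousand links the execution time should be negligible compared with the matching itself.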

Lastly, I could write a script that enumerates all n*(n-1)/2 unique, non-self-matching pairs and generates as many LIMES configuration files, but that has its own problems.
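To illustrate how many configurations this would require, here is a minimal sketch that enumerates the unordered pairs. It assumes the sources are identified by plain names; "France" is a made-up third country added only for illustration, and the helper `unordered_pairs` is hypothetical.

```python
from itertools import combinations


def unordered_pairs(sources: list[str]) -> list[tuple[str, str]]:
    """Return every unordered pair of distinct sources exactly once."""
    return list(combinations(sorted(sources), 2))


pairs = unordered_pairs(["Germany", "Azerbaijan", "France"])
# One LIMES configuration file would be needed per pair: n*(n-1)/2 in total.
for left, right in pairs:
    print(left, right)
```

With n sources this produces n*(n-1)/2 pairs, so the number of configuration files grows quadratically.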

Is there any way to solve this task efficiently using LIMES, or do I need to use one of the imperfect options mentioned above?

KonradHoeffner commented 2 years ago

Thanks to @MSherif, I had partial success with MINUS(TRIGRAMS(c1.label,c2.label)|0.5,EXACTMATCH(c1.x,c2.y)|1). However, that still contains duplicates, and it seems they cannot be removed with LIMES as there is no "less than" operator.

MSherif commented 2 years ago

We added the new lessThan String measure. Please test it and close the issue if it is OK.

KonradHoeffner commented 2 years ago

Unfortunately it doesn't seem to work for me. Did I make a mistake with the combined metric? I don't really understand the documentation on what exactly MINUS, MAX and LESS_THAN output.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LIMES SYSTEM "limes.dtd">
<LIMES>
    <PREFIX>
        <NAMESPACE>http://hitontology.eu/ontology/</NAMESPACE>
        <LABEL>hito</LABEL>
    </PREFIX>
    <PREFIX>
        <NAMESPACE>http://www.w3.org/1999/02/22-rdf-syntax-ns#</NAMESPACE>
        <LABEL>rdf</LABEL>
    </PREFIX>
    <PREFIX>
        <NAMESPACE>http://www.w3.org/2000/01/rdf-schema#</NAMESPACE>
        <LABEL>rdfs</LABEL>
    </PREFIX>
    <PREFIX>
        <NAMESPACE>http://www.w3.org/2002/07/owl#</NAMESPACE>
        <LABEL>owl</LABEL>
    </PREFIX>
    <PREFIX>
        <NAMESPACE>http://www.w3.org/2004/02/skos/core#</NAMESPACE>
        <LABEL>skos</LABEL>
    </PREFIX>

    <SOURCE>
        <ID>c1</ID>
        <ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
        <VAR>?c1</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c1 a hito:FeatureClassified</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
        <OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
        <TYPE>SPARQL</TYPE>
    </SOURCE>

    <TARGET>
        <ID>c2</ID>
        <ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
        <VAR>?c2</VAR>
        <PAGESIZE>-1</PAGESIZE>
        <RESTRICTION>?c2 a hito:FeatureClassified</RESTRICTION>
        <PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
        <PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
        <OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
        <TYPE>SPARQL</TYPE>
    </TARGET>

    <METRIC>MINUS(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>

    <ACCEPTANCE>
        <THRESHOLD>1</THRESHOLD>
        <FILE>catalogue-exact.ttl</FILE>
        <RELATION>skos:closeMatch</RELATION>
    </ACCEPTANCE>

    <REVIEW>
        <THRESHOLD>0.5</THRESHOLD>
        <FILE>catalogue-close.ttl</FILE>
        <RELATION>skos:closeMatch</RELATION>
    </REVIEW>

    <EXECUTION>
        <REWRITER>default</REWRITER>
        <PLANNER>default</PLANNER>
        <ENGINE>default</ENGINE>
    </EXECUTION>

    <OUTPUT>CSV</OUTPUT>
</LIMES>

Although the metric requires c1.cat to be less than c2.cat, the resulting catalogue-close.ttl still contains symmetric pairs:

<http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient>   <http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> 0.618421052631579
<http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement>    <http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement>    0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> <http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient>   0.618421052631579
<http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders>   0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders>   <http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> 0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders>  <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> 0.52
<http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement>    <http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement>    0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders>  0.52

MSherif commented 2 years ago

MIN(m1, m2): Computes the intersection of the two mappings m1 and m2. If an entry (i.e., a link) exists in both mappings, the minimal similarity is taken.

MAX(m1, m2): Computes the union of the two mappings m1 and m2. If an entry (i.e., a link) exists in both mappings, the maximal similarity is taken.

MINUS(m1, m2): Computes the difference of the two mappings, i.e., the set difference m1 - m2.

MSherif commented 2 years ago

Please try <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>

KonradHoeffner commented 2 years ago

Thank you for the detailed explanation, this is extremely helpful! Could you add this to the official documentation at http://dice-group.github.io/LIMES/#/user_manual/configuration_file/defining_link_specifications?id=boolean-operations? I know what minimum, maximum and set difference are, but the interaction with the thresholds was not clear to me. However, what I still don't know is: what is the similarity score output of the MINUS operator? Is it the one from the first parameter? And what happens if something is below the threshold?

KonradHoeffner commented 2 years ago

Unfortunately, <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC> does not do the trick. If I replace the metric in the full specification given above (you can run it yourself to verify if you want), it returns a number of self-matches, where source and target are identical:

<http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness>   <http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness>   1.0
<http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare>   <http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare>   1.0
<http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices>  <http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices>  1.0
<http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication>  <http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication>  1.0
[...]

However, this should not be possible, because, for example, http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness has only one catalogue, and this cannot be smaller than itself, as required by LESS_THAN(c1.cat,c2.cat).

Output of LIMES

$ limes test-sparql.xml                   
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
09:13:15.813 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.821 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.859 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.860 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.873 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.874 [main] [] INFO  org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:16.205 [main] [] WARN  org.apache.sis.system:228 - The “SIS_DATA” environment variable is not set.
09:13:17.171 [main] [] INFO  org.aksw.limes.core.controller.Controller:237 - Mapping task finished in 1218 ms
09:13:17.175 [main] [] INFO  org.aksw.limes.core.controller.Controller:241 - Mapping size: 620 (accepted) + 1520 (need verification) = 2140 (total)
09:13:17.176 [main] [] INFO  org.aksw.limes.core.controller.Controller:108 - Writing result files...
09:13:17.176 [main] [] INFO  org.aksw.limes.core.io.serializer.SerializerFactory:32 - Getting serializer with name CSV
09:13:17.199 [main] [] INFO  org.aksw.limes.core.controller.Controller:111 - Writing statistics file...

MSherif commented 2 years ago

> Thank you for the detailed explanation, this is extremely helpful! Could you add this to the official documentation at http://dice-group.github.io/LIMES/#/user_manual/configuration_file/defining_link_specifications?id=boolean-operations? I know what minimum, maximum, and set differences are but the interaction with the thresholds was not clear to me. However what I still don't know is: What is the similarity score output of the MINUS operator? The ones from the first parameter? And what if something is below the threshold?

Actually, MIN(m1, m2) returns the entries (i.e., links) with their minimum similarity across m1 and m2, where entries missing from either mapping are assumed to have a similarity of 0. Therefore, if a link l exists only in m1, for instance, we consider m2 to contain the same link l with a similarity of 0, and we do not return l because its minimum similarity would be 0. MAX(m1, m2) follows the same convention for missing entries. MINUS(m1, m2) returns only those links from m1, with their respective similarities, that do not exist in m2.
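The semantics described in this comment can be modelled outside LIMES with plain dictionaries. The following is only a sketch of that description, not LIMES's actual implementation: a mapping is assumed to be a dict from a (source, target) pair to a similarity, the function names are made up for illustration, and thresholds such as `|0.5` are not modelled.

```python
Mapping = dict[tuple[str, str], float]


def op_min(m1: Mapping, m2: Mapping) -> Mapping:
    """Intersection: links in both mappings, with the smaller similarity."""
    return {link: min(sim, m2[link]) for link, sim in m1.items() if link in m2}


def op_max(m1: Mapping, m2: Mapping) -> Mapping:
    """Union: links in either mapping, with the larger similarity."""
    links = m1.keys() | m2.keys()
    # A link missing from one mapping counts as similarity 0 there.
    return {link: max(m1.get(link, 0.0), m2.get(link, 0.0)) for link in links}


def op_minus(m1: Mapping, m2: Mapping) -> Mapping:
    """Difference: links of m1 absent from m2, similarities unchanged."""
    return {link: sim for link, sim in m1.items() if link not in m2}


# Toy data: "a" and "b" stand in for two resources.
trigrams = {("a", "b"): 0.6, ("b", "a"): 0.6, ("a", "a"): 1.0}
# Assumed result of a less-than measure: only the pair where c1.cat < c2.cat.
less_than = {("a", "b"): 1.0}

print(op_min(trigrams, less_than))
print(op_minus(trigrams, less_than))
```

Under this model, MINUS with the less-than mapping as its second argument keeps exactly the pairs that the less-than measure rejects, while MIN keeps only the pairs it accepts, with the similarity taken from the trigram mapping.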

MSherif commented 2 years ago

Done updating the LIMES docs