Open KonradHoeffner opened 2 years ago
Thanks to @MSherif I had partial success with MINUS(TRIGRAMS(c1.label,c2.label)|0.5,EXACTMATCH(c1.x,c2.y)|1)
However, that still contains duplicates, and it seems that those cannot be removed with LIMES as there is no "less than" operator.
We added the new lessThan string measure. Please test it and close the issue if it is OK.
Unfortunately it doesn't seem to work for me. Did I make a mistake with the combined metric? I don't really understand the documentation on what exactly MINUS, MAX and LESS_THAN output.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE LIMES SYSTEM "limes.dtd">
<LIMES>
<PREFIX>
<NAMESPACE>http://hitontology.eu/ontology/</NAMESPACE>
<LABEL>hito</LABEL>
</PREFIX>
<PREFIX>
<NAMESPACE>http://www.w3.org/1999/02/22-rdf-syntax-ns#</NAMESPACE>
<LABEL>rdf</LABEL>
</PREFIX>
<PREFIX>
<NAMESPACE>http://www.w3.org/2000/01/rdf-schema#</NAMESPACE>
<LABEL>rdfs</LABEL>
</PREFIX>
<PREFIX>
<NAMESPACE>http://www.w3.org/2002/07/owl#</NAMESPACE>
<LABEL>owl</LABEL>
</PREFIX>
<PREFIX>
<NAMESPACE>http://www.w3.org/2004/02/skos/core#</NAMESPACE>
<LABEL>skos</LABEL>
</PREFIX>
<SOURCE>
<ID>c1</ID>
<ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
<VAR>?c1</VAR>
<PAGESIZE>-1</PAGESIZE>
<RESTRICTION>?c1 a hito:FeatureClassified</RESTRICTION>
<PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
<PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
<OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
<TYPE>SPARQL</TYPE>
</SOURCE>
<TARGET>
<ID>c2</ID>
<ENDPOINT>https://hitontology.eu/sparql</ENDPOINT>
<VAR>?c2</VAR>
<PAGESIZE>-1</PAGESIZE>
<RESTRICTION>?c2 a hito:FeatureClassified</RESTRICTION>
<PROPERTY>rdfs:label AS nolang->lowercase->regularalphabet RENAME label</PROPERTY>
<PROPERTY>hito:featureCatalogue RENAME cat</PROPERTY>
<OPTIONAL_PROPERTY>rdfs:comment AS nolang->lowercase->regularalphabet RENAME comment</OPTIONAL_PROPERTY>
<TYPE>SPARQL</TYPE>
</TARGET>
<METRIC>MINUS(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>
<ACCEPTANCE>
<THRESHOLD>1</THRESHOLD>
<FILE>catalogue-exact.ttl</FILE>
<RELATION>skos:closeMatch</RELATION>
</ACCEPTANCE>
<REVIEW>
<THRESHOLD>0.5</THRESHOLD>
<FILE>catalogue-close.ttl</FILE>
<RELATION>skos:closeMatch</RELATION>
</REVIEW>
<EXECUTION>
<REWRITER>default</REWRITER>
<PLANNER>default</PLANNER>
<ENGINE>default</ENGINE>
</EXECUTION>
<OUTPUT>CSV</OUTPUT>
</LIMES>
Despite saying that c1.cat should be less than c2.cat, the resulting catalogue-close.ttl still contains symmetric pairs:
<http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient> <http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> 0.618421052631579
<http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement> <http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement> 0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiMapLocationOfHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiManageCertificationregistrationOfHealthcareProviders> 0.5245901639344263
<http://hitontology.eu/ontology/WhoDhiRemoteMonitoringOfClientHealthOrDiagnosticDataByProvider> <http://hitontology.eu/ontology/WhoDhiSelfMonitoringOfHealthOrDiagnosticDataByClient> 0.618421052631579
<http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders> 0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitRoutinePayrollPaymentToHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiTransmitNonroutineHealthEventAlertsToHealthcareProviders> 0.5540540540540541
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders> <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> 0.52
<http://hitontology.eu/ontology/WhoDhiRoutineHealthIndicatorDataCollectionAndManagement> <http://hitontology.eu/ontology/WhoDhiNonRoutineDataCollectionAndManagement> 0.6129032258064516
<http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToClientsForHealthServices> <http://hitontology.eu/ontology/WhoDhiTransmitOrManageIncentivesToHealthcareProviders> 0.52
MIN(m1, m2)
Computes the intersection of the two mappings m1 and m2. In case an entry (i.e., link) exists in both mappings, the minimal similarity is taken.
MAX(m1, m2)
Computes the union of the two mappings m1 and m2. In case an entry (i.e., link) exists in both mappings, the maximal similarity is taken.
MINUS(m1, m2)
Computes the difference of two mappings, i.e. the set difference m1 - m2.
Please try <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC>
Thank you for the detailed explanation, this is extremely helpful! Could you add this to the official documentation at http://dice-group.github.io/LIMES/#/user_manual/configuration_file/defining_link_specifications?id=boolean-operations? I know what minimum, maximum and set difference are but the interaction with the thresholds was not clear to me. However what I still don't know is: What is the similarity score output of the MINUS operator? The ones from the first parameter? And what if something is below the threshold?
Unfortunately, <METRIC>MIN(MAX(MAX(TRIGRAMS(c1.label,c2.label),TRIGRAMS(c1.label,c2.comment)),TRIGRAMS(c1.comment,c2.comment))|0.5,LESS_THAN(c1.cat,c2.cat)|1)</METRIC> does not do the trick. If I replace this in the full specification given above (you can run it yourself to verify if you want), it gives a bunch of identical results:
<http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness> <http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness> 1.0
<http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare> <http://hitontology.eu/ontology/EhrSfmSupportForResearchProtocolsRelativeToIndividualPatientCare> 1.0
<http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices> <http://hitontology.eu/ontology/BbDisplayVitalParametersFromMonitoringDevices> 1.0
<http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication> <http://hitontology.eu/ontology/WhoDhiTargetedClientCommunication> 1.0
[...]
However, this should not be possible, because for example http://hitontology.eu/ontology/EhrSfmSupportForHealthMaintenancePreventativeCareAndWellness only has one catalogue, and this cannot be smaller than itself, as specified in LESS_THAN(c1.cat,c2.cat).
$ limes test-sparql.xml
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
09:13:15.813 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.821 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.859 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-836321652.ser
09:13:15.860 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:15.860 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:125 - Checking for file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.860 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:128 - Found cached data. Loading data from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.873 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:134 - Cached data loaded successfully from file /home/konrad/projekte/hito/ontology/scripts/limes/cache/-1092215045.ser
09:13:15.874 [main] [] INFO org.aksw.limes.core.io.cache.HybridCache:135 - Size = 618
09:13:16.205 [main] [] WARN org.apache.sis.system:228 - The “SIS_DATA” environment variable is not set.
09:13:17.171 [main] [] INFO org.aksw.limes.core.controller.Controller:237 - Mapping task finished in 1218 ms
09:13:17.175 [main] [] INFO org.aksw.limes.core.controller.Controller:241 - Mapping size: 620 (accepted) + 1520 (need verification) = 2140 (total)
09:13:17.176 [main] [] INFO org.aksw.limes.core.controller.Controller:108 - Writing result files...
09:13:17.176 [main] [] INFO org.aksw.limes.core.io.serializer.SerializerFactory:32 - Getting serializer with name CSV
09:13:17.199 [main] [] INFO org.aksw.limes.core.controller.Controller:111 - Writing statistics file...
Actually, MIN(m1, m2) returns the entries (i.e., links) with the minimum similarity over m1 and m2, where an entry missing from either mapping is assumed to have a similarity of 0. Therefore, if a link l exists only in m1, for instance, we consider m2 to contain the same link l with a similarity of 0, and we do not return l, as its minimum similarity would be 0. MAX(m1, m2) follows the same convention, taking the maximum instead.
MINUS(m1, m2) only returns links from m1, with their respective similarities, and only if such links do not exist in m2.
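The semantics described above can be modelled with plain dictionaries. The following is only an illustrative sketch in Python, not LIMES code: the function names `mapping_min`, `mapping_max` and `mapping_minus` are hypothetical, a mapping is assumed to be a dict from (source, target) pairs to a similarity, and thresholds are assumed to have been applied to each operand before the operator runs.

```python
# Illustrative model of the operator semantics described above.
# This is NOT the LIMES implementation; all names here are hypothetical.
# A mapping is assumed to be a dict: (source, target) -> similarity.

def mapping_min(m1, m2):
    """Intersection: keep links present in both mappings, with the smaller similarity."""
    return {link: min(sim, m2[link]) for link, sim in m1.items() if link in m2}

def mapping_max(m1, m2):
    """Union: keep links present in either mapping, with the larger similarity."""
    result = dict(m1)
    for link, sim in m2.items():
        result[link] = max(sim, result.get(link, 0.0))
    return result

def mapping_minus(m1, m2):
    """Difference: keep links of m1 that are absent from m2, with m1's similarity."""
    return {link: sim for link, sim in m1.items() if link not in m2}

# Toy data: m1 plays the role of a similarity mapping, m2 of a filter mapping.
m1 = {("a", "b"): 0.6, ("b", "a"): 0.6, ("a", "a"): 1.0}
m2 = {("a", "b"): 1.0}

print(mapping_min(m1, m2))
print(mapping_max(m1, m2))
print(mapping_minus(m1, m2))
```

Under this model, MIN with a filter mapping keeps only the links that also pass the filter, while MINUS keeps exactly the links that do not pass it, in both cases carrying over the similarity from m1 when the filter scores 1.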
Done updating the LIMES docs
Is it possible to use LIMES with more than two sources which are all included in the same file? The sources should be mapped to each other, but of course I don't want to map a source to itself and I also don't want to have duplicate pairs (A,B) and (B,A). To clarify with an example, let's say I have a class :Country with many instances and each country has a population of individuals. All of this data is in the same file countries.ttl. Now I want to find out which individuals live in more than one country.
This can be done in the following manner, declaring source and target alike:
However, this will generate a false match of every person to itself, and it will also match each pair twice, once in each direction. I would like to add a restriction like "STR(?x) < STR(?y)" but it seems that one cannot reference variables from the source in the restriction of the target. A workaround is to throw away all matches with a score of exactly 1.0, but this is wasteful of resources and also discards correct matches that happen to be exactly equal. Also, this will map people in a country to others in the same country, which is not intended.
Another way is to perform postprocessing to remove all duplicate and self matches but that seems to be inefficient in both developer and execution time.
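As a rough idea of what such postprocessing could look like, here is a minimal Python sketch. It assumes each result line has the whitespace-separated form `<source> <target> score`, as in the output excerpts in this thread; the function name `deduplicate` and the example.org URIs are hypothetical.

```python
# Post-processing sketch: drop self matches and keep one direction of each symmetric pair.
# Assumption: every result line looks like "<source> <target> score".

def deduplicate(lines):
    seen = set()   # order-independent identities of pairs already kept
    kept = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        source, target, score = parts
        if source == target:
            continue  # self match
        key = frozenset((source, target))
        if key in seen:
            continue  # the reverse direction was already kept
        seen.add(key)
        kept.append((source, target, float(score)))
    return kept

# Hypothetical sample input with one symmetric pair and one self match.
sample = [
    "<http://example.org/A> <http://example.org/B> 0.52",
    "<http://example.org/B> <http://example.org/A> 0.52",
    "<http://example.org/A> <http://example.org/A> 1.0",
]
print(deduplicate(sample))
```

This keeps whichever direction of a pair appears first in the file, so the surviving direction depends on the output order.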
Lastly, I could write a script which would enumerate all n*(n-1)/2 unique non-self-matching pairs and generate as many LIMES configuration files, but that has its own problems.
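To make the pair count concrete, the enumeration itself could be sketched in Python as follows. The country names are placeholders, and the sketch only lists the pairs; it does not generate any LIMES configuration files.

```python
from itertools import combinations

# Hypothetical source identifiers; in practice these would be the :Country instances.
countries = ["A", "B", "C", "D"]

# combinations() yields each unordered pair once and never pairs an item with itself.
pairs = list(combinations(countries, 2))

n = len(countries)
print(len(pairs), n * (n - 1) // 2)
for source, target in pairs:
    print(source, target)
```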
Is there any way to solve this task efficiently using LIMES or do I need to use one of the mentioned imperfect options?