diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

Scaling of MisplacedClassesOrProperties #24

Closed: jerdeb closed this issue 10 years ago

jerdeb commented 10 years ago

The MisplacedClassesOrProperties metric takes a long time to compute on EBI datasets.

muhammadaliqasmi commented 10 years ago

One of the primary reasons for these performance issues is that some of the quality metrics (like this one) use ..utilities.VocabularyReader.java to download vocabularies from the internet, which consumes considerable resources (especially time).

To improve the performance of VocabularyReader.java, we (@nfriesen and I) initially discussed and implemented a memory cache to reduce the number of downloads for vocabularies that are frequently used by the quality metrics.

However, we found that the memory cache alone was not sufficient. Therefore, a second level of caching (a file cache) has now been added to the current implementation.

For the file cache, vocabularies are downloaded and stored in the ../src/main/resources/models directory. The file cache is disabled if the 'models' directory is removed from that location, and it is enabled again if the 'models' directory is recreated at the same path.
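The directory-based on/off switch described above could be sketched roughly as follows. This is only an illustration: the class name `FileCacheToggle` and its method are hypothetical and are not taken from the actual VocabularyReader.java source.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the real VocabularyReader code.
final class FileCacheToggle {
    private final Path modelsDir;

    FileCacheToggle(Path modelsDir) {
        this.modelsDir = modelsDir;
    }

    // The file cache counts as enabled only while the directory exists.
    // The check is done on every call, so deleting or recreating the
    // directory takes effect without any other configuration change.
    boolean isEnabled() {
        return Files.isDirectory(modelsDir);
    }
}
```

With a check like this, removing the 'models' directory switches the file cache off and recreating it switches the cache back on.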

So the overall flow of VocabularyReader.java is: first look for the required vocabulary in the memory cache; if it is not found, look in the file cache (and, if it is found there, store it in the memory cache); if it is still not found, download the vocabulary from the web and store it in both the file cache and the memory cache.
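A minimal sketch of this lookup order is shown below. It is not the actual VocabularyReader.java: the class name `TieredVocabularyCache`, the use of plain strings for vocabulary content, the injected `downloader` function, and the file-naming scheme are all assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration of the memory -> file -> web lookup order.
final class TieredVocabularyCache {
    private final Map<String, String> memoryCache = new ConcurrentHashMap<>();
    private final Path modelsDir;                        // assumed file cache location
    private final Function<String, String> downloader;   // assumed stand-in for the web download

    TieredVocabularyCache(Path modelsDir, Function<String, String> downloader) {
        this.modelsDir = modelsDir;
        this.downloader = downloader;
    }

    String get(String namespace) {
        // 1. Memory cache.
        String cached = memoryCache.get(namespace);
        if (cached != null) {
            return cached;
        }
        // The file cache is only used while the directory exists.
        boolean fileCacheEnabled = Files.isDirectory(modelsDir);
        Path file = modelsDir.resolve(fileNameFor(namespace));
        try {
            String content;
            if (fileCacheEnabled && Files.isRegularFile(file)) {
                // 2. File cache hit.
                content = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            } else {
                // 3. Download, then populate the file cache if it is enabled.
                content = downloader.apply(namespace);
                if (fileCacheEnabled) {
                    Files.write(file, content.getBytes(StandardCharsets.UTF_8));
                }
            }
            // Whichever tier answered, keep the result in memory for next time.
            memoryCache.put(namespace, content);
            return content;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Simplistic file name derivation; distinct namespaces could collide.
    private static String fileNameFor(String namespace) {
        return namespace.replaceAll("[^A-Za-z0-9]", "_");
    }
}
```

In this sketch, a second `get` for the same namespace is served from memory, and a fresh instance pointed at the same directory is served from the file instead of triggering another download.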

Furthermore, I have excluded the contents of the 'models' directory from Git. Git will not stage, commit, or push anything inside this directory.