clarin-eric / VLO

Virtual Language Observatory
GNU General Public License v3.0
14 stars 6 forks source link

code change to use vlo-importer logic in curation-module #181

Closed wowasa closed 6 years ago

wowasa commented 6 years ago

The next release of the curation module will include the vlo-importer as a shared library to assure that curation and import use the same logic. In this context we need some slight modifications of the vlo-importer to make parts of the logic accessible:

  1. eu.clarin.cmdi.vlo.importer.processor.CMDIParserVTDXML Currently the method matchPattern identifies raw values and post-processes them in the same task. For curation we need a disassembling of the two tasks identification and post-processing, which allows the curation module to display raw and post-processed values.
  2. eu.clarin.cmdi.vlo.importer.mapping.FacetMappingFactory In addition to the static map of facetmappings the curation module needs a volatile map for mappings based on private profiles. Therefore we would like to derive a class which needs access to some so far private methods and the map conditionTargetSetPerFacet, which have to become protected.
twagoo commented 6 years ago

5001cb03da203d6d47d342260370765d30b944e4 introduces a change that breaks XInclude, since at the moment of parsing there is no known systemId. The solution probably is to make mappingDefinitionResolver.tryResolveUrlFileOrResourceStream return an InputSource. I will attempt this and see if it fixes it.

twagoo commented 6 years ago

Analysing a local import of ~50 records I have found some differences in the import results for current development and the issue181b* branches (before and after refactoring seems the same). For example the values and/or value counts for licenseType and resourceClass are different after import between development and 181b. I will create tests for these in the development branch so that we can debug 181b more easily and verify any fix.

twagoo commented 6 years ago

Apart from values and value counts, order also seems different in cases where language codes are present. I think this is due to how the new adds English content to the front of the list, which results in the last encountered English value becoming the first whereas I believe the old logic would keep order within the preferred language (i.e. first encountered English value becomes first).

Here's an example of a difference between current development and current issue181 (for this file):

Former (development) represents order in which values occur in the XML. Notice that in this case all values are in English anyway.

twagoo commented 6 years ago

As of 9c779a9 development has additional tests to cover some sample cases. Merged to the issues181b branches (bdc9ce2), for which a number of tests fail (full log):

Failed tests:   testOlac(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<10> but was:<9>
  testLrtCollection(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<Estonian> but was:<null>
  testCreateCMDIDataFromSession(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<English> but was:<null>
  testMultilingualValues(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Order should be preserved except for preferred language priority(..)
  testDerivedFacetsWithPostProcessing(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Language code -> language name post processing expected:<Dutch> but was:<null>
  testDefaultValuePostProcessing(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Explicit license filled in from availability expected:<PUB> but was:<null>
wowasa commented 6 years ago

order within a language is in any way different than the order before, since in branch 181b English values are put at the beginning of the list while non-English values at the end. But I assume that the order before was randomly since the list was filled by different sources (origin value, normalized value(s), derived facet value) and it would be even more randomly if we use CFM at a certain point. So in my view we can't just take the previous order as given but we have first to decide whether we need a certain order. If this is the case we also have to check whether the development branch fulfills the requirement even when using CFM. Anyhow, it wouldn't be a big deal to implement the old algorithm again.

twagoo commented 6 years ago

@wowasa the order we ended up with before was not entirely randomly. The logic was designed to keep the document order within each 'subscope' as their is often a certain semantics to the order in which values appear. So it would really be prefereable if that behaviour could be retained.

Some related discussions and code changes can be found in #83 and #44.

wowasa commented 6 years ago

implemented and merged into development

twagoo commented 6 years ago

134608c4fb2f4009ad2fac6535a9f50fa025d90d...ee076b807911c94ca2ed28070238e27a006c95bc fixes an issue probably introduced in the refactoring process for this issue which prevented values of types other than String from being added to Solr documents (which caused 'days since last seen' and the resource count values to be unavailable).