Closed wowasa closed 6 years ago
5001cb03da203d6d47d342260370765d30b944e4 introduces a change that breaks XInclude, since at the moment of parsing there is no known systemId. The solution probably is to make mappingDefinitionResolver.tryResolveUrlFileOrResourceStream
return an InputSource. I will attempt this and see if it fixes it.
Analysing a local import of ~50 records I have found some differences in the import results for current development and the issue181b* branches (before and after refactoring seems the same). For example the values and/or value counts for licenseType
and resourceClass
are different after import between development and 181b. I will create tests for these in the development branch so that we can debug 181b more easily and verify any fix.
Apart from values and value counts, order also seems different in cases where language codes are present. I think this is due to how the new adds English content to the front of the list, which results in the last encountered English value becoming the first whereas I believe the old logic would keep order within the preferred language (i.e. first encountered English value becomes first).
Here's an example of a difference between current development and current issue181 (for this file):
development
branch
"description":["{code:eng}various arabic dialects",
"{code:eng}Dictionary Gate is a growing collection of lexical resources",
"{code:eng}a simple search interface to the dictionaries in the dict-gate"],
vs issue181b
branch
"description":["{code:eng}a simple search interface to the dictionaries in the dict-gate",
"{code:eng}Dictionary Gate is a growing collection of lexical resources",
"{code:eng}various arabic dialects"],
Former (development
) represents order in which values occur in the XML. Notice that in this case all values are in English anyway.
As of 9c779a9 development has additional tests to cover some sample cases. Merged to the issues181b branches (bdc9ce2), for which a number of tests fail (full log):
Failed tests: testOlac(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<10> but was:<9>
testLrtCollection(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<Estonian> but was:<null>
testCreateCMDIDataFromSession(eu.clarin.cmdi.vlo.importer.CMDIDataProcessorTest): expected:<English> but was:<null>
testMultilingualValues(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Order should be preserved except for preferred language priority(..)
testDerivedFacetsWithPostProcessing(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Language code -> language name post processing expected:<Dutch> but was:<null>
testDefaultValuePostProcessing(eu.clarin.cmdi.vlo.importer.MetadataImporterTest): Explicit license filled in from availability expected:<PUB> but was:<null>
order within a language is in any way different than the order before, since in branch 181b English values are put at the beginning of the list while non-English values at the end. But I assume that the order before was randomly since the list was filled by different sources (origin value, normalized value(s), derived facet value) and it would be even more randomly if we use CFM at a certain point. So in my view we can't just take the previous order as given but we have first to decide whether we need a certain order. If this is the case we also have to check whether the development branch fulfills the requirement even when using CFM. Anyhow, it wouldn't be a big deal to implement the old algorithm again.
@wowasa the order we ended up with before was not entirely randomly. The logic was designed to keep the document order within each 'subscope' as their is often a certain semantics to the order in which values appear. So it would really be prefereable if that behaviour could be retained.
Some related discussions and code changes can be found in #83 and #44.
implemented and merged into development
134608c4fb2f4009ad2fac6535a9f50fa025d90d...ee076b807911c94ca2ed28070238e27a006c95bc fixes an issue probably introduced in the refactoring process for this issue which prevented values of types other than String from being added to Solr documents (which caused 'days since last seen' and the resource count values to be unavailable).
The next release of the curation module will include the vlo-importer as a shared library to assure that curation and import use the same logic. In this context we need some slight modifications of the vlo-importer to make parts of the logic accessible: