geonetwork / core-geonetwork

GeoNetwork is a catalog application to manage spatially referenced resources. It provides powerful metadata editing and search functions as well as an interactive web map viewer. It is currently used in numerous Spatial Data Infrastructure initiatives across the world.
http://geonetwork-opensource.org/
GNU General Public License v2.0

CSW GetRecords returns duplicated results #1728

Open josegar74 opened 8 years ago

josegar74 commented 8 years ago

Using the following CSW query:

<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" 
                xmlns:ogc="http://www.opengis.net/ogc" 
                xmlns:gmd="http://www.isotc211.org/2005/gmd" 
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
                xmlns:apiso="http://www.opengis.net/cat/csw/apiso/1.0" 
                service="CSW" 
                version="2.0.2" 
                maxRecords="10" 
                startPosition="1" 
                resultType="results" 
                outputSchema="http://www.isotc211.org/2005/gmd" 
                outputFormat="application/xml">
  <csw:Query typeNames="gmd:MD_Metadata">
    <csw:ElementSetName>full</csw:ElementSetName>
    <ogc:SortBy>
      <ogc:SortProperty>
        <ogc:PropertyName>apiso:Identifier</ogc:PropertyName>
        <ogc:SortOrder>ASC</ogc:SortOrder>
      </ogc:SortProperty>
    </ogc:SortBy>
  </csw:Query>
</csw:GetRecords>

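When startPosition is changed to 1, 11, 21, etc., the results obtained contain duplicated values.

See the returned uuids for startPosition 1, 11 and 21. The last uuid of the first page is repeated as the first uuid of the second page, and the last four uuids of the second page are repeated as the first four of the third page: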

{28B6A593-F97C-45D1-A906-F582C7FF60A6} {D2035256-F773-457D-89B4-E244DC969C6B} {D7BFE429-433D-4B83-9BDE-33821EE8F702} 03c77466-08b5-4112-9927-dd3446a231c2 0bc6e9d2-e6cd-433c-b00a-01e8dcbdcffe 0e0b1866-7437-4b5a-9828-c5db2f47ad49 147d3a15-61f3-42df-9c6b-d02cb0d0ea26 1e32097d-7e72-4318-b018-ae67faf9b430 1f8bd0da-4e6b-4482-8aa4-c4375f575ea4 22047628-fc1b-441f-a7a6-8d530bea7ec8

22047628-fc1b-441f-a7a6-8d530bea7ec8 232844d3-26c5-47e4-9dae-1386fc4647e9 24a0cb46-d2a4-4282-8ac6-7d8ee40e9d1d 24c018e0-e9bd-4315-8e64-b9dbe724710c 24e1c57f-0de7-4b69-9e00-bf82dd3371b4 466ed27a-d614-4925-970b-7050ca182e49 4a4f5e3c-91ce-4f1c-8bbf-cd8b2e87fb5d 50a40489-d126-4d2e-8c1c-5a23b7673206 57e57c41-49e3-4dda-8519-d46e02b06875 59129f4b-a61b-467c-9b41-1f92e4338151

4a4f5e3c-91ce-4f1c-8bbf-cd8b2e87fb5d 50a40489-d126-4d2e-8c1c-5a23b7673206 57e57c41-49e3-4dda-8519-d46e02b06875 59129f4b-a61b-467c-9b41-1f92e4338151 591e7f88-c443-4659-b8b7-23601d647ee6 59275bc8-d0a4-45e4-b054-2b1eeb3ee293 59379275-3f11-448a-9120-bb5da3af7f67 5acb98d4-a867-434f-958a-3d851c85cc55 5cfe8a91-3dc9-4cf6-a40a-6a6d6f3124ab 5e1f0181-9d14-450c-a59b-e06e5ac28a36

The cause is not clear. I am checking whether it is related to some metadata having curly braces in their uuids.

Tested in 3.2.x, but it seems to be the same in 3.0.x.
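
The overlap between the pages listed above can be checked mechanically. The following is a minimal sketch, not GeoNetwork code: the class and method names are made up for illustration, and the pages are abbreviated to a few uuids taken from the listings above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of GeoNetwork): reports uuids seen on more than one page.
public class PageOverlap {

    /** Returns every uuid that occurs more than once across all pages, in first-seen order. */
    static Set<String> duplicatesAcrossPages(List<List<String>> pages) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (List<String> page : pages) {
            for (String uuid : page) {
                // add() returns false when the uuid was already collected from an earlier position
                if (!seen.add(uuid)) {
                    duplicates.add(uuid);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Abbreviated pages: only the tail/head uuids from the listings above are included.
        List<List<String>> pages = new ArrayList<>();
        pages.add(List.of(
            "1f8bd0da-4e6b-4482-8aa4-c4375f575ea4",
            "22047628-fc1b-441f-a7a6-8d530bea7ec8"));
        pages.add(List.of(
            "22047628-fc1b-441f-a7a6-8d530bea7ec8",
            "232844d3-26c5-47e4-9dae-1386fc4647e9",
            "4a4f5e3c-91ce-4f1c-8bbf-cd8b2e87fb5d",
            "50a40489-d126-4d2e-8c1c-5a23b7673206",
            "57e57c41-49e3-4dda-8519-d46e02b06875",
            "59129f4b-a61b-467c-9b41-1f92e4338151"));
        pages.add(List.of(
            "4a4f5e3c-91ce-4f1c-8bbf-cd8b2e87fb5d",
            "50a40489-d126-4d2e-8c1c-5a23b7673206",
            "57e57c41-49e3-4dda-8519-d46e02b06875",
            "59129f4b-a61b-467c-9b41-1f92e4338151",
            "591e7f88-c443-4659-b8b7-23601d647ee6"));

        System.out.println("Duplicated uuids: " + duplicatesAcrossPages(pages));
    }
}
```

With correct pagination the returned set would be empty.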

josegar74 commented 8 years ago

I have tested removing the metadata whose uuids contain curly braces and got a similar result: some pages of the CSW response still have duplicates, so there is no relation with the curly braces suggested as a possible cause in the previous comment.

The total number of results in CSW is equal to the number of metadata records in the catalogue, but some uuids are not returned and duplicates of others are returned instead. This makes the error difficult to identify, and I am not sure whether it was introduced in 3.0.x or has been in GeoNetwork from the start.

Some of the uuids in my catalogue are not "standard" uuids, like:

RMI_ALARO_WCS_MD_57f647b628f083.72732840
706cd0654ec6d2b12e6279a907ba03ccb72586a

But I think that should not be an issue? @fxprunayre are you aware of any limitation about this? Or any idea what can cause this bizarre behaviour?

The code that retrieves the metadata is the following, but it is not clear for now what causes this behaviour:

https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/kernel/search/LuceneSearcher.java#L644

For that method, the documentation of the parameters numHits, startHit and endHit seems inaccurate, which causes quite some confusion.

     * @param numHits        the maximum number of hits to collect
     * @param startHit       the start hit to return in the TopDocs if not building summary
     * @param endHit         the end hit to return in the TopDocs if not building summary 

For example, querying CSW with startPosition=51 and maxRecords=10, the method parameters receive the following values:

     numHits= 61
     startHit=50
     endHit=10
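
The values above are consistent with the arithmetic sketched below. This is only an inference from the observed numbers and from the call site, which passes startPosition - 1 and maxRecords. The class and method names are hypothetical, and the formula for numHits is an assumption, not something confirmed in the GeoNetwork source.

```java
import java.util.Arrays;

// Hypothetical illustration of how the three parameters appear to be derived.
public class CswPaging {

    /** Returns {numHits, startHit, endHit} for a 1-based startPosition and a page size. */
    static int[] windowFor(int startPosition, int maxRecords) {
        int numHits = startPosition + maxRecords; // assumption: matches the observed 61 for (51, 10)
        int startHit = startPosition - 1;         // 0-based offset of the first record to return
        int endHit = maxRecords;                  // observed value is the page size, not an end index
        return new int[] {numHits, startHit, endHit};
    }

    public static void main(String[] args) {
        int[] w = windowFor(51, 10);
        System.out.println("numHits=" + w[0] + " startHit=" + w[1] + " endHit=" + w[2]);
        // prints "numHits=61 startHit=50 endHit=10"
        System.out.println(Arrays.toString(windowFor(1, 10)));
    }
}
```

If this reading is right, endHit behaves as a page size rather than as the "end hit" described in the Javadoc.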

fxprunayre commented 8 years ago

But I think that should not be an issue? @fxprunayre are you aware of any limitation about this? Or any idea what can cause this bizarre behaviour?

Could that be related to records having the same UUID in the XML or in the UUID column of the metadata table in the database (both should be the same)?

josegar74 commented 8 years ago

That doesn't seem to be the case. To test, I have done this:

1) Select distinct uuid from metadata --> 90 values, same as the total records

2) Do a CSW GetRecords with maxRecords=100 to get all results (90) and extract the uuids with an XSLT process

3) Compare the uuids from 1) and 2). The uuids are the same

The problem happens when pagination is used.
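
The comparison in step 3 can be sketched as follows. This is a minimal illustration with made-up uuids: the class name, method names and sample values are all hypothetical, and the two inputs stand in for the database query result and the uuids extracted from the CSW response.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for comparing the uuids in the database with those returned by CSW.
public class UuidDiff {

    /** Uuids present in the database but absent from the CSW results. */
    static Set<String> missingFrom(Set<String> database, Collection<String> csw) {
        Set<String> missing = new TreeSet<>(database);
        missing.removeAll(csw);
        return missing;
    }

    /** Uuids returned more than once by CSW. */
    static Set<String> repeatedIn(Collection<String> csw) {
        Set<String> seen = new HashSet<>();
        Set<String> repeated = new TreeSet<>();
        for (String uuid : csw) {
            if (!seen.add(uuid)) {
                repeated.add(uuid);
            }
        }
        return repeated;
    }

    public static void main(String[] args) {
        // Made-up sample values, only to show the shape of the check.
        Set<String> database = Set.of("uuid-a", "uuid-b", "uuid-c");
        List<String> pagedCsw = List.of("uuid-a", "uuid-a", "uuid-c");

        System.out.println("Missing:  " + missingFrom(database, pagedCsw));
        System.out.println("Repeated: " + repeatedIn(pagedCsw));
    }
}
```

Running the same check on the concatenated paged responses, instead of the single maxRecords=100 response, would show both symptoms at once: missing uuids and repeated ones.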

josegar74 commented 7 years ago

Checking LuceneSearcher.doSearchAndMakeSummary:

https://github.com/geonetwork/core-geonetwork/blob/develop/core/src/main/java/org/fao/geonet/kernel/search/LuceneSearcher.java#L655

TopFieldCollector tfc = TopFieldCollector.create(sort, numHits, true, 
    trackDocScores, trackMaxScore, docsScoredInOrder);

Evaluating tfc.topDocs() for the following CSW GetRecords intervals:

startPosition="1": numHits=11

[screenshot: topdocs-1]

startPosition="11": numHits=21

[screenshot: topdocs-11]

startPosition="61": numHits=71

[screenshot: topdocs-61]

In the last request the uuid with the curly braces is returned, but it's at the first position, which causes the issue.

1) It is not clear why it is not returned in the previous requests.

2) It is also not clear why numHits is increased. Maybe it is related to the sorting, but then, if the catalogue contains many records, this can probably become a memory issue when requesting the last pages.
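
Point 2 can be illustrated with a plain-Java sketch of "collect the top numHits, then slice out the page". This is not the Lucene implementation: the class and method names are hypothetical, and numHits = startPosition + maxRecords is assumed from the values observed above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of top-N collection followed by slicing; not Lucene code.
public class TopNPaging {

    /** Returns one page of sorted ids for a 1-based startPosition. */
    static List<String> page(List<String> ids, int startPosition, int maxRecords) {
        int numHits = startPosition + maxRecords; // assumption: grows with the requested page
        int startHit = startPosition - 1;
        return ids.stream()
            .sorted()            // total, deterministic order over all ids
            .limit(numHits)      // only the first numHits sorted entries are kept
            .skip(startHit)      // drop everything before the requested page
            .limit(maxRecords)   // keep a single page
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> ids = List.of("e", "c", "a", "d", "b", "f");
        System.out.println(page(ids, 1, 2));
        System.out.println(page(ids, 3, 2));
        System.out.println(page(ids, 5, 2));
    }
}
```

In this sketch the number of entries kept grows with the page requested, which is the memory concern above. With a single deterministic order consecutive pages cannot overlap, so the duplicates suggest that the order is not the same from one request to the next.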

As a side note:

In the q service the search is done in 2 steps, search and present, and both call LuceneSearcher.doSearchAndMakeSummary. The search step uses 10 hits and gets the same issue as CSW, but the present step uses the total hits, so the result returned is fine. The reason for this double call is not clear, but ok.

@fxprunayre, maybe this feedback helps to get an idea of the issue?

josegar74 commented 7 years ago

Adding the following code, which does something similar to the search/present steps done in q, solves the issue, but it looks a bit terrible (it calls LuceneSearcher.doSearchAndMakeSummary a second time with the total hits):

        Pair<TopDocs, Element> searchResults = LuceneSearcher.doSearchAndMakeSummary(numHits, startPosition - 1,
            maxRecords, _lang.presentationLanguage,
            luceneConfig.getSummaryTypes().get(resultType.toString()), luceneConfig,
            reader, _query, wrapSpatialFilter(),
            _sort, taxonomyReader, buildSummary
        );
        TopDocs hits = searchResults.one();
        Element summary = searchResults.two();

        numHits = Integer.parseInt(summary.getAttributeValue("count"));
        if (Log.isDebugEnabled(Geonet.CSW_SEARCH))
            Log.debug(Geonet.CSW_SEARCH, "Records matched : " + numHits);

        // NEW CODE: use numHits (total number of results)
        searchResults = LuceneSearcher.doSearchAndMakeSummary(numHits, startPosition - 1,
                maxRecords, _lang.presentationLanguage,
                luceneConfig.getSummaryTypes().get(resultType.toString()), luceneConfig,
                reader, _query, wrapSpatialFilter(),
                _sort, taxonomyReader, buildSummary
        );
        hits = searchResults.one();
        summary = searchResults.two();
        // END NEW CODE: use numHits (total number of results)

        // --- retrieve results

        List<ResultItem> results = new ArrayList<ResultItem>();

In any case, I am checking whether there's any issue with the TopFieldCollector when sorting by the uuid when values contain curly braces.

mauriziocosmai commented 7 years ago

I don't know if this is useful, but I had the same problem some days ago and solved it simply by regenerating the Lucene index from the administration page of GN (rel. 2.10).

fxprunayre commented 7 years ago

I made some tests on this by harvesting http://inspire.maaamet.ee/geoportal/csw and using this request:

<?xml version="1.0"?>
<csw:GetRecords xmlns:csw="http://www.opengis.net/cat/csw/2.0.2"
                xmlns:gmd="http://www.isotc211.org/2005/gmd"
                service="CSW" version="2.0.2"
                maxRecords="3" 
                startPosition="16" 
                resultType="results"
                outputSchema="http://www.opengis.net/cat/csw/2.0.2">
  <csw:Query typeNames="gmd:MD_Metadata" elementnameStrategy="geonetwork26">
    <csw:ElementName>/csw:Record/dc:identifier</csw:ElementName>
  </csw:Query>
</csw:GetRecords>

and testing paging, but I can't reproduce it for now.

josegar74 commented 7 years ago

Do those metadata records have curly braces in their uuids?

ghost commented 4 years ago

Hi,

Is this still a problem in GeoNetwork 3.8.2?

Thanks in advance.