DataONEorg / d1_cn_index_processor

The CN index processor component

geohashes, text fields not being indexed for SO documents #26

Closed gothub closed 3 years ago

gothub commented 3 years ago

Indexing rules (Spring beans) need to be added to application-context-schema-org.xml to populate the geohash_* and text fields. The existing classes should be reused to accomplish this, e.g., for EML these classes are used:

(see main/resources/application-context-eml-base.xml for details of EML bean definitions)

gothub commented 3 years ago

For the Solr 'text' field that is derived from an SO document, there are two different approaches the indexer can employ to extract the required values from the document to populate the Solr field:

  1. Use a SPARQL query that returns all string values from the SO document (where the RDF triple object is a literal value). This can be accomplished with the query:
    # The SO namespace IRI below is an assumption; use the one the indexer declares.
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX SO:  <http://schema.org/>
    SELECT DISTINCT (STR(?string) AS ?text)
    WHERE {
        {
            ?a ?b ?string .
            FILTER(datatype(?string) = xsd:string) .
        }
        UNION
        {
            ?a ?b ?string .
            FILTER(datatype(?string) = SO:HTML) .
        }
    }
    • the downside of this approach is that every string in the document will be retrieved, even strings that are not used to populate any other Solr field.
    • this is the approach used for EML document indexing.
  2. Concatenate the values of a specific list of fields that we are already retrieving explicitly from the document. For example, we would concatenate the values for 'title', 'abstract', 'keywords', etc., using the queries that are already defined for these fields.
    • the downside to this approach is that the list of items to retrieve may need to be updated in the future, e.g. for 'license' when it is added to the index.

Note that these two approaches produce different values for the Solr 'text' field.
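
As a rough illustration of why the two approaches end up with different 'text' values, here is a minimal Python sketch. It does not use the indexer's Spring beans or a SPARQL engine: the hand-built triples, the `SOLR_FIELDS` mapping, and both function names are hypothetical stand-ins, and the SO:HTML datatype branch of the query is left out for brevity.

```python
# Hypothetical triples standing in for a parsed SO (schema.org) document.
# Each entry is (subject, predicate, object_value, object_datatype).
XSD_STRING = "http://www.w3.org/2001/XMLSchema#string"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

TRIPLES = [
    ("_:dataset", "name", "Larval krill studies", XSD_STRING),
    ("_:dataset", "description", "Winter ecology of larval krill", XSD_STRING),
    ("_:dataset", "keywords", "krill", XSD_STRING),
    ("_:variable", "unitText", "decimal degrees", XSD_STRING),
    ("_:box", "north", "-65.08", XSD_DECIMAL),
]

# Hypothetical mapping: Solr field name -> predicate its value is extracted from.
SOLR_FIELDS = {"title": "name", "abstract": "description", "keywords": "keywords"}


def text_from_all_literals(triples):
    """Approach 1: every distinct string-typed literal in the document."""
    seen = []
    for _subject, _predicate, value, datatype in triples:
        if datatype == XSD_STRING and value not in seen:
            seen.append(value)
    return " ".join(seen)


def text_from_solr_fields(triples, solr_fields):
    """Approach 2: only the values that feed an explicitly defined Solr field."""
    wanted = set(solr_fields.values())
    values = [value for _s, predicate, value, _dt in triples if predicate in wanted]
    return " ".join(values)


all_text = text_from_all_literals(TRIPLES)
field_text = text_from_solr_fields(TRIPLES, SOLR_FIELDS)
print(len(all_text.split()), len(field_text.split()))
```

In this sketch the unit string "decimal degrees" appears only in the first result, because no entry in the hypothetical field list extracts it. Supporting a new field such as 'license' under the second approach would mean adding an entry to `SOLR_FIELDS`.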

@mbjones @datadavev @taojing2002 which method should be implemented?

gothub commented 3 years ago

The approaches described above will return these values for the Solr "text" field:

  1. When using a single SPARQL query that returns all strings from the document (word count: 3680):

    https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R https://www.example-data-repository.org
    Example Data Repository yrday_local http://lod.example-data-repository.org/id/dataset-parameter/20879
    local day and decimal time, as 326.5 for the 326th day of the year, or November 22 at 1200 hours (noon)
    latitude, in decimal degrees, North is positive, negative denotes South time_sample
    http://lod.example-data-repository.org/id/dataset-parameter/20863 minutes Number of minutes between collection and sampling for pigment content;
    decline of pigment content with time was used to calculate time to clear the gut of pigment.
    text/tab-separated-values 2010-02-03 https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
    Spatial Reference System http://www.wikidata.org/entity/Q161779 http://www.opengis.net/def/crs/OGC/1.3/CRS84
    lat http://lod.example-data-repository.org/id/dataset-parameter/20874 decimal degrees https://www.example-data-repository.org/dataset/3300
    Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project)
    Hand-held plankton net Manual Biota Sampler oceans krill biota larval krill pigments Quetin, L., Ross, R. (2010) Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106,
    LMG0205 in the Southern Ocean from 2001-2002 (SOGLOBEC project). Example Data Repository. Version 1. doi:10.1234/1234567890 [access date]
    2001-08-06/2002-09-09 1 https://creativecommons.org/licenses/by/4.0/ https://doi.org/10.1234/1234567890 year
    http://lod.example-data-repository.org/id/dataset-parameter/20861 calendar year month_local
    http://lod.example-data-repository.org/id/dataset-parameter/20877 cruiseid http://lod.example-data-repository.org/id/dataset-parameter/20860
    text sample_id http://lod.example-data-repository.org/id/dataset-parameter/20862 day_local
    http://lod.example-data-repository.org/id/dataset-parameter/20876 stage_id http://lod.example-data-repository.org/id/dataset-parameter/20865
    NSF Antarctic Sciences NSF ANT pigment_content http://lod.example-data-repository.org/id/dataset-parameter/20864 micrograms
    total chl/grams wet weight https://registry.identifiers.org/registry/doi doi:10.1234/1234567890 http://doi.org/abcd
    https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R wet_weight http://lod.example-data-repository.org/id/dataset-parameter/20866
    mg lon http://lod.example-data-repository.org/id/dataset-parameter/20875 time_local http://lod.example-data-repository.org/id/dataset-parameter/20878
    https://www.example-data-repository.org/person/51160 Dr Robin Ross month, local time https://www.example-data-repository.org/person/51159
    Dr Langdon Quetin -68.4817 -75.8183 -65.08 -68.5033 well-known text (WKT) representation of geometry http://www.wikidata.org/entity/Q4018860
    POLYGON ((-75.8183 -68.4817, -68.5033 -68.4817, -68.5033 -65.08, -75.8183 -65.08, -75.8183 -68.4817)) pigment content cruise identification year of experiment day of month,
    local time longitude, in decimal degrees, East is positive, negative denotes West time of day, local time, using 2400 clock format sample identification:
    WBC=whole body clearance expt.; WBF=whole body fluorescence on collection stage development index of larvae in sample
    (furcilia = F1-6 = 1-6,  juvenile = J=7) Dr Roberta Marinelli https://orcid.org/0000-0001-7775-xxxx average wet weight/larvae in sample
    ANT-9909933 https://www.example-data-repository.org/award/55102 http://www.nsf.gov/awardsearch/showAward.do?AwardNumber=9909933
    Winter ecology of larval krill: quantifying their interaction with the pack ice habitat
  2. When using string values from only defined Solr fields (word count: 869):

    NSF Antarctic Sciences https://example.org/executions/execution-42 biota 2002-09-09T00:00:00.000Z Dr Langdon Quetin
    https://somerepository.org/datasets/10.xxxx/Dataset-101/process-script.R Dr Robin Ross 1 Winter ecology of larval
    krill: quantifying their interaction with the pack ice habitat. larval krill pigments https://www.example-data-repository.org/dataset/3300/data/larval-krill.tsv
    Larval krill studies - fluorescence and clearance from ARSV Laurence M. Gould LMG0106, LMG0205 in the Southern Ocean from 2001-2002
    (SOGLOBEC project) 2010-02-03T00:00:00.000Z 2001-08-06T00:00:00.000Z https://doi.org/10.xxxx/Dataset-1
    https://somerepository.org/datasets/10.xxxx/Dataset-2.v2/process-script.R http://purl.dataone.org/provone/2015/01/15/ontology#Data
    lon https://somerepository.org/datasets/10.xxxx/Dataset-101 https://example.org/executions/execution-101

The second technique includes these fields:

It does not include fields:

amoeba commented 3 years ago

Since our text field is supposed to store the "full text of the metadata record", I'd vote for the first. If the size of what we're tossing in Solr for JSON-LD/SOSO docs is a problem, we also have that problem for EML and ISO docs so I'd say it's not really a problem here.

mbjones commented 3 years ago

I'd vote for the first too, and agree with Bryce's reasoning.

gothub commented 3 years ago

Indexing of text field for schema.org documents added in commit 1d4bda6d387f3a45adecdf5b6a0de2fecaa8120a