cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal
http://opendata.cern.ch/
GNU General Public License v2.0
661 stars, 147 forks

schema: make the field 'collections' compatible between records and documents #3630

Closed psaiz closed 4 months ago

tiborsimko commented 5 months ago

The notion of "collection" was used many years ago for links such as:

For many years now, these have been redirected to the faceted search:

(Note that the redirections are not perfect: the 1st and 4th example links above are redirected nicely, but the 2nd and 3rd are not!)

That is to say, the "collection" notion is mostly of historical interest. We still use it only for persistence purposes, in order to serve some good content for old links that may still be lying around somewhere on the web (even though they have not been advertised for many years now).

Therefore, if we have in the docs values such as:

    "collections": [
      {
        "experiment": "CMS"
      }
    ],

This pull request transforms this to:

    "collections": [
      "CMS"
    ],

This is OK, but it should not be strictly necessary, because the document will be matched anyway by the above faceted-search redirection, which uses the experiment field that the docs already have.

I would therefore propose going as far as removing this field from the docs altogether, because we don't store a value such as "CMS" in the records' collection field either. (Because "CMS" is an umbrella value that the search redirection expands on.)
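
For reference, the flattening described above could be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the function name `flatten_collections` is hypothetical, and it only handles the simple `{"experiment": ...}` case shown here, leaving any other entry untouched.

```python
def flatten_collections(doc):
    """Return a copy of doc with {"experiment": X} collection entries replaced by X."""
    if "collections" not in doc:
        return dict(doc)
    flat = []
    for entry in doc["collections"]:
        # Only the single-key experiment form is flattened in this sketch;
        # anything else (strings, other keys) is passed through unchanged.
        if isinstance(entry, dict) and set(entry) == {"experiment"}:
            flat.append(entry["experiment"])
        else:
            flat.append(entry)
    result = dict(doc)
    result["collections"] = flat
    return result
```

The function returns a new dictionary rather than mutating its input, so the original fixture data can still be compared against the result.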

IOW, we should not really need to preserve the collections.experiment field in the docs.

And for more complex field values, such as:

    "collections": [
      {
        "experiment": "CMS"
      },
      {
        "primary": "education"
      },
      {
        "year": "2010-2012"
      }
    ],

The education/research dichotomy was dropped a couple of years after the initial web design, and links such as:

These are now simply redirected to show some convenient content. (Again, some redirections work well, some less well.)

The "year" collection is also superseded by the faceted search, where the documents do show up when someone selects 2010. (We never had any prominent year-based collection browsing, unlike education/research, which was part of the original web design.)

Anyway, long story short, the most important use of the collection concept was in the records, not in the docs. And in the records we have only the following values:

$ for file in cernopendata/modules/fixtures/data/records/*.json; do jq -rS '.[].collections[]' "$file"; done | sort -u
ALICE-Derived-Datasets
ALICE-Learning-Resources
ALICE-Reconstructed-Data
ALICE-Tools
ATLAS-Derived-Datasets
ATLAS-Higgs-Challenge-2014
ATLAS-Learning-Resources
ATLAS-Simulated-Datasets
ATLAS-Tools
Author-Lists
CMS-Condition-Data
CMS-Configuration-Files
CMS-Derived-Datasets
CMS-Learning-Resources
CMS-Luminosity-Information
CMS-Open-Data-Instructions
CMS-Primary-Datasets
CMS-Simulated-Datasets
CMS-Tools
CMS-Trigger-Information
CMS-Validated-Runs
CMS-Validation-Utilities
Data-Policies
JADE-Computing-Notes
JADE-Logbooks
JADE-Tools
LHCb-Collision-Datasets
LHCb-Derived-Datasets
LHCb-Learning-Resources
LHCb-Tools
OPERA-Detector-Events
OPERA-Electronic-Detector-Datasets
OPERA-Emulsion-Detector-Datasets
PHENIX-Derived-Datasets
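
For reference, the same inventory can be produced without jq. The sketch below assumes, as the jq filter above does, that each fixture file holds a JSON array of records whose `collections` field is a list of strings; the function name `unique_collections` is hypothetical. Unlike the jq filter, it skips records that have no `collections` field.

```python
import json
from pathlib import Path


def unique_collections(paths):
    """Return the sorted set of collection values found in the given JSON files."""
    values = set()
    for path in paths:
        records = json.loads(Path(path).read_text(encoding="utf-8"))
        for record in records:
            # Records without a "collections" field are simply skipped.
            values.update(record.get("collections", []))
    return sorted(values)
```

It could be called with, for example, `unique_collections(Path("cernopendata/modules/fixtures/data/records").glob("*.json"))`.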

(We should strive to return something good for links /collection/<value> such as http://opendata.cern.ch/collection/CMS-Learning-Resources, which is mostly not the case. But that would call for another redirection-to-facets fix in app views, not for metadata massaging.)
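
To make the idea concrete, here is a rough sketch of how a `/collection/<value>` name might be translated into a faceted-search URL. It is not the existing logic in views.py. The `/search` path and the `experiment` and `q` query parameter names are placeholders that would need to be replaced by whatever the search page really accepts, and the function name is hypothetical. The experiment list is taken from the record values listed above.

```python
from urllib.parse import urlencode

# Experiment prefixes seen in the record collection values listed above.
EXPERIMENTS = ("ALICE", "ATLAS", "CMS", "JADE", "LHCb", "OPERA", "PHENIX")


def collection_to_search_url(value):
    """Map an old collection name to a search URL (parameter names are assumptions)."""
    prefix, _, rest = value.partition("-")
    params = {}
    if prefix in EXPERIMENTS:
        params["experiment"] = prefix  # hypothetical facet parameter
        keywords = rest
    else:
        # Values such as "Author-Lists" carry no experiment prefix.
        keywords = value
    if keywords:
        params["q"] = keywords.replace("-", " ")  # hypothetical free-text parameter
    return "/search?" + urlencode(params)
```

Under these assumptions, `collection_to_search_url("CMS-Learning-Resources")` gives `/search?experiment=CMS&q=Learning+Resources`. A real fix would more likely map the remainder onto the type facets than onto a free-text query.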

In documents, the collection field values are more varied, such as:

  "experiment": "ALICE"
  "experiment": "ATLAS"
  "experiment": "CMS"
  "experiment": "JADE"
  "experiment": "LHCb"
  "experiment": "OPERA"
    "Guide"
  "primary": ""
  "primary": "documentation"
  "primary": "Documentation",
  "primary": "education"
  "primary": "News"
  "primary": "research"
  "primary": "VM"
  "secondary": [
  "year": "2010"
  "year": "2010-2012"
  "year": "2011"
  "year": "2013"
  "year": "2015"
  "year": "2015-2016"
  "year": "2016"

Note that they mostly don't match the record values; they mostly emulate the experiment field and/or years and/or the outdated education/research dichotomy.

Therefore, since the most important collection redirection (/collection/<experiment>) uses the experiment field (which we already have for docs), and since the other collection values such as education/research are not critical and/or have not been used for years, I would propose to consider dropping the collection field from all the docs entirely, for the sake of simplicity.

(We already don't use the collection field in several docs pages such as "cod-about", "cod-privacy-policy", "cod-terms-of-use", "simulated-dataset-categories", so if it is not really mandatory, why not think about dropping it?)
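
If we do go this way, the clean-up itself is mechanical. Below is a minimal sketch, assuming the doc fixture files are JSON arrays of objects like the record files; both function names are hypothetical and the file path is left to the caller.

```python
import json
from pathlib import Path


def drop_collections(docs):
    """Return copies of the given doc dicts without the "collections" key."""
    return [{k: v for k, v in doc.items() if k != "collections"} for doc in docs]


def rewrite_doc_file(path):
    """Rewrite one JSON fixture file (assumed to be a list of docs) in place."""
    path = Path(path)
    docs = json.loads(path.read_text(encoding="utf-8"))
    cleaned = drop_collections(docs)
    # The two-space indent is a guess; match the repository's existing style.
    path.write_text(
        json.dumps(cleaned, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

Docs that never had the field, such as "cod-about", pass through unchanged.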

P.S. Regardless of whether we go for my suggestion, we should still look into the redirections, because many redirection rules in views.py do not seem to be working right now. Some redirections should be easy to fix; some are important since they were used in published papers and materials (see past redirection issues, etc.). I think that if we do fix redirections such as /collection/CMS-Primary-Datasets, we could even think of dropping the collection field from the record JSONs too; it should not really be necessary metadata-wise, since we have the type.primary and type.secondary fields, which basically cover the same information-storage need as the good old collection field.

psaiz commented 5 months ago

Thanks for the comments. I like the idea of simplifying things, so I will use this PR to drop the collections from the docs and create a separate issue/PR to deal with the redirections. Once those two things are working, we can create another issue/PR for the records.