CERNDocumentServer / cds-rdm

New CDS site based on InvenioRDM
https://repository.cern

Subject values: evaluate relevancy of the existing list or update #249

Open zzacharo opened 2 weeks ago

zzacharo commented 2 weeks ago
PaulinaBaranowska commented 2 weeks ago

I will also look into this

PaulinaBaranowska commented 2 weeks ago

Questions:

ntarocco commented 2 weeks ago
* Other => XX (should it be the other way round? We do not need it as we already have `Other Subjects`?)

@PaulinaBaranowska in the new CDS, it no longer makes sense to have an Other value, as you can insert free text. OK to drop the Other value in the new CDS (but keep it in the current CDS)?

* Particle Physics-Experiment is twice in CDS KB (once as `AB` and once as `e`), e should be deleted.

Thanks, we will fix it. We can keep the AB, drop the e and check if we need to bulk-update records. What do you think?
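
A minimal sketch of how duplicates like this one could be spotted across the whole KB before deciding on a bulk update. It assumes the KB can be exported as `(code, label)` pairs; that layout, the function name `duplicated_labels`, and the sample data are hypothetical and not taken from the actual CDS KB.

```python
from collections import defaultdict


def duplicated_labels(kb_pairs):
    """Return labels that appear under more than one code.

    kb_pairs: iterable of (code, label) tuples (assumed export format).
    """
    codes_by_label = defaultdict(set)
    for code, label in kb_pairs:
        codes_by_label[label.strip()].add(code)
    # Keep only labels that are reachable through several codes.
    return {
        label: sorted(codes)
        for label, codes in codes_by_label.items()
        if len(codes) > 1
    }


# Illustrative data only, mirroring the case discussed above.
sample_kb = [
    ("AB", "Particle Physics-Experiment"),
    ("e", "Particle Physics-Experiment"),
    ("XX", "Other"),
]
duplicates = duplicated_labels(sample_kb)
```

Records that still carry a code we drop (here `e`) would be the candidates for the bulk update.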

michamos commented 2 weeks ago

@ntarocco do you mean it won't be a controlled vocabulary? or it will be but there is also an escape hatch?

zzacharo commented 2 weeks ago

@ntarocco do you mean it won't be a controlled vocabulary? or it will be but there is also an escape hatch?

It means that if a user doesn't find the value in the controlled vocabulary, they can always add it as free text.

michamos commented 1 week ago

I don't think we want to allow free-text subjects here; that defeats the purpose. Can't they use keywords for that? Or are subjects and keywords the same thing? It might be useful if you gave some more info on how this would look in the schema.

zzacharo commented 1 week ago

In the new system, users will autocomplete from the subjects vocabulary, and if they do not find what they are looking for, they add it as free text. We store both the controlled values and the free-text values in the subjects field as follows:

"subjects": [
  {
    "id": "Accelerators and Storage Rings",
    "subject": "Accelerators and Storage Rings",
    "scheme": "CERN"
  },
  {
    "subject": "myvalue"
  }
],

Subjects without a specific id are considered keywords. They are currently shown on the record's detail page as follows:

[Image: Screenshot 2024-11-14 at 09 57 33]

You can see for example this record: https://dev-cds-rdm.web.cern.ch/records/mddtr-zvt57
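
To make the rule concrete, here is a minimal sketch of how an entry could be built and later told apart. It only relies on the JSON shape shown above (`id`, `subject`, `scheme`); the helper names `make_subject` and `split_subjects` and the one-item `VOCABULARY` set are hypothetical and are not InvenioRDM APIs.

```python
# Hypothetical stand-in for the controlled subjects vocabulary.
VOCABULARY = {"Accelerators and Storage Rings"}


def make_subject(value, vocabulary=VOCABULARY, scheme="CERN"):
    """Build one entry of the "subjects" field for a user-supplied value."""
    value = value.strip()
    if value in vocabulary:
        # Controlled value: carries an id and a scheme.
        return {"id": value, "subject": value, "scheme": scheme}
    # Free text: no id, so it is treated as a keyword.
    return {"subject": value}


def split_subjects(subjects):
    """Separate controlled subjects from free-text keywords."""
    controlled = [s for s in subjects if "id" in s]
    keywords = [s["subject"] for s in subjects if "id" not in s]
    return controlled, keywords


subjects = [make_subject("Accelerators and Storage Rings"), make_subject("myvalue")]
controlled, keywords = split_subjects(subjects)
```

With this split, statistics that depend on the controlled list could count only the `controlled` entries and ignore the keywords.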

michamos commented 1 week ago

Thanks @zzacharo for the explanation. Note that we rely on the subjects for the annual report stats (see the bottom diagram on the dashboard), but I am not sure to what extent we care about those for the rest of CDS. We will discuss further with @agentilb.

PaulinaBaranowska commented 1 week ago

After some discussion with @agentilb, the approach of autocompleting from the list of subjects and falling back to free text when users don't find what they are looking for seems like a good solution.

Would it be possible to extract any values in 65017_a that are outside of the values from the Knowledge base and send them to us? We can then clean them, or if that is not necessary or too much work, we can migrate them to the keywords in the new CDS.
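
A minimal sketch of what such an extraction could look like, assuming each record is available as a dict with a `"65017_a"` key holding a list of strings and the Knowledge base values as a plain list. That record layout, the case-insensitive comparison, and the name `values_outside_kb` are assumptions for illustration, not how the CDS dump is actually structured.

```python
from collections import Counter


def values_outside_kb(records, kb_values):
    """Count 65017_a values that do not match any Knowledge base value."""
    # Compare case-insensitively and ignore surrounding whitespace (assumption).
    known = {v.strip().casefold() for v in kb_values}
    counts = Counter()
    for record in records:
        for value in record.get("65017_a", []):
            cleaned = value.strip()
            if cleaned and cleaned.casefold() not in known:
                counts[cleaned] += 1
    return counts


# Illustrative data only.
kb = ["Accelerators and Storage Rings", "Particle Physics-Experiment"]
records = [
    {"65017_a": ["Accelerators and Storage Rings", "myvalue"]},
    {"65017_a": ["myvalue", " particle physics-experiment "]},
    {},
]
outliers = values_outside_kb(records, kb)
```

The counts give a rough idea of which stray values are frequent enough to be worth cleaning and which could simply be migrated as keywords.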