CERNDocumentServer / cds-videos

Access articles, reports and multimedia content in HEP
https://videos.cern.ch
GNU General Public License v2.0

internal_note field not completely migrated #1576

Closed egabancho closed 6 years ago

egabancho commented 6 years ago

Check the content of 595 in https://cds.cern.ch/record/1541893/export/hm and compare it with internal_note in https://videos.cern.ch/api/record/1541893

I think the problem comes from https://github.com/CERNDocumentServer/cds-dojson/blob/6732dd22baa491d4e8a553dee30293aec77415fc/cds_dojson/marc21/fields/videos/video.py#L143: it does not iterate over all the values and keeps only one of them (the last one).
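A minimal sketch of the suspected failure mode, using plain dictionaries rather than the actual cds-dojson rule. The function names and sample values are hypothetical and only illustrate how overwriting a variable inside a loop keeps the last value:

```python
def last_value_only(fields):
    """Suspected buggy pattern: each iteration overwrites the previous value."""
    note = None
    for field in fields:
        note = field.get('a')
    return note


def all_values(fields):
    """Collect every value instead of keeping only the last one."""
    return [field['a'] for field in fields if 'a' in field]


# Hypothetical repeated 595 fields, reduced to their $a subfield.
fields = [{'a': 'first note'}, {'a': 'second note'}, {'a': 'third note'}]

print(last_value_only(fields))
print(all_values(fields))
```

With repeated fields, the first function silently drops everything except the final entry, which would match the difference seen between the two URLs above.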

ntarocco commented 6 years ago

@ludmilamarian after discussions, we need your input. CDS example record for the 595 tag:

<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">Press Videos</subfield>
</datafield>
<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">Animations - Science</subfield>
</datafield>
<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">B-Roll Footage</subfield>
</datafield>

This will be translated to something like

{
    "internal_notes": "Press, Press Videos, Press, Animations - Science, Press, B-Roll Footage"
}

(or the same without repeating Press). However, we were thinking that it may make more sense to store the information in a more structured form, something like:

{
    "internal_keywords": [
        {"name": "Press", "value": "Press Videos"},
        {"name": "Press", "value": "Animations - Science"},
        {"name": "Press", "value": "B-Roll Footage"}
    ]
}

It really depends on the future needs, how we want to find this information. What do you think?
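To make the two options concrete, here is a rough sketch that builds both shapes from the example 595 fields using only the standard library. It is not the cds-dojson implementation: the `<record>` wrapper, the absence of an XML namespace, and the helper names (`parse_595`, `as_flat_note`, `as_keywords`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The example datafields from above, wrapped in a <record> element
# (the wrapper and the missing namespace are assumptions).
MARCXML = """<record>
<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">Press Videos</subfield>
</datafield>
<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">Animations - Science</subfield>
</datafield>
<datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">Press</subfield>
    <subfield code="s">B-Roll Footage</subfield>
</datafield>
</record>"""


def parse_595(xml_text):
    """Return one ($a, $s) pair per 595 datafield, in document order."""
    root = ET.fromstring(xml_text)
    pairs = []
    for datafield in root.iter('datafield'):
        if datafield.get('tag') != '595':
            continue
        subfields = {sf.get('code'): sf.text
                     for sf in datafield.findall('subfield')}
        pairs.append((subfields.get('a'), subfields.get('s')))
    return pairs


def as_flat_note(pairs):
    """Option 1: join every subfield value into a single string."""
    return ', '.join(value for pair in pairs for value in pair if value)


def as_keywords(pairs):
    """Option 2: keep one structured entry per 595 datafield."""
    return [{'name': name, 'value': value} for name, value in pairs]


pairs = parse_595(MARCXML)
print({'internal_notes': as_flat_note(pairs)})
print({'internal_keywords': as_keywords(pairs)})
```

The flat string is simpler to display, while the structured list keeps each name/value pair intact, which is easier to filter on later.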

egabancho commented 6 years ago

I just ran this script:

from cds_dojson.marc21.utils import load
from cds_dojson.marc21.models.videos.video import model
from cds.modules.records.resolver import record_resolver
from cds.modules.deposit.api import CDSDeposit
from invenio_db import db
from invenio_indexer.api import RecordIndexer

indexer = RecordIndexer()

with open('./595.xml') as f:
    records = [xml for xml in load(f)]

for xml_record in records:
    record_595 = model.do(xml_record)
    pid, record = record_resolver.resolve(record_595['recid'])
    deposit = CDSDeposit.get_record(record.depid.object_uuid)
    if 'internal_note' in record_595:
        record['internal_note'] = record_595['internal_note']
        deposit['internal_note'] = record_595['internal_note']
    else:
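        try:
            # No 595 note in the source: drop any stale value.
            del record['internal_note']
            del deposit['internal_note']
        except KeyError:
            # The key was already missing; print the recid for manual review.
            print(record['recid'])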
    if 'internal_categories' in record_595:
        record['internal_categories'] = record_595['internal_categories']
        deposit['internal_categories'] = record_595['internal_categories']
        press = record_595.get('internal_categories', {}).get('Press', [])
        if press:
            record['Press'] = press
            deposit['Press'] = press

    deposit.commit()
    record.commit()
    db.session.commit()
    indexer.index(record)
    indexer.index(deposit)

Which in the end gives this list of URLs for the press office: