denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

Incorrect parsing of entity.topics #42

Closed gjost closed 7 years ago

gjost commented 7 years ago

GFroh 2017-07-10 14:47 [It looks] like the topic term for Hawai'i is not being properly inserted into the data itself. It is being rendered as:

{
    "id": "277",
    "term": "i"
},

The other behavior I've just noticed is that the editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json

<<<<<<< HEAD
"id": "Community publications: Pacific Citizen:389",
"term": "Journalism and media"
=======
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9

The b version (below the ======= marker) is the original data; the a version (HEAD) is what the editor does to existing topic data.
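The thread does not establish what causes this swap. Purely as an illustration, the sketch below assumes the editor round-trips each topic through a colon-delimited `term:id` string and then splits it on the first colon rather than the last. The text form and both function names (`split_first`, `split_last`) are hypothetical, not taken from ddr-cmdln.

```python
def split_first(text):
    """Split on the FIRST colon: breaks when the term itself contains colons."""
    term, _, topic_id = text.partition(":")
    return {"id": topic_id.strip(), "term": term.strip()}


def split_last(text):
    """Split on the LAST colon: the trailing ID stays separate from the term."""
    term, _, topic_id = text.rpartition(":")
    return {"id": topic_id.strip(), "term": term.strip()}


# Assumed serialized form of the topic from the diff above.
text = "Journalism and media: Community publications: Pacific Citizen:389"
print(split_first(text))
print(split_last(text))
```

Under that assumption, `split_first` yields the same id/term pair as the HEAD side of the diff, while `split_last` recovers the original data.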

gjost commented 7 years ago

Behavior confirmed:

(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python manage.py shell
>>> 
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> for topic in e.topics:
...     print topic
... 
{u'term': u'i', u'id': u'277'}
{u'term': u'Journalism and media: Community publications: Pacific Citizen', u'id': u'389'}
{u'term': u'Race and racism', u'id': u'36'}
{u'term': u'Race and racism: Cross-racial relations', u'id': u'38'}
{u'term': u'Race and racism: Discrimination', u'id': u'37'}

Note that this entity.json is already damaged:

(ddrlocal)ddr@denshodeb8:/var/www/media/base/ddr-pc-33$ less files/ddr-pc-33-15/entity.json 
...
    {
        "topics": [
            {
                "id": "277",
                "term": "i"
            },
            {
                "id": "389",
                "term": "Journalism and media: Community publications: Pacific Citizen"
            },
            {
                "id": "36",
                "term": "Race and racism"
            },
            {
                "id": "38",
                "term": "Race and racism: Cross-racial relations"
            },
            {
                "id": "37",
                "term": "Race and racism: Discrimination"
            }
        ]
    },
...

On the plus side, the ID number is intact and writing the entity file doesn't seem to further damage the data:

(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python manage.py shell
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> e.write_json()
>>>
(ddrlocal)ddr@denshodeb8:/var/www/media/base/ddr-pc-33$ git diff
diff --git a/files/ddr-pc-33-15/entity.json b/files/ddr-pc-33-15/entity.json
index 8316ca1..0171b0c 100644
--- a/files/ddr-pc-33-15/entity.json
+++ b/files/ddr-pc-33-15/entity.json
@@ -1,10 +1,10 @@
 [
     {
-        "app_commit": "9d906ffdb5df85c59fd57034abcb424bb302202d  (HEAD, origin/209-upgrade-elasticsearch, 209-upgrade-elasticsearch) 2017-01-30 17:45:05 -0800",
+        "app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958  (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
         "app_release": "0.9.4-beta",
         "application": "https://github.com/densho/ddr-cmdln.git",
-        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: unknown\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
-        "models_commit": "2106bb0a6c686e4258c0d9d02d1ced96c02f357f  2017-01-23 17:11:28 -0800"
+        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: 5\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
+        "models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5  2017-02-01 14:36:59 -0800"
     },
     {
         "id": "ddr-pc-33-15"

gjost commented 7 years ago

Detail from the most recent commit.

(ddrlocal)ddr@denshodeb8:/var/www/media/base/ddr-pc-33$ git log -n1 --format=full --patch files/ddr-pc-33-15/entity.json
commit 26b61ec199b3e3e9ffa189caa18a2c795f8756e9
Author: DDRAdmin <REDACTED@SERVER.ORG>
Commit: DDR Integration Manager <REDACTED@SERVER.ORG>

    Manual commit after ddr-transform run

diff --git a/files/ddr-pc-33-15/entity.json b/files/ddr-pc-33-15/entity.json
index 66001e4..8316ca1 100644
--- a/files/ddr-pc-33-15/entity.json
+++ b/files/ddr-pc-33-15/entity.json
...
@@ -80,11 +84,26 @@
     },
     {
         "topics": [
-            "Geographic communities: Hawai'i [277]",
-            "Journalism and media: Community publications: Pacific Citizen [389]",
-            "Race and racism [36]",
-            "Race and racism: Cross-racial relations [38]",
-            "Race and racism: Discrimination [37]"
+            {
+                "id": "277",
+                "term": "i"
+            },
+            {
+                "id": "389",
+                "term": "Journalism and media: Community publications: Pacific Citizen"
+            },
+            {
+                "id": "36",
+                "term": "Race and racism"
+            },
+            {
+                "id": "38",
+                "term": "Race and racism: Cross-racial relations"
+            },
+            {
+                "id": "37",
+                "term": "Race and racism: Discrimination"
+            }
         ]
     },
     {
...

gjost commented 7 years ago

Topics appear fine in ddr-local, including the topic in question. TagManager retrieves topic titles from the vocabs API and uses only the ID from the data, and the ID appears to be fine.

gjost commented 7 years ago

Created test object with ddrlocal. Parsing error happens on both read and write.

...
{
  "topics": [
    {
      "id": "241", "term": "Fiction"
    },
    {
      "id": "277", "term": "i"
    },
    {
      "id": "268", "term": "Pottery"
    }
  ]
},
...

gjost commented 7 years ago

The topics field in ddr-defs (/usr/local/src/ddr-local/ddr-defs/repo_models/entity.py) goes through several different converter functions:

def jsonload_topics(text): return converters.text_to_bracketids(text, ['term', 'id'])
def display_topics(data): return _display_multiline_dict('<a href="{{ data.id }}">{{ data.term }}</a>', data)
def formprep_topics(data): return converters.listofdicts_to_textnolabels(data, ['term', 'id'])
def formpost_topics(text): return converters.text_to_dicts(text, ['term', 'id'])
def csvload_topics(text): return converters.text_to_listofdicts(text)
def csvdump_topics(data): return converters.listofdicts_to_text(data)

This is partially by design: jsonload_* is supposed to ingest old formats but only save things in new/current format.
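A minimal sketch of that "read old, write new" idea, assuming the old format is the `term [id]` string seen in the earlier commit and the current format is a dict with `id` and `term` keys. `load_topics` and `BRACKET_ID` are hypothetical names for illustration; this is not the actual `converters.text_to_bracketids`.

```python
import re

# Hypothetical pattern for the legacy "term [id]" string form.
BRACKET_ID = re.compile(r"^(?P<term>.+?)\s*\[(?P<id>\d+)\]\s*$")


def load_topics(items):
    """Normalize a topics list to [{'id': ..., 'term': ...}, ...]."""
    topics = []
    for item in items:
        if isinstance(item, dict):
            # Already in the current format: copy the two known keys.
            topics.append({"id": item["id"], "term": item["term"]})
            continue
        match = BRACKET_ID.match(item)
        if match is None:
            # Fail loudly instead of silently writing damaged data.
            raise ValueError("unrecognized topic: %r" % (item,))
        topics.append({"id": match.group("id"), "term": match.group("term")})
    return topics


mixed = [
    "Race and racism [36]",
    {"id": "38", "term": "Race and racism: Cross-racial relations"},
]
print(load_topics(mixed))
```

Because the loader passes current-format dicts through unchanged, running it again on its own output gives the same result, so repeated load/save cycles should not alter the data further.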

gjost commented 7 years ago

Regex in converters.py thought the single-quote was a word boundary. Fixed in b8a6c44 (not merged yet).
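To illustrate how an apostrophe can truncate a term to `i`, here is a sketch with two hypothetical patterns. The actual regex in converters.py and the change in b8a6c44 are not shown in this thread, so `NAIVE`, `TOLERANT` and `parse_topic` are assumptions that only reproduce the symptom.

```python
import re

# Hypothetical patterns; the real regex in converters.py is not shown here.
# The character class has no apostrophe, so a term match cannot span "Hawai'i".
NAIVE = re.compile(r"(?P<term>[\w\s:]+)\s*\[(?P<id>\d+)\]")
# Accept any characters in the term, up to the trailing bracketed ID.
TOLERANT = re.compile(r"^(?P<term>.+?)\s*\[(?P<id>\d+)\]\s*$")


def parse_topic(text, pattern):
    """Return {'id': ..., 'term': ...} for a 'term [id]' string, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return {"id": match.group("id"), "term": match.group("term").strip()}


legacy = "Geographic communities: Hawai'i [277]"
print(parse_topic(legacy, NAIVE))
print(parse_topic(legacy, TOLERANT))
```

With `NAIVE`, the only span that can be followed by `[277]` starts after the apostrophe, which gives the damaged `{"id": "277", "term": "i"}`; `TOLERANT` keeps the full term.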