denshoproject / ddr-cmdln

Command-line tools for automating the Densho Digital Repository's various processes.

More incorrect parsing of bracketID fields #43

Closed gjost closed 6 years ago

gjost commented 7 years ago

Mon, Jul 10, 2017 at 2:47 PM [The] editor seems to be munging the "topics" dict when it reloads the data. Here's an example of the diff from ddr-pc-33/files/ddr-pc-33-15/entity.json

<<<<<<< HEAD
                "id": "Community publications: Pacific Citizen:389",
                "term": "Journalism and media"
=======
                "id": "389",
                "term": "Journalism and media: Community publications: Pacific Citizen"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9

The b version (26b61ec) is the original data; the a version (HEAD) is what the editor is doing to existing topic data.

Additional context, from the entity.json file this is taken from:

<<<<<<< HEAD
        "app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958  (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
        "app_release": "0.9.4-beta",
        "application": "",
        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: 5\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
        "models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5  2017-02-01 14:36:59 -0800"
=======
        "app_commit": "9d906ffdb5df85c59fd57034abcb424bb302202d  (HEAD, origin/209-upgrade-elasticsearch, 209-upgrade-elasticsearch) 2017-01-30 17:45:05 -0800",
        "app_release": "0.9.4-beta",
        "application": "",
        "git_version": "git version 2.1.4; git-annex version: 5.20141125\nbuild flags: Assistant Webapp Webapp-secure Pairing Testsuite S3 WebDAV Inotify DBus DesktopNotify XMPP DNS Feeds Quvi TDFA CryptoHash\nkey/value backends: SHA256E SHA1E SHA512E SHA224E SHA384E SKEIN256E SKEIN512E SHA256 SHA1 SHA512 SHA224 SHA384 SKEIN256 SKEIN512 WORM URL\nremote types: git gcrypt S3 bup directory rsync web webdav tahoe glacier ddar hook external\nlocal repository version: unknown\nsupported repository version: 5\nupgrade supported from repository versions: 0 1 2 4",
        "models_commit": "2106bb0a6c686e4258c0d9d02d1ced96c02f357f  2017-01-23 17:11:28 -0800"
>>>>>>> 26b61ec199b3e3e9ffa189caa18a2c795f8756e9

gjost commented 7 years ago

Reconstruction of last several changes to ddr-pc-33-15.

commit 1c921a6 (Thu Feb 12 16:06:46 2015 -0700)

        "topics": [
            "Geographic communities: Hawai'i [277]",                                                                                         
            "Journalism and media: Community publications: Pacific Citizen [389]",                                                           
            "Race and racism [36]",                                                                                                          
            "Race and racism: Cross-racial relations [38]",                                                                                  
            "Race and racism: Discrimination [37]"          

commit 26b61ec (Wed Feb 1 10:27:08 2017 -0800)

        "topics": [
                "id": "277",
                "term": "i"
                "id": "389",
                "term": "Journalism and media: Community publications: Pacific Citizen"
                "id": "36",
                "term": "Race and racism"
                "id": "38",
                "term": "Race and racism: Cross-racial relations"
                "id": "37",
                "term": "Race and racism: Discrimination"

If I load and write ddr-pc-33-15 (commit 26b61ec) using the current ddr-cmdln (master branch, commit 2b6f929):

(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python shell
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> for t in e.topics:
...     print t
{u'term': u'i', u'id': u'277'}
{u'term': u'Journalism and media: Community publications: Pacific Citizen', u'id': u'389'}
{u'term': u'Race and racism', u'id': u'36'}
{u'term': u'Race and racism: Cross-racial relations', u'id': u'38'}
{u'term': u'Race and racism: Discrimination', u'id': u'37'}
>>> e.write_json()

This last command does not change the topics data at all. After the write, the file is in the same state.

If I modify the entity.json to its state in commit 1c921a6 (top of this comment) and then load and write:

(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python shell
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> for t in e.topics:
...     print t
{'term': u"Geographic communities: Hawai'i", 'id': u'277'}
{'term': u'Journalism and media: Community publications: Pacific Citizen', 'id': u'389'}
{'term': u'Race and racism', 'id': u'36'}
{'term': u'racial relations', 'id': u'38'}
{'term': u'Race and racism: Discrimination', 'id': u'37'}
>>> e.write_json()

The resulting entity.json:

        "topics": [
                "id": "277",
                "term": "Geographic communities: Hawai'i"
                "id": "389",
                "term": "Journalism and media: Community publications: Pacific Citizen"
                "id": "36",
                "term": "Race and racism"
                "id": "38",
                "term": "racial relations"
                "id": "37",
                "term": "Race and racism: Discrimination"

I can't see how the file got into the state shown in the description at the top.
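
For reference, here is a minimal sketch of splitting an old-style "Term [ID]" string into separate term and id values. This is not the actual ddr-cmdln/ddr-defs parser; `parse_bracketid` is a hypothetical name. It splits on the last `[` only, so apostrophes, colons, and hyphens inside the term (as in "Hawai'i" and "Cross-racial") are never treated as delimiters.

```python
def parse_bracketid(text):
    """Split 'Term: Subterm [ID]' into {'term': ..., 'id': ...}.

    Hypothetical helper for illustration, not the DDR implementation.
    """
    text = text.strip()
    # Split on the LAST '[' so punctuation inside the term is left alone.
    term, sep, rest = text.rpartition('[')
    if not sep or not rest.endswith(']'):
        # No bracketed ID present: keep the whole string as the term.
        return {'term': text, 'id': ''}
    return {'term': term.strip(), 'id': rest[:-1].strip()}


# Old-style topic strings taken from commit 1c921a6 above.
old_topics = [
    "Geographic communities: Hawai'i [277]",
    "Race and racism: Cross-racial relations [38]",
]
for topic in old_topics:
    print(parse_bracketid(topic))
```

Splitting from the right keeps the full ancestor path in the term, which is the behavior the 26b61ec data shows for every topic except "Hawai'i".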

gjost commented 7 years ago

FYI, spent some time looking at the entity.json at the top:

<<<<<<< HEAD
        "app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958  (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
        "models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5  2017-02-01 14:36:59 -0800"

app_commit is the version of ddr-cmdln that was used to load and write the file. models_commit is the version of ddr-defs that was used to load/write.

The DDR editor is actually two applications (ddr-local and ddr-cmdln) plus some configurations (ddr-defs). It's important that all of these match up. Except for one-off hotfixes, if I make a branch of one of these projects, I always try to create identically-named branches in the other two projects. If you run the latest master of ddr-local and the develop branch of ddr-cmdln, things may not work correctly. If your ddr-cmdln is from May 2017 and your ddr-defs is from February, there might be problems.

In this case it looks like ddr-cmdln and ddr-defs are mismatched.
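
A rough sketch of how the gap between the two versions could be measured from the metadata strings themselves. It assumes the `app_commit` and `models_commit` values always look like the ones quoted above (hash, optional refs in parentheses, then a timestamp with UTC offset); `parse_commit_field` and `commit_gap_days` are hypothetical names, not part of ddr-cmdln.

```python
import re
from datetime import datetime

# Assumed layout: "<sha>  (<refs>) <YYYY-MM-DD HH:MM:SS +ZZZZ>", refs optional.
COMMIT_FIELD = re.compile(
    r'^(?P<sha>[0-9a-f]{7,40})\s+'
    r'(?:\((?P<refs>[^)]*)\)\s+)?'
    r'(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})$'
)


def parse_commit_field(value):
    """Return (sha, timezone-aware datetime) from an app_commit/models_commit string."""
    match = COMMIT_FIELD.match(value.strip())
    if match is None:
        raise ValueError('unrecognized commit field: %r' % value)
    timestamp = datetime.strptime(match.group('ts'), '%Y-%m-%d %H:%M:%S %z')
    return match.group('sha'), timestamp


def commit_gap_days(app_commit, models_commit):
    """Whole days between the ddr-cmdln commit and the ddr-defs commit."""
    _, app_time = parse_commit_field(app_commit)
    _, models_time = parse_commit_field(models_commit)
    return abs(app_time - models_time).days


app = ("00d6bf004a20c921f921fa5f28616ce642a51958  (HEAD, tag: v2.0, "
       "origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700")
models = "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5  2017-02-01 14:36:59 -0800"
print(commit_gap_days(app, models))
```

A large gap does not prove a mismatch, but it is a cheap signal that the two checkouts may come from different branches or releases.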

gjost commented 7 years ago

So much for that theory. I checked out the mismatched ddr-cmdln and ddr-defs and tried loading/writing ddr-pc-33-15 starting with both of the states above (commits 1c921a6 and 26b61ec). It munged the "Hawai'i" term but all the other terms/IDs behaved as expected. In other words, I was unable to duplicate the diff at the top.

gjost commented 7 years ago

Current state of ddr-pc-33-15 on kinkura/gold, commit f75ec61:

        "topics": [
                "id": "277",
                "term": "i"
                "id": "Community publications: Pacific Citizen:389",
                "term": "Journalism and media"
                "id": "36",
                "term": "Race and racism"
                "id": "Cross-racial relations:38",
                "term": "Race and racism"
                "id": "Discrimination:37",
                "term": "Race and racism"
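
The listing above mixes two kinds of damage: IDs that have absorbed part of the term, and a term ("i") that has lost most of its text while keeping a valid ID. A minimal sketch for spotting the first kind, assuming topics are loaded as a list of dicts with string `id` values; `find_munged_ids` is a hypothetical name. The second kind cannot be detected this way because the ID still looks fine; it needs a comparison against the vocabulary.

```python
def find_munged_ids(topics):
    """Return the topics whose 'id' is not a purely numeric string."""
    return [t for t in topics if not str(t.get("id", "")).isdigit()]


# Topics from ddr-pc-33-15 at commit f75ec61, as listed above.
topics = [
    {"id": "277", "term": "i"},
    {"id": "Community publications: Pacific Citizen:389", "term": "Journalism and media"},
    {"id": "36", "term": "Race and racism"},
    {"id": "Cross-racial relations:38", "term": "Race and racism"},
    {"id": "Discrimination:37", "term": "Race and racism"},
]
for topic in find_munged_ids(topics):
    print(topic["id"])
```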
gjost commented 7 years ago

FWIW, I made a list of all the messed-up topic IDs in ddr-pc-33. The good news is it's a finite list and the messed-up IDs follow a pattern.

files/ddr-pc-33-10/entity.json-100-                "id": "Community publications: Pacific Citizen:389"
files/ddr-pc-33-10/entity.json-88-                "id": "Civil rights:234"
files/ddr-pc-33-10/entity.json-92-                "id": "Associations and organizations: The Japanese American Citizens League:20"
files/ddr-pc-33-10/entity.json-96-                "id": "Conventions and conferences:299"
files/ddr-pc-33-11/entity.json-88-                "id": "Recreational activities: Sports: Bowling:316"

Shouldn't be too hard to write a script that hits each of these objects, extracts the (numeric) ID, gets the matching topic from the vocabs API, and then writes the object.
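
A sketch of what the per-topic part of such a script might look like. The `VOCAB` dict is a hypothetical stand-in for a lookup against the vocabs API (the real API's response format is not shown in this thread), and `repair_topic` is an illustrative name. Loading and writing the objects themselves is left out.

```python
import re

# Hypothetical stand-in for the topics vocabulary: numeric ID -> full term path.
VOCAB = {
    "277": "Geographic communities: Hawai'i",
    "389": "Journalism and media: Community publications: Pacific Citizen",
    "38": "Race and racism: Cross-racial relations",
}

# The numeric ID is whatever run of digits ends the 'id' value.
TRAILING_ID = re.compile(r"(\d+)\s*$")


def repair_topic(topic, vocab=VOCAB):
    """Return a repaired copy of one topic dict, or the original if no ID is found."""
    match = TRAILING_ID.search(topic.get("id", ""))
    if not match:
        return topic
    topic_id = match.group(1)
    # Fall back to the existing term when the vocabulary has no entry.
    term = vocab.get(topic_id, topic.get("term", ""))
    return {"id": topic_id, "term": term}


broken = [
    {"id": "277", "term": "i"},
    {"id": "Community publications: Pacific Citizen:389", "term": "Journalism and media"},
    {"id": "Cross-racial relations:38", "term": "Race and racism"},
]
repaired = [repair_topic(t) for t in broken]
for topic in repaired:
    print(topic)
```

Because the term is always replaced from the vocabulary when the ID is known, the same function covers both the munged IDs and the truncated "i" term.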

gjost commented 7 years ago

I'm prepping a modification to ddr-defs/(entity,segment)/json_load that should correct these problems.

gjost commented 7 years ago

ddr-defs: A temporary replacement for repo_models.entity.jsonload_topics that looks for topics exhibiting the two problem patterns and repairs them with data from the topics vocab API.
This fix is commit 7b37fc5 on the 043-repair-topics branch.

ddr-cmdln: Updated the ddr-transform command, adding options to filter by object ID (simple wildcard match) or by model. This would let you run ddr-transform on only certain objects in a collection, or only on certain models. This fix is commit b8141c7 in the master branch.

ddr-local: Updated the base template to display the ddr-defs path. I initially had trouble fixing this issue because my dev VM was using the wrong ddr-defs.
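
To illustrate the kind of filtering added to ddr-transform, here is a minimal sketch using a shell-style wildcard match on object IDs plus an exact match on model. This is not the ddr-transform code; `select_objects`, the parameter names, and the dict layout of the objects are all assumptions made for the example.

```python
from fnmatch import fnmatch


def select_objects(objects, id_pattern=None, model=None):
    """Keep objects matching an optional ID wildcard and an optional model name."""
    selected = []
    for obj in objects:
        if id_pattern and not fnmatch(obj["id"], id_pattern):
            continue
        if model and obj["model"] != model:
            continue
        selected.append(obj)
    return selected


# Illustrative object list; the model names are assumed.
objects = [
    {"id": "ddr-pc-33", "model": "collection"},
    {"id": "ddr-pc-33-10", "model": "entity"},
    {"id": "ddr-pc-33-15", "model": "entity"},
    {"id": "ddr-densho-327-2", "model": "entity"},
]
for obj in select_objects(objects, id_pattern="ddr-pc-33-1*", model="entity"):
    print(obj["id"])
```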

gjost commented 7 years ago

Looking at the repo's commit history to determine when the problem first occurred.

commit 0a74fc9cf8 2017-07-10 10:29 Caitlin Oiye Updated metadata file(s)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

This commit modified 50 entity.json files in the repo. It looks like a ddr-transform run, except that the commit is attributed to the archivist. The commit updates records to the new/current format. Topics are updated as follows:

@@ -80,67 +84,99 @@                                                                                                                                            
         "topics": [                                                                                                                                           
-            "Activism and involvement: Civil rights [234]",                                                                                                   
-            "Community activities: Associations and organizations: The Japanese American Citizens League [20]",                                               
-            "Community activities: Conventions and conferences [299]",                                                                                        
-            "Journalism and media: Community publications: Pacific Citizen [389]"                                                                             
+            {                                                                                                                                                 
+                "id": "234",                                                                                                                                  
+                "term": "Activism and involvement: Civil rights"                                                                                              
+            },                                                                                                                                                
+            {                                                                                                                                                 
+                "id": "20",                                                                                                                                   
+                "term": "Community activities: Associations and organizations: The Japanese American Citizens League"                                         
+            },                                                                                                                                                
+            {                                                                                                                                                 
+                "id": "299",                                                                                                                                  
+                "term": "Community activities: Conventions and conferences"                                                                                   
+            },                                                                                                                                                
+            {                                                                                                                                                 
+                "id": "389",                                                                                                                                  
+                "term": "Journalism and media: Community publications: Pacific Citizen"                                                                       
+            }                                                                                                                                                 

After this commit ~50 entity.json files were individually modified, from 10:29 to 10:51. That is a little over 2 entities per minute, so it looks like manual activity. The bad topics data is introduced in these commits. App and defs metadata is unchanged in this diff:

ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

Here is a representative commit:

@@ -85,16 +85,16 @@                                                                                                                                            
         "topics": [                                                                                                                                           
-                "id": "235",                                                                                                                                  
-                "term": "Activism and involvement: Politics"                                                                                                  
+                "id": "Politics:235",                                                                                                                         
+                "term": "Activism and involvement"                                                                                                            
-                "id": "43",                                                                                                                                   
-                "term": "Identity and values: Issei"                                                                                                          
+                "id": "Issei:43",                                                                                                                             
+                "term": "Identity and values"                                                                                                                 
-                "id": "389",                                                                                                                                  
-                "term": "Journalism and media: Community publications: Pacific Citizen"                                                                       
+                "id": "Community publications: Pacific Citizen:389",                                                                                          
+                "term": "Journalism and media"                                                                                                                

The next commit appears not to have changed much:

f75ec61570 2017-07-11 16:33 Caitlin Oiye o [master] {origin/master} {origin/HEAD} Manual commit after ddr-transform

Here is a representative diff:

diff --git a/files/ddr-pc-33-46/entity.json b/files/ddr-pc-33-46/entity.json                                                                                   
index 99f55da..053ebdd 100644                                                                                                                                  
--- a/files/ddr-pc-33-46/entity.json                                                                                                                           
+++ b/files/ddr-pc-33-46/entity.json                                                                                                                           
@@ -99,9 +99,7 @@                                                                                                                                              
-        "persons": [                                                                                                                                          
-            ""                                                                                                                                                
-        ]                                                                                                                                                     
+        "persons": []

gjost commented 7 years ago

Some counter-examples:


NOTE: The first commit introduces a new topic. The next commit strips the term's ancestor but keeps the term ID in place.

ddr-densho-332 commit 8cef9008e7 Philip Kikawa <> 2017-07-10 14:17:32 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

--- a/files/ddr-densho-332-64/entity.json
+++ b/files/ddr-densho-332-64/entity.json
@@ -84,7 +84,12 @@
         "rights_statement": ""
-        "topics": []
+        "topics": [
+            {
+                "id": "163",
+                "term": "Japan: Pre-World War II"
+            }
+        ]

ddr-densho-332 commit 6fc66a9b2e Philip Kikawa <> 2017-07-10 16:29:02 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

--- a/files/ddr-densho-332-64/entity.json
+++ b/files/ddr-densho-332-64/entity.json
@@ -87,7 +87,7 @@
         "topics": [
                 "id": "163",
-                "term": "Japan: Pre-World War II"
+                "term": "Pre-World War II"


NOTE: These commits seem to have fixed bad topics data.

ddr-densho-327 commit b853d71485 Caitlin Oiye <> 2017-07-03 13:24:01 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

--- a/files/ddr-densho-327-1/entity.json
+++ b/files/ddr-densho-327-1/entity.json
@@ -85,20 +85,20 @@
         "topics": [
-                "id": "Secondary education:335",
-                "term": "Education"
+                "id": "34",
+                "term": "Higher education"
-                "id": "Higher education:34",
-                "term": "Education"
+                "id": "32",
+                "term": "Public schools"
-                "id": "Public schools:32",
-                "term": "Education"
+                "id": "335",
+                "term": "Secondary education"
-                "id": "Concentration camps: Education:73",
-                "term": "World War II"
+                "id": "73",
+                "term": "Education"

ddr-densho-327 commit 879580e86a Caitlin Oiye <> 2017-07-03 13:31:00 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,20 +85,20 @@
         "topics": [
-                "id": "Secondary education:335",
-                "term": "Education"
+                "id": "34",
+                "term": "Higher education"
-                "id": "Higher education:34",
-                "term": "Education"
+                "id": "32",
+                "term": "Public schools"
-                "id": "Public schools:32",
-                "term": "Education"
+                "id": "335",
+                "term": "Secondary education"
-                "id": "Concentration camps: Education:73",
-                "term": "World War II"
+                "id": "73",
+                "term": "Education"


NOTE: This commit imported entities, adding (new-style, properly formatted) topics to some of them.
NOTE: The topics only have the final term, no ancestors.

`ddr-densho-1000 commit e3d945fcf1 DDRAdmin 2017-07-01 13:37:32 (PDT) ddrcmdln:4ce6a00159 (master) 2017-06-26 11:44:00 -0700 ddrdefs:8c5e0b200f 2017-02-01 14:36:59 -0800`

gjost commented 7 years ago

Looking specifically at ddr-densho-327-2. Note that while ddr-cmdln is updated to master during this time, the same version of ddr-defs is used for all commits. Also, the problem appears to correct itself.

ddr-densho-327-2 commit 32de41e8dd DDRAdmin 2017-05-04 10:17:00 (PDT)
ddrcmdln: 9d6d28fadd (develop) 2017-03-07 15:51:48 -0800
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

Updates listofstrings data to listofdicts with separate term and id.

--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -80,10 +84,22 @@
         "topics": [
-            "Education: Secondary education [335]",
-            "Education: Higher education [34]",
-            "Education: Public schools [32]",
-            "World War II: Concentration camps: Education [73]"
+            {
+                "id": "335",
+                "term": "Education: Secondary education"
+            },
+            {
+                "id": "34",
+                "term": "Education: Higher education"
+            },
+            {
+                "id": "32",
+                "term": "Education: Public schools"
+            },
+            {
+                "id": "73",
+                "term": "World War II: Concentration camps: Education"
+            }

ddr-densho-327-2 commit 085b89ab11 Sara Beckman | 2017-06-28 11:35:14 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

Bad topics data introduced.

--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,25 +85,27 @@
         "topics": [
-                "id": "335",
-                "term": "Education: Secondary education"
+                "id": "Secondary education:335",
+                "term": "Education"
-                "id": "34",
-                "term": "Education: Higher education"
+                "id": "Higher education:34",
+                "term": "Education"
-                "id": "32",
-                "term": "Education: Public schools"
+                "id": "Public schools:32",
+                "term": "Education"
-                "id": "73",
-                "term": "World War II: Concentration camps: Education"
+                "id": "Concentration camps: Education:73",
+                "term": "World War II"

ddr-densho-327-2 commit 879580e86a Caitlin Oiye 2017-07-03 13:31:00 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800

Bad topics data fixed.

--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,20 +85,20 @@
         "topics": [
-                "id": "Secondary education:335",
-                "term": "Education"
+                "id": "34",
+                "term": "Higher education"
-                "id": "Higher education:34",
-                "term": "Education"
+                "id": "32",
+                "term": "Public schools"
-                "id": "Public schools:32",
-                "term": "Education"
+                "id": "335",
+                "term": "Secondary education"
-                "id": "Concentration camps: Education:73",
-                "term": "World War II"
+                "id": "73",
+                "term": "Education"

gjost commented 6 years ago

This should have been fixed by b6de27a.