Closed by gjost 7 years ago
Reconstruction of the last several changes to ddr-pc-33-15.
commit 1c921a6 (Thu Feb 12 16:06:46 2015 -0700)
...
{
"topics": [
"Geographic communities: Hawai'i [277]",
"Journalism and media: Community publications: Pacific Citizen [389]",
"Race and racism [36]",
"Race and racism: Cross-racial relations [38]",
"Race and racism: Discrimination [37]"
]
},
...
commit 26b61ec (Wed Feb 1 10:27:08 2017 -0800)
...
{
"topics": [
{
"id": "277",
"term": "i"
},
{
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
},
{
"id": "36",
"term": "Race and racism"
},
{
"id": "38",
"term": "Race and racism: Cross-racial relations"
},
{
"id": "37",
"term": "Race and racism: Discrimination"
}
]
},
...
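For reference, the difference between the two snapshots above is the move from a list of "term [id]" strings to a list of dicts with separate id and term. Here is a minimal sketch of that conversion. The function name and the regex are mine, for illustration only; this is not the code in ddr-cmdln or ddr-defs.

```python
import re

# Old-style topic string: "<term path> [<numeric id>]"
# (pattern is an assumption based on the examples in this thread)
TOPIC_PATTERN = re.compile(r'^(?P<term>.+?)\s*\[(?P<id>\d+)\]\s*$')

def topic_string_to_dict(text):
    """Convert 'Term: Subterm [123]' to {'id': '123', 'term': 'Term: Subterm'}."""
    match = TOPIC_PATTERN.match(text.strip())
    if not match:
        raise ValueError('Not an old-style topic string: %r' % text)
    return {'id': match.group('id'), 'term': match.group('term')}

old_topics = [
    "Geographic communities: Hawai'i [277]",
    "Race and racism: Cross-racial relations [38]",
]
new_topics = [topic_string_to_dict(t) for t in old_topics]
```

A correct conversion keeps the full term path, including the apostrophe in "Hawai'i" and the hyphen in "Cross-racial", which is what the 26b61ec snapshot fails to do for topic 277.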
If I load and write ddr-pc-33-15 (commit 26b61ec) using the current ddr-cmdln (master branch, commit 2b6f929):
(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python manage.py shell
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> for t in e.topics:
... print t
...
{u'term': u'i', u'id': u'277'}
{u'term': u'Journalism and media: Community publications: Pacific Citizen', u'id': u'389'}
{u'term': u'Race and racism', u'id': u'36'}
{u'term': u'Race and racism: Cross-racial relations', u'id': u'38'}
{u'term': u'Race and racism: Discrimination', u'id': u'37'}
>>> e.write_json()
This last command does not change the topics data at all. After the write, the file is in the same state.
If I modify the entity.json to its state in commit 1c921a6 (top of this comment) and then load and write:
(ddrlocal)ddr@denshodeb8:/usr/local/src/ddr-local/ddrlocal$ python manage.py shell
>>> from DDR import identifier
>>> e = identifier.Identifier(id='ddr-pc-33-15', base_path='/var/www/media/ddr').object()
>>> for t in e.topics:
... print t
...
{'term': u"Geographic communities: Hawai'i", 'id': u'277'}
{'term': u'Journalism and media: Community publications: Pacific Citizen', 'id': u'389'}
{'term': u'Race and racism', 'id': u'36'}
{'term': u'racial relations', 'id': u'38'}
{'term': u'Race and racism: Discrimination', 'id': u'37'}
>>> e.write_json()
The resulting entity.json:
...
{
"topics": [
{
"id": "277",
"term": "Geographic communities: Hawai'i"
},
{
"id": "389",
"term": "Journalism and media: Community publications: Pacific Citizen"
},
{
"id": "36",
"term": "Race and racism"
},
{
"id": "38",
"term": "racial relations"
},
{
"id": "37",
"term": "Race and racism: Discrimination"
}
]
},
...
I can't see how the file changed to the state it's in in the description up top.
FYI, spent some time looking at the entity.json at the top:
<<<<<<< HEAD
"app_commit": "00d6bf004a20c921f921fa5f28616ce642a51958 (HEAD, tag: v2.0, origin/master, origin/HEAD, master) 2017-05-03 11:27:32 -0700",
...
"models_commit": "8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5 2017-02-01 14:36:59 -0800"
=======
app_commit is the version of ddr-cmdln that was used to load and write the file.
models_commit is the version of ddr-defs that was used to load/write.
The DDR editor is actually two applications (ddr-local and ddr-cmdln) plus some configurations (ddr-defs). It's important that all of these match up. Except for one-off hotfixes, if I make a branch of one of these projects I always try to create identically-named branches in the other two projects.
If you run the latest master of ddr-local and the develop branch of ddr-cmdln, things may not work correctly. If your ddr-cmdln is from May 2017 and your ddr-defs is from February, there might be problems.
In this case it looks like ddr-cmdln and ddr-defs are mismatched.
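A quick way to eyeball this kind of mismatch is to pull the hash and date out of the app_commit and models_commit strings and compare the dates. This is only a sketch: the regex is an assumption based on the two values quoted above, and a date gap is a hint that the checkouts may not belong together, not proof.

```python
import re
from datetime import datetime

# Leading hex hash, optional "(refs...)" text, then "YYYY-MM-DD HH:MM:SS".
# Pattern is an assumption based on the values quoted in this thread.
COMMIT_PATTERN = re.compile(
    r'^(?P<sha>[0-9a-f]{7,40})\b.*?(?P<date>\d{4}-\d{2}-\d{2}) \d{2}:\d{2}:\d{2}'
)

def parse_commit(text):
    """Return (sha, date) from an app_commit/models_commit string."""
    match = COMMIT_PATTERN.match(text)
    if not match:
        raise ValueError('Unrecognized commit string: %r' % text)
    date = datetime.strptime(match.group('date'), '%Y-%m-%d').date()
    return match.group('sha'), date

def days_apart(app_commit, models_commit):
    """Number of days between the two commit dates (time zones ignored)."""
    _, app_date = parse_commit(app_commit)
    _, models_date = parse_commit(models_commit)
    return abs((app_date - models_date).days)

app_commit = ('00d6bf004a20c921f921fa5f28616ce642a51958 '
              '(HEAD, tag: v2.0, origin/master, origin/HEAD, master) '
              '2017-05-03 11:27:32 -0700')
models_commit = '8c5e0b200fe5f02c9216fd4bc3be42d46d881cf5 2017-02-01 14:36:59 -0800'
gap = days_apart(app_commit, models_commit)
```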
So much for that theory. I checked out the mismatched ddr-cmdln and ddr-defs and tried loading/writing ddr-pc-33-15 starting with both of the states above (commits 1c921a6 and 26b61ec). It munged the "Hawai'i" term but all the other terms/IDs behaved as expected. In other words, I was unable to duplicate the diff at the top.
Current state of ddr-pc-33-15 on kinkura/gold, commit f75ec61:
...
{
"topics": [
{
"id": "277",
"term": "i"
},
{
"id": "Community publications: Pacific Citizen:389",
"term": "Journalism and media"
},
{
"id": "36",
"term": "Race and racism"
},
{
"id": "Cross-racial relations:38",
"term": "Race and racism"
},
{
"id": "Discrimination:37",
"term": "Race and racism"
}
]
},
...
FWIW, I made a list of all the messed-up topic IDs in ddr-pc-33. The good news is it's a finite list and the messed-up IDs follow a pattern.
files/ddr-pc-33-10/entity.json-100- "id": "Community publications: Pacific Citizen:389"
files/ddr-pc-33-10/entity.json-88- "id": "Civil rights:234"
files/ddr-pc-33-10/entity.json-92- "id": "Associations and organizations: The Japanese American Citizens League:20"
files/ddr-pc-33-10/entity.json-96- "id": "Conventions and conferences:299"
files/ddr-pc-33-11/entity.json-88- "id": "Recreational activities: Sports: Bowling:316"
Shouldn't be too hard to write a script that hits each of these objects, extracts the (numeric) ID, gets the matching topic from the vocabs API, and then writes the object.
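Roughly what I have in mind for the repair step, as a sketch. The vocab dict here is a hypothetical stand-in for whatever the vocabs API returns (only three entries, copied from the good data above), and repair_topic is an illustrative name, not an existing function. Loading and writing the objects is left out.

```python
import re

# Trailing run of digits: "Community publications: Pacific Citizen:389" -> "389"
TRAILING_ID = re.compile(r'(\d+)\s*$')

def repair_topic(topic, vocab):
    """Rebuild a topic dict from its numeric ID and the canonical vocab term."""
    match = TRAILING_ID.search(topic['id'])
    if not match:
        raise ValueError('No numeric ID in %r' % topic['id'])
    topic_id = match.group(1)
    # Take the term from the vocab, so a truncated term ("i") is repaired too.
    return {'id': topic_id, 'term': vocab[topic_id]}

def repair_topics(topics, vocab):
    return [repair_topic(topic, vocab) for topic in topics]

# Hypothetical stand-in for the topics vocab API: ID -> full term path.
vocab = {
    '277': "Geographic communities: Hawai'i",
    '389': 'Journalism and media: Community publications: Pacific Citizen',
    '36': 'Race and racism',
}

broken = [
    {'id': '277', 'term': 'i'},
    {'id': 'Community publications: Pacific Citizen:389', 'term': 'Journalism and media'},
    {'id': '36', 'term': 'Race and racism'},
]
repaired = repair_topics(broken, vocab)
```

Because the term always comes from the vocab, topics that are already correct pass through unchanged, and an ID missing from the vocab raises a KeyError instead of being silently written back.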
I'm prepping a modification to ddr-defs/(entity,segment)/json_load that should correct these problems.
ddr-defs: A temporary replacement for repo_models.entity.jsonload_topics that looks for topics exhibiting the two problem patterns and repairs them with data from the topics vocab API. This fix is commit 7b37fc5 on the 043-repair-topics branch.
ddr-cmdln: Updated the ddr-transform command, adding options to filter by object ID (simple wildcard match) or by model. This would let you run ddr-transform on only certain objects in a collection, or only on certain models. This fix is commit b8141c7 in the master branch.
ddr-local: Updated the base template to display the ddr-defs path. I initially had trouble fixing this issue because my dev VM was using the wrong ddr-defs.
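To illustrate the kind of filtering the ddr-transform change is meant to allow, here is a sketch using the standard-library fnmatch module. It is not the ddr-transform implementation. The select_objects function, its parameters, and the object dicts are made up for the example, and I am assuming that "simple wildcard match" behaves like shell-style wildcards.

```python
from fnmatch import fnmatchcase

def select_objects(objects, id_pattern=None, model=None):
    """Keep objects whose ID matches the wildcard and whose model matches."""
    selected = []
    for obj in objects:
        if id_pattern and not fnmatchcase(obj['id'], id_pattern):
            continue
        if model and obj['model'] != model:
            continue
        selected.append(obj)
    return selected

# Hypothetical object listing; the file ID below is invented for illustration.
objects = [
    {'id': 'ddr-pc-33-10', 'model': 'entity'},
    {'id': 'ddr-pc-33-11', 'model': 'entity'},
    {'id': 'ddr-pc-33-15', 'model': 'entity'},
    {'id': 'ddr-pc-33-15-master-abc123', 'model': 'file'},
    {'id': 'ddr-pc-33-46', 'model': 'entity'},
]

entities_1x = select_objects(objects, id_pattern='ddr-pc-33-1*', model='entity')
files_only = select_objects(objects, model='file')
```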
Looking at the repo's commit history to determine when the problem first occurred.
commit 0a74fc9cf8 2017-07-10 10:29 Caitlin Oiye Updated metadata file(s)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
This commit modified 50 entity.json files in the repo. Looks like a ddr-transform run, except the commit is attributed to the archivist. This commit updates records to the new/current format. Topics are updated thusly:
@@ -80,67 +84,99 @@
},
{
"topics": [
- "Activism and involvement: Civil rights [234]",
- "Community activities: Associations and organizations: The Japanese American Citizens League [20]",
- "Community activities: Conventions and conferences [299]",
- "Journalism and media: Community publications: Pacific Citizen [389]"
+ {
+ "id": "234",
+ "term": "Activism and involvement: Civil rights"
+ },
+ {
+ "id": "20",
+ "term": "Community activities: Associations and organizations: The Japanese American Citizens League"
+ },
+ {
+ "id": "299",
+ "term": "Community activities: Conventions and conferences"
+ },
+ {
+ "id": "389",
+ "term": "Journalism and media: Community publications: Pacific Citizen"
+ }
]
},
After this commit ~50 entity.json files were individually modified, from 10:29 to 10:51. That is roughly 2-3 entities per minute, so it looks like manual activity. The bad topics data is introduced in these commits.
App and defs metadata is unchanged in this diff.
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
Here is a representative commit:
@@ -85,16 +85,16 @@
{
"topics": [
{
- "id": "235",
- "term": "Activism and involvement: Politics"
+ "id": "Politics:235",
+ "term": "Activism and involvement"
},
{
- "id": "43",
- "term": "Identity and values: Issei"
+ "id": "Issei:43",
+ "term": "Identity and values"
},
{
- "id": "389",
- "term": "Journalism and media: Community publications: Pacific Citizen"
+ "id": "Community publications: Pacific Citizen:389",
+ "term": "Journalism and media"
}
]
},
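The shape of the bad data in this diff is consistent with a combined "term:id" string being split at its first colon instead of its last. That is only a guess from the data, and nothing in this thread confirms where such a string would come from. The two helper functions below are hypothetical and exist only to show the difference.

```python
def split_first_colon(text):
    """Split 'term:id' at the FIRST colon (reproduces the bad pattern)."""
    term, _, topic_id = text.partition(':')
    return {'id': topic_id.strip(), 'term': term.strip()}

def split_last_colon(text):
    """Split 'term:id' at the LAST colon (keeps the full term path)."""
    term, _, topic_id = text.rpartition(':')
    return {'id': topic_id.strip(), 'term': term.strip()}

nested = 'Activism and involvement: Politics:235'
top_level = 'Race and racism:36'

bad = split_first_colon(nested)
good = split_last_colon(nested)
unaffected = split_first_colon(top_level)
```

This would also explain why top-level topics such as "Race and racism" (36) come through undamaged: with only one colon in the string, both splits give the same result.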
The next commit appears not to have changed much:
f75ec61570 2017-07-11 16:33 Caitlin Oiye o [master] {origin/master} {origin/HEAD} Manual commit after ddr-transform
Here is a representative diff:
diff --git a/files/ddr-pc-33-46/entity.json b/files/ddr-pc-33-46/entity.json
index 99f55da..053ebdd 100644
--- a/files/ddr-pc-33-46/entity.json
+++ b/files/ddr-pc-33-46/entity.json
@@ -99,9 +99,7 @@
]
},
{
- "persons": [
- ""
- ]
+ "persons": []
},
{
Some counter-examples:
NOTE: first commit introduces a new topic. Next commit strips the term ancestor but keeps the term ID in place.
ddr-densho-332 commit 8cef9008e7 Philip Kikawa <philip.kikawa@densho.org> 2017-07-10 14:17:32 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
--- a/files/ddr-densho-332-64/entity.json
+++ b/files/ddr-densho-332-64/entity.json
@@ -84,7 +84,12 @@
"rights_statement": ""
},
{
- "topics": []
+ "topics": [
+ {
+ "id": "163",
+ "term": "Japan: Pre-World War II"
+ }
+ ]
},
ddr-densho-332 commit 6fc66a9b2e Philip Kikawa <philip.kikawa@densho.org> 2017-07-10 16:29:02 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
--- a/files/ddr-densho-332-64/entity.json
+++ b/files/ddr-densho-332-64/entity.json
@@ -87,7 +87,7 @@
"topics": [
{
"id": "163",
- "term": "Japan: Pre-World War II"
+ "term": "Pre-World War II"
}
]
},
NOTE: These commits seem to have fixed bad topics data.
ddr-densho-327 commit b853d71485 Caitlin Oiye <caitlin.oiye@densho.org> 2017-07-03 13:24:01 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
--- a/files/ddr-densho-327-1/entity.json
+++ b/files/ddr-densho-327-1/entity.json
@@ -85,20 +85,20 @@
{
"topics": [
{
- "id": "Secondary education:335",
- "term": "Education"
+ "id": "34",
+ "term": "Higher education"
},
{
- "id": "Higher education:34",
- "term": "Education"
+ "id": "32",
+ "term": "Public schools"
},
{
- "id": "Public schools:32",
- "term": "Education"
+ "id": "335",
+ "term": "Secondary education"
},
{
- "id": "Concentration camps: Education:73",
- "term": "World War II"
+ "id": "73",
+ "term": "Education"
}
]
},
ddr-densho-327 commit 879580e86a Caitlin Oiye <caitlin.oiye@densho.org> 2017-07-03 13:31:00 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,20 +85,20 @@
{
"topics": [
{
- "id": "Secondary education:335",
- "term": "Education"
+ "id": "34",
+ "term": "Higher education"
},
{
- "id": "Higher education:34",
- "term": "Education"
+ "id": "32",
+ "term": "Public schools"
},
{
- "id": "Public schools:32",
- "term": "Education"
+ "id": "335",
+ "term": "Secondary education"
},
{
- "id": "Concentration camps: Education:73",
- "term": "World War II"
+ "id": "73",
+ "term": "Education"
}
]
},
NOTE: imported entities, adding (newstyle, properly formatted) topics to some of them. NOTE: topics only have the final term, no ancestors.
ddr-densho-1000 commit e3d945fcf1 DDRAdmin maunakea@hq.densho.org 2017-07-01 13:37:32 (PDT)
ddrcmdln: 4ce6a00159 (master) 2017-06-26 11:44:00 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
Looking specifically at ddr-densho-327-2.
Note that while ddr-cmdln is updated to master during this time, the same version of ddr-defs is used for all commits. Also, the problem appears to correct itself.
ddr-densho-327-2 commit 32de41e8dd DDRAdmin maunakea@hq.densho.org 2017-05-04 10:17:00 (PDT)
ddrcmdln: 9d6d28fadd (develop) 2017-03-07 15:51:48 -0800
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
Updates listofstrings data to listofdicts with separate term,id.
--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -80,10 +84,22 @@
},
{
"topics": [
- "Education: Secondary education [335]",
- "Education: Higher education [34]",
- "Education: Public schools [32]",
- "World War II: Concentration camps: Education [73]"
+ {
+ "id": "335",
+ "term": "Education: Secondary education"
+ },
+ {
+ "id": "34",
+ "term": "Education: Higher education"
+ },
+ {
+ "id": "32",
+ "term": "Education: Public schools"
+ },
+ {
+ "id": "73",
+ "term": "World War II: Concentration camps: Education"
+ }
]
},
ddr-densho-327-2 commit 085b89ab11 Sara Beckman sara.beckman@densho.org 2017-06-28 11:35:14 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
Bad topics data introduced
--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,25 +85,27 @@
{
"topics": [
{
- "id": "335",
- "term": "Education: Secondary education"
+ "id": "Secondary education:335",
+ "term": "Education"
},
{
- "id": "34",
- "term": "Education: Higher education"
+ "id": "Higher education:34",
+ "term": "Education"
},
{
- "id": "32",
- "term": "Education: Public schools"
+ "id": "Public schools:32",
+ "term": "Education"
},
{
- "id": "73",
- "term": "World War II: Concentration camps: Education"
+ "id": "Concentration camps: Education:73",
+ "term": "World War II"
}
]
},
ddr-densho-327-2 commit 879580e86a Caitlin Oiye caitlin.oiye@densho.org 2017-07-03 13:31:00 (PDT)
ddrcmdln: 00d6bf004a (master) 2017-05-03 11:27:32 -0700
ddrdefs: 8c5e0b200f 2017-02-01 14:36:59 -0800
Bad topics data fixed
--- a/files/ddr-densho-327-2/entity.json
+++ b/files/ddr-densho-327-2/entity.json
@@ -85,20 +85,20 @@
{
"topics": [
{
- "id": "Secondary education:335",
- "term": "Education"
+ "id": "34",
+ "term": "Higher education"
},
{
- "id": "Higher education:34",
- "term": "Education"
+ "id": "32",
+ "term": "Public schools"
},
{
- "id": "Public schools:32",
- "term": "Education"
+ "id": "335",
+ "term": "Secondary education"
},
{
- "id": "Concentration camps: Education:73",
- "term": "World War II"
+ "id": "73",
+ "term": "Education"
}
]
},
This should have been fixed by b6de27a.
Additional context, from the entity.json file this is taken from: