Open saracarl opened 2 months ago
This seems like it was introduced recently, so I went looking for PRs that might account for it.
This one seemed most likely.
Looking at the Secretaries example, it looks like the Omeka site has a Secretaries article with S95466 in the URL, which should correspond to a subject/article ID in FromThePage.
In my old development copy of the db, the article exists:
2.7.3 :001 > Article.find 95466
Creating scope :target_article_links. Overwriting existing method Article.target_article_links.
Creating scope :page_article_links. Overwriting existing method Article.page_article_links.
Article Load (0.4ms) SELECT `articles`.* FROM `articles` WHERE `articles`.`id` = 95466 LIMIT 1
=> #<Article id: 95466, title: "Secretaries", source_text: "“Someone who works in an office, writing letters, ...", created_on: "2021-02-24 19:21:27.000000000 +0000", lock_version: 1, xml_text: "<?xml version='1.0' encoding='UTF-8'?> \n <...", graph_image: "/home/fromthepage/deployment/releases/202307261555...", collection_id: 981, latitude: nil, longitude: nil, uri: "https://dictionary.cambridge.org/us/dictionary/eng...", provenance: nil, created_by_id: 215593, pages_count: 1549>
However in production, this record does not exist:
2.7.3 :001 > Article.find 95466
Traceback (most recent call last):
1: from (irb):1
ActiveRecord::RecordNotFound (Couldn't find Article with 'id'=95466)
Note that my local database has had the orphan article clean-up script run on it.
The new Secretaries article was created on August 8, which gives us a terminus ante quem for this problem. Looking for discontinuities in article creation dates might narrow the time frame further.
2.7.3 :002 > Article.find 32157202
=> #<Article id: 32157202, title: "Secretaries", source_text: nil, created_on: "2024-08-08 17:15:26.000000000 +0000", lock_version: 0, xml_text: nil, graph_image: "/home/fromthepage/deployment/releases/202408072043...", collection_id: 981, latitude: nil, longitude: nil, uri: nil, provenance: nil, created_by_id: 32018409, pages_count: 205>
The page_article_link records seem to reflect the new subject -- there are no orphan records pointing to the old subject ID:
2.7.3 :003 > PageArticleLink.where(article_id: 95466).count
=> 0
2.7.3 :004 > PageArticleLink.where(article_id: 32157202).count
=> 205
However, we should be able to look for the old ID in the XML text of some of these pages.
There are still pages in production whose xml_text refers to the old article ID:
2.7.3 :011 > c.pages.where("xml_text like '%target_id=\\'95466\\'%'").count
=> 2359
Example page pointing to the old Secretaries record: https://fromthepage.com/cwrgm/cwrgm-rev2/voucher-to-southwestern-telegraph-company-january-2-1864/display/32180310
(This page has had its page_article_links cleaned out, so no link record pointing to the old article remains on it.)
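The same check can be done per page rather than with a single `LIKE` pattern. Here is a minimal sketch, assuming the `target_id='…'` attribute format seen in the xml_text on this issue. The helper names and the sample XML (including its `link_id` values) are made up for illustration and are not FromThePage code.

```ruby
require 'set'

# Pull every target_id out of a page's xml_text. The attribute may be
# quoted with single or double quotes.
def target_ids(xml_text)
  xml_text.to_s.scan(/target_id=['"](\d+)['"]/).flatten.map(&:to_i)
end

# Return the target_ids that are not in the set of existing article ids.
def dangling_target_ids(xml_text, existing_article_ids)
  target_ids(xml_text).uniq.reject { |id| existing_article_ids.include?(id) }
end

# Illustrative input: one link to the old Secretaries id, one to the new one.
sample_xml = "<p><link link_id='1' target_id='95466' target_title='Secretaries'>secretary</link> " \
             "<link link_id='2' target_id='32157202' target_title='Secretaries'>secretaries</link></p>"
existing_ids = Set[32157202]

p dangling_target_ids(sample_xml, existing_ids)
```

In a console, the set of existing ids could presumably be built with something like `c.articles.pluck(:id).to_set`, and the helper run over each page's `xml_text`.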
Looking at the WWP, we see this:
2.7.3 :001 > page = Page.find 33951039
=> #<Page id: 33951039, title: "page_0003", source_text: "I presume that an\r\nordinary bishop's recommend\r\nwi......
I presume that an
ordinary bishop's recommend
will fill the bill for cre-
dentials. Is this so; and is it
necessary for the recommend
to be endorsed by yourself, or
will it be sufficient for him
to present it at the [[Brigham Young Academy, Provo, Utah County, Utah Territory|academy]].
Hoping you can reply
soon
I remain
Your brother
In the Gospel
[[Levi Mathers Savage|L. M. Savage]]
Bishop.
I think it will be all right for
him to come. Send his credentials
direct to the Rep. and see if
bro. [[William Charles Spence|Spence]] can do anything
for his <hi rend="underline">fare</hi>, and let them know.
[[Joseph Fielding Smith|J. F. S.]] => nil
<?xml version='1.0' encoding='UTF-8'?>
<p>I presume that an<lb/>ordinary bishop's recommend<lb/>will fill the bill for cre<lb break='no'/>dentials. Is thto present it at the <link link_id='39303239' target_id='32011034' target_title='Brigham Young Academy, Provo, Utah County, Utah Territory'>academy</link>.</p><p>Hoping you can reply<lb/>soon</p><p>I remain<lb/>Your brother<lb/>In the Gospel</p><p><link link_id='39303240' target_id='32013315' target_title='Levi Mathers Savage'>L. M. Savage</link><lb/>Bishop.</p><pink_id='39303241' target_id='32030321' target_title='William Charles Spence'>Spence</link> can do anything<lb/>for his <hi rend='underline'>fare</hi>, and let them know.</p><p><link link_id='39303242' target_id='32157344' target_title='Joseph Fielding Smith'>J. F. S.</link></p>
</page>
2.7.3 :004 > a = Article.find 32157344
Traceback (most recent call last):
1: from (irb):4
ActiveRecord::RecordNotFound (Couldn't find Article with 'id'=32157344)
2.7.3 :005 > page.page_article_links.last
=> #<PageArticleLink id: 39303241, page_id: 33951039, article_id: 32030321, display_text: "Spence", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">
2.7.3 :006 > page.page_article_links
, display_text: "academy", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">, #<PageArticleLi00000000 +0000", text_type: "transcription">, #<PageArticleLink id: 39303241, page_id: 33951039, article_id: 32030321, display_text: "Spence", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">]>
2.7.3 :007 > pp page.page_article_links
[#<PageArticleLink:0x000063bf04edcb88
id: 39303239,
page_id: 33951039,
article_id: 32011034,
display_text: "academy",
created_on: Wed, 21 Aug 2024 15:04:51.000000000 UTC +00:00,
text_type: "transcription">,
#<PageArticleLink:0x000063bf04edc8e0
id: 39303240,
page_id: 33951039,
article_id: 32013315,
display_text: "L. M. Savage",
created_on: Wed, 21 Aug 2024 15:04:51.000000000 UTC +00:00,
text_type: "transcription">,
#<PageArticleLink:0x000063bf04edc570
id: 39303241,
page_id: 33951039,
article_id: 32030321,
display_text: "Spence",
created_on: Wed, 21 Aug 2024 15:04:51.000000000 UTC +00:00,
text_type: "transcription">]
=> #<ActiveRecord::Associations::CollectionProxy [#<PageArticleLink id: 39303239, page_id: 33951039, article_id: 32011034, display_text: "academy", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">, #<PageArticleLink id: 39303240, page_id: 33951039, article_id: 32013315, display_text: "L. M. Savage", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">, #<PageArticleLink id: 39303241, page_id: 33951039, article_id: 32030321, display_text: "Spence", created_on: "2024-08-21 15:04:51.000000000 +0000", text_type: "transcription">]>
2.7.3 :008 > c = page.collection
=> #<Collection id: 970, title: "Wilford Woodruff Papers Project", owner_user_id: 221669, created_on: "2020-07-27 2...
2.7.3 :009 > c.articles.where(title: 'Joseph Fielding Smith').count
=> 1
2.7.3 :010 > c.articles.where(title: 'Joseph Fielding Smith').first
=> #<Article id: 32159119, title: "Joseph Fielding Smith", source_text: nil, created_on: "2024-08-21 15:10:52.000000000 +0000", lock_version: 0, xml_text: nil, graph_image: "/home/fromthepage/deployment/releases/202408072043...", collection_id: 970, latitude: nil, longitude: nil, uri: nil, provenance: nil, created_by_id: 32023688, pages_count: 21>
2.7.3 :011 > c.articles.where(title: 'Joseph Fielding Smith').first.user
Traceback (most recent call last):
1: from (irb):11
NoMethodError (undefined method `user' for #<Article:0x000063bf066c7888>)
Did you mean? super
2.7.3 :012 > c.articles.where(title: 'Joseph Fielding Smith').first.created_by
Traceback (most recent call last):
2: from (irb):11
1: from (irb):12:in `rescue in irb_binding'
NoMethodError (undefined method `created_by' for #<Article:0x000063bf05ac9ce8>)
Did you mean? created_by_id
created_on
created_on?
created_on=
2.7.3 :013 > c.articles.where(title: 'Joseph Fielding Smith').first.created_by_id
=> 32023688
2.7.3 :014 > User.find 32023688
=> #<User id: 32023688, login: "andecarson", display_name: "andecarson", real_name: "Carson Andersen", email: "carson.andersen@wilfordwoodruffpapers.org", owner: false, admin: false, created_at: "2024-04-26 15:06:42.000000000 +0000", updated_at: "2024-08-22 14:08:24.000000000 +0000", remember_token_expires_at: nil, location: nil, website: nil, about: nil, account_type: nil, paid_date: nil, guest: nil, slug: "andecarson", deleted: false, provider: nil, uid: nil, start_date: nil, orcid: nil, dictation_language: "en-US", activity_email: true, external_id: nil, sso_issuer: nil, preferred_locale: nil, api_key: nil, picture: nil, help: nil, footer_block: "For questions about this project, contact at.">
We can see when each replacement subject was created, and by whom. All three of CWRGM's were created when Alessandra Diaz saved the following page at 11:15 Central time on August 8th:
https://fromthepage.com/cwrgm/cwrgm-rev2/letter-from-j-w-piles-to-the-mississippi-state-board-of-registration-august-28-1876/transcribe/34048658
and here's the back end -- the show/display of that page:
What we see when we look at the versions is a save of the transcription, then a save with subjects linking to the "old" instances of the subjects, then a save with the subjects linking to the "new" instances, all within an 11-minute time frame, which is all very weird.
Here's the versions tab: https://fromthepage.com/cwrgm/cwrgm-rev2/letter-from-j-w-piles-to-the-mississippi-state-board-of-registration-august-28-1876/versions/34048658 The one we're most interested in is the save where she changes the State of Mississippi link from "Mississippi--Executive Office" to "Mississippi--Executive Department".
Based on this, what we think happened is that she linked "Mississippi--Executive Office", was asked to categorize it, realized she got it wrong, hit "cancel" on the subject categorization page, which kicked us into this new code. That cancellation is supposed to delete the new/abandoned subject, but what we think is happening is that all three of the subjects on the page are being deleted. When she re-saves the page after correcting the mis-link, it recreates all three of the subjects. This matches what we are seeing in their list of orphaned subjects.
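To make the suspected sequence concrete, here is a toy in-memory model of it. None of this is FromThePage code: `ToyCollection`, its methods, and the subject titles are all invented for illustration, and the cancel step simply encodes the hypothesis that every subject on the page gets destroyed.

```ruby
# Toy model of the hypothesis; not the real Article/Page classes.
class ToyCollection
  attr_reader :articles

  def initialize
    @articles = {} # id => title
    @next_id = 1
  end

  # Saving a page creates an article for any linked title that has none,
  # and returns the ids of every article linked from the page.
  def save_page(titles)
    titles.map { |title| @articles.key(title) || create(title) }
  end

  # Suspected behaviour: cancelling destroys every id it is handed,
  # not just the newly created, uncategorized one.
  def cancel_categorization(delete_ids)
    delete_ids.each { |id| @articles.delete(id) }
  end

  private

  def create(title)
    id = @next_id
    @next_id += 1
    @articles[id] = title
    id
  end
end

c = ToyCollection.new
original_ids = c.save_page(['Subject A', 'Subject B'])                      # pre-existing subjects
first_save   = c.save_page(['Subject A', 'Subject B', 'Mislinked Subject']) # creates one new subject
c.cancel_categorization(first_save)                                         # destroys all three
second_save  = c.save_page(['Subject A', 'Subject B', 'Corrected Subject']) # recreates all three

puts "original ids: #{original_ids.inspect}"
puts "ids after re-save: #{second_save.inspect}"
```

Under this model the re-save hands out three brand-new ids, while any other page whose xml_text still says `target_id='1'` is left pointing at a record that no longer exists.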
Based on the creation dates on subjects in this spreadsheet (e.g. Fruita), the problem is introduced -- or subjects are recreated -- on an edit where a link is removed. Here, it's Latter Day Saints.
Log messages from investigating with tripwire code:
I, [2024-10-22T23:18:46.524817 #2021610] INFO -- : Started GET "/woodruff/wilford-woodruff-papers-project/letter-from-thomas-edwin-ricks-james-henry-har
t-and-joseph-coulson-rich-2-june-1890-le-34727/transcribe/34242591" for 63.225.197.57 at 2024-10-22 23:18:46 +0000
I, [2024-10-22T23:18:46.527225 #2021610] INFO -- : Parameters: {"user_slug"=>"woodruff", "collection_id"=>"wilford-woodruff-papers-project", "work_id"
=>"letter-from-thomas-edwin-ricks-james-henry-hart-and-joseph-coulson-rich-2-june-1890-le-34727", "page_id"=>"34242591"}
I, [2024-10-22T23:18:47.322653 #2021561] INFO -- : Started GET "/marindasmith/970/32133297/still_editing/34242591" for 63.225.197.57 at 2024-10-22 23:18
:47 +0000
I, [2024-10-22T23:18:47.325288 #2021561] INFO -- : Processing by TranscribeController#still_editing as */*
I, [2024-10-22T23:18:47.325514 #2021561] INFO -- : Parameters: {"user_slug"=>"marindasmith", "collection_id"=>"970", "work_id"=>"32133297", "page_id"=
>"34242591"}
I, [2024-10-22T23:19:03.019410 #2021537] INFO -- : Started GET "/woodruff/970/32133297/34242591/active_editing" for 63.225.197.57 at 2024-10-22 23:19:03
+0000
I, [2024-10-22T23:19:03.019381 #2021660] INFO -- : Started GET "/page_version/show?page_version_id=34957454" for 44.214.187.82 at 2024-10-22 23:19:03 +0
000
I, [2024-10-22T23:19:03.021540 #2021537] INFO -- : Processing by TranscribeController#active_editing as */*
I, [2024-10-22T23:19:03.021674 #2021635] INFO -- : Rendered collection/show.html.slim within layouts/application (Duration: 153.0ms | Allocations: 626
67)
I, [2024-10-22T23:19:03.021733 #2021537] INFO -- : Parameters: {"user_slug"=>"woodruff", "collection_id"=>"970", "work_id"=>"32133297", "page_id"=>"34
242591"}
I, [2024-10-22T23:19:47.457047 #2021585] INFO -- : Started GET "/marindasmith/970/32133297/still_editing/34242591" for 63.225.197.57 at 2024-10-22 23:19
:47 +0000
I, [2024-10-22T23:19:47.459027 #2021585] INFO -- : Processing by TranscribeController#still_editing as */*
I, [2024-10-22T23:19:47.459216 #2021585] INFO -- : Parameters: {"user_slug"=>"marindasmith", "collection_id"=>"970", "work_id"=>"32133297", "page_id"=
>"34242591"}
I, [2024-10-22T23:20:41.624364 #2021685] INFO -- : Started PATCH "/woodruff/wilford-woodruff-papers-project/review/one_off/34242591" for 63.225.197.57 a
t 2024-10-22 23:20:41 +0000
I, [2024-10-22T23:20:41.626665 #2021685] INFO -- : Processing by TranscribeController#save_transcription as HTML
I, [2024-10-22T23:20:41.626898 #2021685] INFO -- : Parameters: {"authenticity_token"=>"rzQtCLBbPYpzVF4JRbvdPlsSeeEip40+APdYAzF+F6uJsU/X0j2gB4cDB36QpvG
sPGYuOKgyyNSWEC/Aw2NJwA==", "page_id"=>"34242591", "flow"=>"", "quality_sampling_id"=>"", "page"=>{"mark_blank"=>"0", "source_text"=>"Charles H. Hart.\r\
n\r\nLAND BUSINESS.\r\nREAL ESTATE.\r\nCOLLECTIONS.\r\n\r\nHart & Son,\r\nATTORNEYS AT LAW.\r\nOffices in Court House and on Main Street.\r\n\r\nParis, B
ear Lake Co., Idaho, 1890.\r\n\r\nuntil the Briggs case had been heard and disposed \r\nof. The matter will not be presented until the signs \r\nare more
favorable.\r\n\r\nNothing worthy of special notice more that \r\nthat already mentioned has transpired since we wrote \r\nto you. The Grand Jury is stil
l in Session—Col.\r\nJones of the [[Blackfoot, Bingham County, Idaho Territory|Blackfoot]] News told us sub rosa that \r\n56 indictments had been matured
to-day, all of \r\nthem growing out of the election business—he thought \r\nthe object was political capital manufactured to keep \r\nthe old Anti-Mormo
n party alive a little longer— \r\nbut he thought the issue in southern [[Idah Territory|Idaho]] was \r\ndead and could not be made to serve another camp
aign.\r\n\r\nPraying the Lord to continue to bless and strengthen \r\nyour in your arduous labors.\r\n\r\nWe remain as ever \r\nYour brethren in the Gos
pel\r\n\r\n[[Thomas Edwin Ricks|T. E. Ricks]]\r\n[[James Henry Hart|James H. Hart]].\r\n[[Joseph Coulson Rich|J. C. Rich]]."}, "save_to_needs_review"=>""
, "filter-brightness"=>"0", "filter-contrast"=>"0", "filter-threshold"=>"0", "user_slug"=>"woodruff", "collection_id"=>"wilford-woodruff-papers-project"}
I, [2024-10-22T23:20:41.648529 #2021685] INFO -- : TRANSCRIPTION 2024-10-22 23:20:41 +0000
TRANSCRIPTION User ID: 25091055 Email: marinda@amqc.net Display Name: marindasmith
TRANSCRIPTION Collection ID: 970 Title:Wilford Woodruff Papers Project Owner Email: contact@wilfordwoodruffpapers.org
TRANSCRIPTION Work ID: 32133297 Title: Letter from Thomas Edwin Ricks, James Henry Hart, and Joseph Coulson Rich, 2 June 1890 [LE-34727]
TRANSCRIPTION Page ID: 34242591 Position: 3 Title:page_0003
TRANSCRIPTION Source Text:
BEGIN_SOURCE_TEXT
Charles H. Hart.
LAND BUSINESS.
REAL ESTATE.
COLLECTIONS.
Hart & Son,
ATTORNEYS AT LAW.
Offices in Court House and on Main Street.
Paris, Bear Lake Co., Idaho, 1890.
until the Briggs case had been heard and disposed
of. The matter will not be presented until the signs
are more favorable.
Nothing worthy of special notice more that
that already mentioned has transpired since we wrote
to you. The Grand Jury is still in Session—Col.
Jones of the [[Blackfoot, Bingham County, Idaho Territory|Blackfoot]] News told us sub rosa that
56 indictments had been matured to-day, all of
them growing out of the election business—he thought
the object was political capital manufactured to keep
the old Anti-Mormon party alive a little longer—
but he thought the issue in southern [[Idah Territory|Idaho]] was
dead and could not be made to serve another campaign.
Praying the Lord to continue to bless and strengthen
your in your arduous labors.
We remain as ever
Your brethren in the Gospel
[[Thomas Edwin Ricks|T. E. Ricks]]
[[James Henry Hart|James H. Hart]].
[[Joseph Coulson Rich|J. C. Rich]].
END_SOURCE_TEXT
I, [2024-10-22T23:20:41.655650 #2021685] INFO -- : ISSUE4269 old_article_count = 27703
I, [2024-10-22T23:20:46.184064 #2021685] INFO -- : Redirected to https://fromthepage.com/transcribe/assign_categories?collection_id=wilford-woodruff-pap
ers-project&next_page_id=34242591&page_id=34242591
I, [2024-10-22T23:20:46.324265 #2021635] INFO -- : Started GET "/transcribe/assign_categories?collection_id=wilford-woodruff-papers-project&next_page_id
=34242591&page_id=34242591" for 63.225.197.57 at 2024-10-22 23:20:46 +0000
I, [2024-10-22T23:20:46.327060 #2021635] INFO -- : Processing by TranscribeController#assign_categories as HTML
I, [2024-10-22T23:20:46.327270 #2021635] INFO -- : Parameters: {"collection_id"=>"wilford-woodruff-papers-project", "next_page_id"=>"34242591", "page_
id"=>"34242591"}
I, [2024-10-22T23:20:52.859002 #2021561] INFO -- : Started GET "/woodruff/wilford-woodruff-papers-project/letter-from-thomas-edwin-ricks-james-henry-hart-and-joseph-coulson-rich-2-june-1890-le-34727/transcribe/34242591?rollback_delete_ids%5B%5D=32028809&rollback_delete_ids%5B%5D=32167091&rollback_delete_ids%5B%5D=32001535&rollback_delete_ids%5B%5D=32139631&rollback_delete_ids%5B%5D=32049552&rollback_unset_ids%5B%5D=32167091" for 63.225.197.57 at 2024-10-22 23:20:52 +0000
I, [2024-10-22T23:20:52.861409 #2021561] INFO -- : Processing by TranscribeController#display_page as HTML
I, [2024-10-22T23:20:52.861644 #2021561] INFO -- : Parameters: {"rollback_delete_ids"=>["32028809", "32167091", "32001535", "32139631", "32049552"], "rollback_unset_ids"=>["32167091"], "user_slug"=>"woodruff", "collection_id"=>"wilford-woodruff-papers-project", "work_id"=>"letter-from-thomas-edwin-ricks-james-henry-hart-and-joseph-coulson-rich-2-june-1890-le-34727", "page_id"=>"34242591"}
I, [2024-10-22T23:20:52.873796 #2021561] INFO -- : Rendered inline template (Duration: 0.4ms | Allocations: 76)
W, [2024-10-22T23:20:53.211549 #2021561] WARN -- : ISSUE4269 Warning: Article 32001535 Thomas Edwin Ricks in collection Wilford Woodruff Papers Project is being destroyed.
W, [2024-10-22T23:20:53.245121 #2021561] WARN -- : ISSUE4269 Warning: Article 32028809 Blackfoot, Bingham County, Idaho Territory in collection Wilford Woodruff Papers Project is being destroyed.
W, [2024-10-22T23:20:53.265932 #2021561] WARN -- : ISSUE4269 Warning: Article 32049552 Joseph Coulson Rich in collection Wilford Woodruff Papers Project is being destroyed.
W, [2024-10-22T23:20:53.316457 #2021561] WARN -- : ISSUE4269 Warning: Article 32139631 James Henry Hart in collection Wilford Woodruff Papers Project is being destroyed.
W, [2024-10-22T23:20:53.325134 #2021561] WARN -- : ISSUE4269 Warning: Article 32167091 Idah Territory in collection Wilford Woodruff Papers Project is being destroyed.
I, [2024-10-22T23:20:53.563111 #2021561] INFO -- : Rendered transcribe/display_page.html.slim within layouts/application (Duration: 233.4ms | Allocati
ons: 323903)
I, [2024-10-22T23:20:53.573706 #2021561] INFO -- : Rendered layout layouts/application.html.slim (Duration: 244.4ms | Allocations: 333234)
I, [2024-10-22T23:20:53.580009 #2021561] INFO -- : ISSUE4269 WARNING 27704 > 27699 at transcribe#display_page
I, [2024-10-22T23:20:53.585531 #2021561] INFO -- : Completed 200 OK in 724ms (Views: 180.8ms | ActiveRecord: 199.8ms | Allocations: 632011)
I, [2024-10-22T23:20:53.585786 #2021561] INFO -- : Oink Action: transcribe#display_page
I, [2024-10-22T23:20:53.585967 #2021561] INFO -- : Memory usage: 2468880 | PID: 2021561
I, [2024-10-22T23:20:53.586179 #2021561] INFO -- : Instantiation Breakdown: Total: 583 | PageArticleLink: 543 | Collection: 8 | PageBlock: 6 | ArticleVersion: 6 | Article: 5 | EditorButton: 5 | User: 3 | Page: 3 | Work: 2 | Visit: 1 | Ahoy::Event: 1
I, [2024-10-22T23:20:53.586310 #2021561] INFO -- : Oink Log Entry Complete
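The ISSUE4269 numbers in the log above can be checked by hand. The sketch below copies the values straight from the log. Reading "27704 > 27699" as the article count before and after the display_page request is an assumption, as is treating the single rollback_unset_ids entry as the one subject created by this save.

```ruby
# Values copied from the ISSUE4269 log lines.
count_before_save   = 27_703 # "old_article_count = 27703" during save_transcription
count_before_cancel = 27_704 # left-hand side of "27704 > 27699" (assumed: before rollback)
count_after_cancel  = 27_699 # right-hand side of "27704 > 27699" (assumed: after rollback)

# Parameters of the display_page request that followed the cancel.
rollback_delete_ids = [32028809, 32167091, 32001535, 32139631, 32049552]
rollback_unset_ids  = [32167091]

created_by_save = count_before_cancel - count_before_save  # subjects added by the save
destroyed       = count_before_cancel - count_after_cancel # subjects removed by the rollback
net_loss        = count_before_save - count_after_cancel   # subjects lost overall

# Ids in the delete list that are not in the unset list.
other_ids_destroyed = rollback_delete_ids - rollback_unset_ids

puts "created by save: #{created_by_save}"
puts "destroyed by rollback: #{destroyed} (delete list has #{rollback_delete_ids.size} ids)"
puts "net loss: #{net_loss} (#{other_ids_destroyed.inspect})"
```

The five destroyed articles match the five "is being destroyed" warnings, and the net loss of four lines up with the four ids other than 32167091 ("Idah Territory", the mistyped link).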
From this, and from looking at the page versions, we determined the following:
The solution is to handle uncategorized links differently (don't delete!!) on the categorization cancellation.
According to CWRGM and WWP, we're deleting more than just uncategorized subjects. We're going to roll this out anyway, but there may continue to be problems we need to investigate.
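If deletion on cancel is ever reinstated, one possible guard is to refuse to destroy anything that was not created by the save being cancelled or that other pages still link to. This is only a sketch of that idea, not the fix being rolled out: `safe_to_destroy` and its arguments are hypothetical, the link counts are invented, and it assumes for illustration that 32167091 was the only subject created by the save in the log.

```ruby
# Hypothetical guard: pick the candidate ids that are safe to destroy.
#   created_in_this_save - ids of articles created by the save being cancelled
#   other_link_counts    - article id => number of links from other pages
def safe_to_destroy(candidate_ids, created_in_this_save:, other_link_counts:)
  candidate_ids.select do |id|
    created_in_this_save.include?(id) && other_link_counts.fetch(id, 0).zero?
  end
end

# Candidate ids taken from the rollback_delete_ids parameter in the log.
candidates = [32028809, 32167091, 32001535, 32139631, 32049552]

# Invented link counts, purely for the example.
link_counts = { 32028809 => 12, 32001535 => 40, 32139631 => 3, 32049552 => 7 }

p safe_to_destroy(candidates,
                  created_in_this_save: [32167091],
                  other_link_counts: link_counts)
```

With these inputs only the newly created subject survives the filter; the four long-standing subjects are left alone.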
CWRGM is one of our biggest users for subject linking. They recently reported:
and