AjaxMultiCommentary / ajmc-pipeline

Codebase for AjaxMultiCommentary
https://ajaxmulticommentary.github.io/ajmc-pipeline/
GNU Affero General Public License v3.0

word range overlap in canonical JSON #18

Open mromanello opened 7 months ago

mromanello commented 7 months ago

In the canonical JSON of Jebb (sophoclesplaysa05campgoog), the word ἔργον (word_range=[32615, 32615]) and the words ἐννέπειν δʼ: (word_range=[32614, 32615]) end up with overlapping word_ranges, which should never happen. This breaks the display of the corresponding glosses.

      {
        "word_range": [
          32615,
          32615
        ],
        "shifts": [
          0,
          0
        ],
        "transcript": "ἔργον,",
        "label": "word-anchor",
        "anchor_target": "{\"url\" : \"https://raw.githubusercontent.com/gregorycrane/Wolf1807/master/ajax-2019/ajax-lj.xml\", \"selector\" : \"tei-l@n=12[4]:tei-l@n=12[9]\", \"textAnchor\": \"ἔργον\"}"
      },
      {
        "word_range": [
          32614,
          32615
        ],
        "shifts": [
          0,
          0
        ],
        "transcript": "ἐννέπειν δʼ:",
        "label": "word-anchor",
        "anchor_target": "{\"url\" : \"https://raw.githubusercontent.com/gregorycrane/Wolf1807/master/ajax-2019/ajax-lj.xml\", \"selector\" : \"tei-l@n=12[17]:tei-l@n=12[28]\", \"textAnchor\": \"ἐννέπειν δʼ\"}"
      },
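
The invariant stated above (no two lemmata may share a word index) can be checked mechanically. What follows is only a minimal sketch, assuming the lemmata are available as a list of dicts with an inclusive `word_range` pair, as in the excerpt. The function name `find_overlaps` is hypothetical and not part of ajmc.

```python
from typing import Dict, List, Tuple


def find_overlaps(lemmata: List[Dict]) -> List[Tuple[Dict, Dict]]:
    """Return pairs of lemmata whose inclusive word_ranges share an index.

    Hypothetical helper: assumes each dict has a "word_range" of [start, end].
    """
    # Sort by (start, end) so overlapping candidates sit next to each other.
    ordered = sorted(lemmata, key=lambda lemma: tuple(lemma["word_range"]))
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["word_range"][0] > first["word_range"][1]:
                break  # sorted by start: nothing later can overlap `first`
            overlaps.append((first, second))
    return overlaps


# The two lemmata from the excerpt, reduced to the relevant fields.
lemmata = [
    {"word_range": [32615, 32615], "transcript": "ἔργον,"},
    {"word_range": [32614, 32615], "transcript": "ἐννέπειν δʼ:"},
]

for a, b in find_overlaps(lemmata):
    print(a["transcript"], a["word_range"], "overlaps", b["transcript"], b["word_range"])
```

Because the ranges are treated as inclusive, two lemmata that merely share the boundary index 32615 are reported as overlapping, which matches the case described here.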

Printing out the word text for that range gives the following (wrong) output:

for buggy_lemma in buggy_lemmata:
    print([w.text for w in buggy_lemma.children.words])
['ov,']
['12', 'ov,']

pletcher commented 7 months ago

Hi Sven, no rush on this, but the ingestion process is finding 6 out-of-order lemmata. I've copied the log output below.

The second lemma object should always have a word range that's greater than the first, right? At least, that's the assumption that the ingestion process makes, so these could cause issues where too many words are pulled into a glossa.

[error] out of order lemmata: 
%{"anchor_target" => "{\"url\" : \"https://raw.githubusercontent.com/gregorycrane/Wolf1807/master/ajax-2019/ajax-lj.xml\", \"selector\" : \"tei-l@n=92[15]:tei-l@n=92[6]\", \"textAnchor\": \"παρέστης·\"}", "label" => "word-anchor", "shifts" => [0, 0], "transcript" => nil, "word_range" => [40344, 40350]}
%{"anchor_target" => "{\"url\" : \"https://raw.githubusercontent.com/gregorycrane/Wolf1807/master/ajax-2019/ajax-lj.xml\", \"selector\" : \"tei-l@n=93[0]:tei-l@n=93[5]\", \"textAnchor\": \"στέψω\"}", "label" => "word-anchor", "shifts" => [5, 0], "transcript" => "στέψω", "word_range" => [40346, 40346]}

[error] out of order lemmata: 
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [26253, 26253]}
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [26190, 26190]}

[error] out of order lemmata: 
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [26422, 26422]}
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [26363, 26363]}

[error] out of order lemmata: 
%{"anchor_target" => "{\"url\" : \"https://raw.githubusercontent.com/gregorycrane/Wolf1807/master/ajax-2019/ajax-lj.xml\", \"selector\" : \"tei-l@n=659[13]:tei-l@n=659[17]\", \"textAnchor\": \"ἔνθα\"}", "label" => "word-anchor", "shifts" => [0, 0], "transcript" => nil, "word_range" => [30065, 30065]}
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => "659", "word_range" => [30022, 30022]}

[error] out of order lemmata: 
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => "706", "word_range" => [31563, 31563]}
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [31458, 31458]}

[error] out of order lemmata:
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [44158, 44158]}
%{"anchor_target" => nil, "label" => "scope-anchor", "shifts" => [0, -1], "transcript" => nil, "word_range" => [44142, 44142]}
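
The ordering assumption described above can be reproduced on the pipeline side before export. This is a rough sketch, not the ingestion code itself: it assumes that a lemma counts as out of order when its start index is not strictly greater than the end index of the lemma before it, and the name `out_of_order_pairs` is hypothetical. Under that assumption, all six logged pairs are flagged.

```python
from typing import List, Sequence, Tuple

Range = Tuple[int, int]  # inclusive (start, end) word indices


def out_of_order_pairs(ranges: Sequence[Range]) -> List[Tuple[Range, Range]]:
    """Return consecutive (previous, current) pairs that break the ordering.

    Assumed rule (not confirmed by the ingestion code): the current range
    must start strictly after the previous range ends.
    """
    flagged = []
    for prev, curr in zip(ranges, ranges[1:]):
        if curr[0] <= prev[1]:
            flagged.append((prev, curr))
    return flagged


# word_range values copied from the six logged pairs.
logged = [
    ((40344, 40350), (40346, 40346)),
    ((26253, 26253), (26190, 26190)),
    ((26422, 26422), (26363, 26363)),
    ((30065, 30065), (30022, 30022)),
    ((31563, 31563), (31458, 31458)),
    ((44158, 44158), (44142, 44142)),
]

for first, second in logged:
    for prev, curr in out_of_order_pairs([first, second]):
        print("out of order:", prev, "->", curr)
```

Note that the first pair differs from the rest: its second lemma starts after the first one starts but still falls inside it, so a check that compares only start indices would miss it.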

pletcher commented 2 months ago

Just an extra note in case it's helpful: SchneidewinNauckRadermacher1913_0095's XMI seems to have swapped the columns in the commentary.