Search issues: making inline elements "invisible"

aso2101 commented 7 years ago

I think that when people search for terms they will want to ignore all of the text-critical markup. Thus for example K6 has the line ghara<unclear>sa</unclear> in div[@type='edition'], which is rendered as ghara(sa) by the ODD. But if one searches for gharasa, there are no results. In order to find the relevant passage one has to search for ghara*.

The following elements should be "ignored" (i.e., their text content should be combined with the immediately preceding or immediately succeeding text node, up to the space) for the sake of searching:

unclear: e.g., ghara<unclear>sa</unclear> should be indexed as gharasa (K6)
supplied: e.g., ga<supplied reason="lost">ha</supplied>patino should be indexed as gahapatino (Ku21)
add: e.g., <add place="below">gha</add>riniya should be indexed as ghariniya (no examples in my corpus yet)
del: e.g., <del>gha</del>ghariniya should be indexed as ghariniya (actually I think it is indexed this way anyway, so nothing needs to be done here).

These should also be combined, since these elements often occur together (e.g., <supplied reason="lost">bha</supplied><unclear>yata</unclear> should be indexed as bhayata).

If possible, the same kind of behavior should be applied to those elements even when they contain spaces (although this should happen much less frequently and I can't find any examples right now):

de<unclear>ya dha</unclear>ma should be indexed as deya and dhama
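
To make the requested behaviour concrete, here is a minimal Python sketch of the indexing rule described above. It is not the project's actual indexer: the function name `index_tokens`, the `<ab>` wrapper element, and the handling of any element not named in this comment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Policy taken from the examples in this comment (the set names are hypothetical).
MERGE = {"unclear", "supplied", "add"}  # keep the text, make the tags invisible
DROP = {"del"}                          # drop the tags together with their text


def index_tokens(xml_fragment):
    """Return the whitespace-separated tokens an index would see."""
    # Wrap the fragment so that mixed content parses as one element.
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")

    def flatten(elem):
        parts = [elem.text or ""]
        for child in elem:
            if child.tag in MERGE:
                parts.append(flatten(child))
            elif child.tag not in DROP:
                # Assumption: any other element acts as a token boundary.
                parts.append(" " + flatten(child) + " ")
            # Text following the child always belongs to the parent.
            parts.append(child.tail or "")
        return "".join(parts)

    return flatten(root).split()


print(index_tokens("ghara<unclear>sa</unclear>"))
print(index_tokens("de<unclear>ya dha</unclear>ma"))
```

Because the text is joined first and split on whitespace only afterwards, adjacent elements such as `<supplied>` followed by `<unclear>` end up in a single token, and an element that contains a space yields two tokens.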

For the elements <choice> and <app>, which occur in the corpus relatively often, I am a bit more uncertain. I plan on moving 'inline' apparatus elements to an external apparatus for all of the inscriptions, so theoretically <app> should not match anything in div[@type='edition']. But I think that when one includes the apparatus in the search (I will post a separate issue for this) then all of the elements inside <app> should be considered potential hits (i.e., <lem>, <rdg>, and <note>), although the behaviour of the inline elements (<unclear> and <supplied>) should be the same as noted above.

For the element <gap>, I am not sure what to do. Right now, I think that <gap> just screws up any searches, in the sense that pu<gap/>ṇa will probably not match the terms pu, puṇa, puteṇa (which is probably what this stands for), etc. Would it be possible for <gap> elements to be treated as quasi-wildcards, so that a search term like "puteṇa" would match pu<gap/>ṇa?
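
One way to picture the quasi-wildcard idea is to turn each word containing a <gap> into a pattern and test search terms against it. The Python sketch below does that under stated assumptions: `gap_pattern` is a hypothetical helper, a gap is taken to stand for any run of non-space characters (including none), and only one level of nesting is handled. It says nothing about how the search engine used here would implement this.

```python
import re
import xml.etree.ElementTree as ET


def gap_pattern(xml_fragment):
    """Compile a regex in which every <gap/> matches any non-space run."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")
    parts = [re.escape(root.text or "")]
    for child in root:
        if child.tag == "gap":
            parts.append(r"\S*")  # the gap behaves like a wildcard
        else:
            # Other inline elements contribute their text literally.
            parts.append(re.escape("".join(child.itertext())))
        parts.append(re.escape(child.tail or ""))
    return re.compile("".join(parts))


pattern = gap_pattern("pu<gap/>ṇa")
print(bool(pattern.fullmatch("puteṇa")))
print(bool(pattern.fullmatch("pu")))
```

With `fullmatch`, the search term has to cover the whole word, so "puteṇa" and "puṇa" match while "pu" alone does not. Whether a bare prefix should also count as a hit is a separate design decision.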

Possibly @arlogriffiths will have something more to add.

wsalesky commented 7 years ago

Please test by deploying the new .xconf file (located in the data app)

arlogriffiths commented 7 years ago

If this is not already the case, the presence of <lb>, <space> and <milestone> tags should also be ignored by search.

I hope I’m not forgetting anything else important.

Arlo

(In reply to Andrew Ollett's comment of 15 August 2017, 07:07.)

aso2101 commented 7 years ago

add <space> and <milestone> (<lb> is already covered)

arlogriffiths commented 7 years ago

What about <pb> (used in encoding copper-plates)?

(In reply to Andrew Ollett's comment of 19 August 2017, 17:36.)

wsalesky commented 7 years ago

@arlogriffiths would a <pb> appear in the middle of a word?

aso2101 commented 7 years ago

@wsalesky Yes, in the situations Arlo mentioned, <pb> can appear in the middle of a word.
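
For testing, the effect of ignoring a mid-word break can be sketched in Python as follows. The helper name `searchable_text` and the sample word split are made up for illustration, and the sketch assumes the break element sits directly inside the word with no surrounding whitespace in the source.

```python
import xml.etree.ElementTree as ET

# Empty "break" elements named in this thread.
BREAKS = {"lb", "pb", "space", "milestone"}


def searchable_text(xml_fragment):
    """Return the text with break elements removed but their tails kept."""
    root = ET.fromstring(f"<ab>{xml_fragment}</ab>")
    parts = [root.text or ""]
    for child in root:
        if child.tag not in BREAKS:
            parts.append("".join(child.itertext()))
        # The tail is the text after the element; it always stays.
        parts.append(child.tail or "")
    return "".join(parts)


print(searchable_text('gaha<pb n="2"/>patino'))
```

Whitespace in the source is left untouched, so a break that really falls between two words still separates them.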

wsalesky commented 7 years ago

@aso2101 Okay... adding it tonight.

wsalesky commented 7 years ago

Added to data repository (branch: https://github.com/aso2101/satavahana-inscriptions-data/tree/issue60). Redeploy to test.

aso2101 / satavahana-inscriptions

Search issues: making inline elements "invisible" #60