UniStuttgart-VISUS / damast

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World" (VolkswagenFoundation)
MIT License

Example when explaining regular expressions for "Place search" #67

Closed tutebatti closed 2 years ago

tutebatti commented 2 years ago

In the current example in the info text for the search of places, one reads:

The search field supports JavaScript-style regular expressions. For example, to search for locations with an Arabic definite article, the query \ba([tdrzsṣḍṭẓln]|[tds]h)- can be used.

If I understand correctly from the list of places, we do not use the DMG notation for Arabic articles (cf. https://de.wikipedia.org/wiki/DIN_31635). That example makes little sense, then. Any better suggestions? @rpbarczok, you probably know the data itself better than @mfranke93?
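For readers unfamiliar with the quoted expression, here is a minimal sketch of what it matches: an "a" at a word boundary, followed by "l" or an assimilated sun letter (optionally a digraph such as "sh"), followed by a hyphen. Apart from "al-Ahsa" and "Baghdad", which appear in this thread, the names are illustrative and not taken from the database.

```javascript
// The expression quoted in the info text, written as a JavaScript regex literal.
const ARTICLE = /\ba([tdrzsṣḍṭẓln]|[tds]h)-/;

// Illustrative names only; "ash-Sham" and "ar-Raqqa" are assumptions for the demo.
const names = ["al-Ahsa", "ash-Sham", "ar-Raqqa", "Baghdad"];

// Keep only the names that contain an article in this notation.
const withArticle = names.filter((name) => ARTICLE.test(name));

console.log(withArticle); // [ 'al-Ahsa', 'ash-Sham', 'ar-Raqqa' ]
```

Names whose article is not written in this hyphenated, assimilated form are not matched, which is the reason the example may no longer fit the data.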

mfranke93 commented 2 years ago

I think at the time I wrote that example, at least some places did. I don't think any do anymore. The example might be too complex for casual users anyways, but I thought it was neat to demonstrate what it could be used for ;) feel free to do some simplification here. Maybe this doesn't have to be so detailed, and we can link here for power users that want to do more than normal text search.

mfranke93 commented 2 years ago

By the way: The code that sorts the place names alphabetically uses that exact RegEx to this day so that the Arabic definite articles don't affect the sorting, both for the visualization and the reports. This is also why this, and the initial apostrophe, are aligned differently in the location list.
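A rough sketch of how such article-insensitive sorting could work is shown below. The helper names `sortKey` and `sortPlaceNames` are hypothetical, and the actual damast code may differ; the sketch only handles the article, not the initial apostrophe mentioned above. "al-Kufa" and "Mosul" are illustrative names.

```javascript
// Same expression as in the info text; without the "g" flag, only the
// first occurrence is removed.
const ARTICLE = /\ba([tdrzsṣḍṭẓln]|[tds]h)-/;

// Hypothetical helper: build a comparison key that ignores the article.
function sortKey(name) {
  return name.replace(ARTICLE, "").toLowerCase();
}

// Hypothetical helper: sort a copy of the list by the article-free key.
function sortPlaceNames(names) {
  return [...names].sort((a, b) => {
    const ka = sortKey(a);
    const kb = sortKey(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  });
}

const sorted = sortPlaceNames(["al-Kufa", "Baghdad", "al-Ahsa", "Mosul"]);
console.log(sorted); // [ 'al-Ahsa', 'Baghdad', 'al-Kufa', 'Mosul' ]
```

With this key, "al-Kufa" sorts under K rather than A, which matches the behavior described in the comment.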

tutebatti commented 2 years ago

I could think of something simple like

searching for Bagh?dad would find "Bagdad" as well as "Baghdad", because h followed by ? matches zero or exactly one h.

Linking to the external documentation is good, too.
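The proposed example can be checked quickly in any JavaScript console. This is only a sketch of the regular expression itself; whether the place search applies extra options such as case-insensitivity is not stated here. The candidate strings other than "Bagdad" and "Baghdad" are made up for contrast.

```javascript
// "h?" matches zero or exactly one "h".
const pattern = /Bagh?dad/;

const candidates = ["Bagdad", "Baghdad", "Baghhdad", "Basra"];

// Keep only the candidates the pattern matches.
const matches = candidates.filter((candidate) => pattern.test(candidate));

console.log(matches); // [ 'Bagdad', 'Baghdad' ]
```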

Btw, why is searching Bagdad finding "Baghdad" already now? As far as I can see, the former is not listed under "alternative names".

mfranke93 commented 2 years ago

I could think of something simple like

searching for Bagh?dad would find "Bagdad" as well as "Baghdad", because h followed by ? matches zero or exactly one h.

I like it!

Btw, why is searching Bagdad finding "Baghdad" already now? As far as I can see, the former is not listed under "alternative names".

Because "Bagdad" appears in the simplified column of the alternative names (name_var) table. That is one of the places searched. Why the place search claims "external URI matches" is beyond me though. That is a bug (#68).

tutebatti commented 2 years ago

I like it!

:+1:

Because "Bagdad" appears in the simplified column of the alternative names (name_var) table.

But these simplified names are not displayed in the tooltip?

mfranke93 commented 2 years ago

No. The transcription is. See also:

tutebatti commented 2 years ago

I'm not sure if there's a misunderstanding, but I cannot see any section or something similar entitled transcription. [screenshot]

mfranke93 commented 2 years ago

transcription is a column for alternative names. The primary name of a place is always transcribed already, but for alternative names, it could for example be in Arabic script, and then the transcription would provide a "European-readable" version of the name. If you look at the URI page for Baghdad, it is what is written in parentheses in the Arabic name variant (بغداد). This also appears in reports. There is no such section here because it is not an attribute of the place itself.

tutebatti commented 2 years ago

In other words, there is a match when searching because the term matches the simplified transcription of an alternative name?

At any rate, I will discuss this with @rpbarczok. I'm not sure how much of this behavior must be made transparent to the visitor who has no access to the db itself, but can only see the tooltip or the URI page which does not provide the simplified transcription either.

mfranke93 commented 2 years ago

In other words, there is a match when searching because the term matches the simplified transcription of an alternative name?

Yes. See https://github.tik.uni-stuttgart.de/frankemx/damast/issues/64

tutebatti commented 2 years ago

Ok. As @rpbarczok told me, simplified should be mostly consistent in that it represents (i.e., at least one of the strings in simplified represents) a "normalized" form of transcription. It is sufficient to make that transparent to the user.

(It would be preferable, of course, if the transcription were automatically normalized according to given patterns and the results stored in a separate column. Apparently, this is not (easily) implementable. Entering simplified transcriptions manually is prone to errors.)

rpbarczok commented 2 years ago

I forgot to mention that we also add an English simplified transcription in the simplified table (e.g. gh, kh, j, sh etc.). Usually we use the simplified English transcription as the main name, but in the case that there is more than one Arabic variant. E.g. in the case of al-Ahsa. For the name variant هجر, we give the transcription Haǧar, and the simplified forms Hagar and Hajar.

tutebatti commented 2 years ago

but in the case that there is more than one Arabic variant

@rpbarczok, you mean "but only in the case that"...?

What is more, I'm not sure what to tell the user regarding what you stated.

mfranke93 commented 2 years ago

Just my 2 cents: We included this originally to make the search a bit more powerful and also forgiving. So, we wouldn't have to type names exactly (with the ǧ etc.), but could use a Latin g. Since this is quite hard to do only in software (there are a lot of letters with diacritics, hard not to miss some, ...) we decided it would be good to save the typical "latinified" names in the database. In my opinion, this is an implementation detail users do not need to know about at all. The only thing to communicate here would be that the search box is a bit more forgiving regarding exact spelling (or accepts variant spellings of places).
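To illustrate why a purely software-based approach only gets part of the way, here is a minimal sketch using Unicode normalization. The function name `stripDiacritics` is hypothetical and this is not what damast does (it stores the simplified forms in the database, as described above).

```javascript
// Hypothetical helper: decompose characters (NFD) and drop combining marks,
// so that e.g. "ǧ" becomes a plain "g".
function stripDiacritics(text) {
  return text.normalize("NFD").replace(/\p{M}/gu, "");
}

console.log(stripDiacritics("Haǧar")); // "Hagar"
```

This yields the basic-letter form "Hagar", but not an English-style form such as "Hajar", which requires mapping letters rather than just removing marks. Characters that have no Unicode decomposition are also left untouched, so storing the variants explicitly remains the more predictable option.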

rpbarczok commented 2 years ago

I am sorry, the sentence was mutilated when editing it. What I mean is: For Arabic and other forms, we usually have one transcribed form in the transcription system of the DMG. Additionally, we save the basic form of the letters in the simplified forms. We later decided also to include the simplified English transcription. So basically you can inform that the user usually should find a place also by entering the basic forms of the letters and by looking for a simplified English transcription, e.g. Hajar.
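The search behavior discussed in this thread can be sketched roughly as follows. The record shape, the field names other than transcription and simplified, and the function `searchPlaces` are assumptions for illustration; the real search may run elsewhere and look quite different. The values come from this thread, and the transcription for Baghdad is left empty because it is not stated here.

```javascript
// Hypothetical in-memory records; the real data lives in the database.
const places = [
  {
    name: "al-Ahsa",
    alternativeNames: [
      { name: "هجر", transcription: "Haǧar", simplified: ["Hagar", "Hajar"] },
    ],
  },
  {
    name: "Baghdad",
    alternativeNames: [
      // Transcription not stated in this thread, so it is left as null.
      { name: "بغداد", transcription: null, simplified: ["Bagdad"] },
    ],
  },
];

// Hypothetical search: test the query as a regular expression against the
// primary name and against every field of every alternative name.
function searchPlaces(query, records) {
  const regex = new RegExp(query);
  const results = [];
  for (const place of records) {
    const matchedIn = [];
    if (regex.test(place.name)) matchedIn.push("name");
    for (const alt of place.alternativeNames) {
      if (regex.test(alt.name)) matchedIn.push("alternative name");
      if (alt.transcription !== null && regex.test(alt.transcription)) {
        matchedIn.push("transcription");
      }
      if (alt.simplified.some((s) => regex.test(s))) matchedIn.push("simplified");
    }
    if (matchedIn.length > 0) results.push({ place: place.name, matchedIn });
  }
  return results;
}

console.log(searchPlaces("Hajar", places));
// [ { place: 'al-Ahsa', matchedIn: [ 'simplified' ] } ]
console.log(searchPlaces("Bagdad", places));
// [ { place: 'Baghdad', matchedIn: [ 'simplified' ] } ]
```

Reporting which field produced the match, as `matchedIn` does here, is one way to make the result reproducible for a visitor who cannot see the simplified forms.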

tutebatti commented 2 years ago

The only thing to communicate here would be that the search box is a bit more forgiving regarding exact spelling (or accepts variant spellings of places).

I might not be the average user in that case, but I would want to know how the search works exactly and how I can reproduce results. But I will certainly find an explanation (which you will correct, if necessary) for the current behavior. This is already pretty good:

So basically you can inform that the user usually should find a place also by entering the basic forms of the letters and by looking for a simplified English transcription, e.g. Hajar.