Closed by blms 9 months ago
Are the non-breaking space characters important? Wondering if we can clean them up in the db (a one-time cleanup plus ongoing cleanup).
Alternatively, we could add a rule in Solr to convert non-breaking spaces to regular spaces for search purposes.
@rlskoeser I would guess they are not important, since they seem to be applied rarely and inconsistently.
Here's a small sample of some descriptions where they appear:
Letter fragment in Judaeo-Arabic. Dating: probably 13th century. Mentions many greetings. Names include: Mūsā; Yehuda; Sulaymān al-Kohen al-Ṣayrafī; Abū\xa0Naṣr b. Sālim; Hārūn; Abū l-ʿAlāʾ\xa0b. Nuʿmān; Namir; Abū l-ʿIzz al-Murahhiṭ (composer of liturgical poetry; he appears also in T-S K15.43 (PGPID 4619)); and Ibn Abū l-Ghayth. Also mentions commodities such as amomum (qāqulla). (Information in part from CUDL.) There are probably several joins waiting to be found.
Letter in the hand of Yefet b. Menashshe. In Judaeo-Arabic. Two fragments comprising the bottom part of the letter. Refers to the qāḍī Amīn al-Dawla Abū\xa0ʿAlī; a request that the addressee put in a good word for the bearer of this letter, Abū\xa0ʿAlī Ibn Qaṭāʾīf, who has never purchased goods from government bureaus (dīwān) or public auctions (ḥalqa) (והו מא ישתרי שי מן אלדיואן ולא מן חלקה ולא לה אסם בשרא חואיג מן דיואן) but who is being persecuted by the police (al-rajjāla) on account of his unemployment. There is a version of a raʾy clause toward the end. (Information in part from CUDL.) Join: Alan Elbaum.
Letter\xa0addressed to Shelomo Ḥalafta Yerushalmi and Meir Ashkenazi, concerning business, the commodity saffron (זעפראן)" is mentioned.\xa0R. Yiṣḥaq Luria\xa0(ha-Ari)\xa0was a part of this business as well.\xa0(Information from David Avraham, Alei Sefer 14, 1987, p. 135, and David Avraham, “The Role of Egyptian Jews in Sixteenth-century International Trade with Europe,” in From a Sacred Source: Genizah Studies in Honour of Professor Stefan C. Reif,” 106 ). VMR and EMS Verso: Jottings of names and accounts, mentioning Rašīd and Nissim Sason. C. 15th-17th century. (Information from CUDL)
Recto: fragment of\xa0a\xa0recommendation letter\xa0from Daniel b. Azarya\xa0to Eli ha-Ḥaver b. Amram, Fustat. The name of the person who is recommended is unknown but it seems he belonged to one of the Israeli Gaon families. (Gil, Palestine, vol. 2, 688-689, Doc. #372)\xa0. VMR
PGPID 19595 (this was the only one I found where it appears inside a snippet of JA, or Arabic or Hebrew):
Legal document in Judaeo-Arabic. Fragmentary, so difficult to figure out the details. Sets out provisions for the care of a minor boy until he comes of age. Mentions a female slave (al-jāriya al-kabīra) twice, once in the context of someone being granted ownership (תצרפת\xa0פי גאריתהא\xa0תצרף אלמלאך).
Do you think it's worth pinging the researchers here, or do you agree it seems inconsequential? I would guess it's from copy pasting, maybe from Word.
I like your idea of the one-time migration and applying it as a cleanup to every new description added.
The ones between parts of names seem potentially intentional, but the rest do not! Maybe a quick check with the research team if they ever intentionally enter non-breaking spaces that need to be preserved. I wonder if some of these are holdovers from some other system - can you easily tell from log entries if there is something common about where they were imported from, when they were created?
You should probably check if these are present in other fields - transcriptions seem most likely to have messy unicode characters, but could imagine them showing up elsewhere, in names maybe.
@rlskoeser Will do!
Good catch about checking the log entries. Dates are all over the place for when they're created (1986-2021), but most were ingested by spreadsheet import. However, I'm not sure if that's just proportional to the data overall, as around 12% of them were directly created in the admin.
This character doesn't appear in any other Document fields or in any Annotation fields, and not in any Names either (yet! but we don't have many of those).
From Slack:
Alan Elbaum: Many of the examples were written by me, and they were not intentional or meaningful! Could be something to do with copying/pasting, or maybe something to do with the diacritics?
Marina Rustow: I am reasonably certain that none of those is intentional. Can you throw the whole list at me so I can double check? And then let’s banish them. How do you even produce that on a keyboard? If it’s something like option-space, then it might be an artifact of a heavy foot on the clutch as it were — pressing option too early or for too long.
Ben Silverman: I think you’re exactly right about how they were produced, as it is indeed option-space on Mac.
Decision from team is to create a migration for existing records and apply cleanup to new entries in the future.
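For reference, a minimal sketch of what the shared cleanup logic could look like, assuming records can be treated as dict-like objects with a `description` field. The names `clean_nbsp` and `clean_records` are hypothetical and not taken from the codebase; the same function could in principle be called from both a one-time data migration and a save-time hook.

```python
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE


def clean_nbsp(text: str) -> str:
    """Replace every non-breaking space with a regular space."""
    return text.replace(NBSP, " ")


def clean_records(records, field="description"):
    """Clean one field on dict-like records in place.

    Returns the number of records that were actually changed.
    (Hypothetical helper; real records would be model instances.)
    """
    changed = 0
    for record in records:
        original = record.get(field) or ""
        cleaned = clean_nbsp(original)
        if cleaned != original:
            record[field] = cleaned
            changed += 1
    return changed


# Illustrative data only: a shortened excerpt of the PGPID 17583
# description, plus a made-up record with no non-breaking spaces.
sample = [
    {"pgpid": 17583, "description": "Abū\xa0Naṣr b. Sālim"},
    {"pgpid": None, "description": "hypothetical clean record"},
]
print(clean_records(sample))
```

This does a straight one-for-one replacement, so if a non-breaking space ever sits next to a regular space the result would contain a doubled space; collapsing those would be a separate decision.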
@blms works as it should (only one record pops up at the above search). Closing, thank you!
testing notes (QA)
on the QA public site, search for
"Abū Naṣr b. Sālim"
and confirm at least one record (PGPID 17583) shows up
dev notes
Non-breaking space characters (\xa0) in descriptions cause a problem in search indexing. For example, a search for
"Abū Naṣr b. Sālim"
will not return any results, because the actual description as indexed reads "Abū\xa0Naṣr b. Sālim" (pgpid 17583).
Quick queries show about 277 records with this character in the description.
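To illustrate why the search misses, here is a small sketch using plain substring matching as a stand-in for the search index. This is an analogy only, not how Solr tokenizes or matches; `matches` and `count_with_nbsp` are hypothetical helpers, and the sample list is made up.

```python
NBSP = "\xa0"  # U+00A0 NO-BREAK SPACE

# What a user types: regular spaces throughout.
query = "Abū Naṣr b. Sālim"
# What is stored: the first space is a non-breaking space.
indexed = query.replace(" ", NBSP, 1)


def matches(query: str, text: str) -> bool:
    """Plain substring match, standing in for an exact phrase search."""
    return query in text


def count_with_nbsp(descriptions) -> int:
    """Count descriptions that contain at least one non-breaking space."""
    return sum(1 for d in descriptions if NBSP in d)


print(matches(query, indexed))                     # False
print(matches(query, indexed.replace(NBSP, " ")))  # True
print(count_with_nbsp([indexed, "plain text"]))    # 1
```

The two strings look identical on screen but differ by one code point, which is enough to break an exact phrase match.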