ProjectJaraid / jaraid_source

Master and authority files of Project Jarāʾid

Authority file: correct entries (including Arabic) #86

Closed tillgrallert closed 3 years ago

tillgrallert commented 4 years ago

There are multiple issues with names in the authority file that will require fixing.

  1. The automatically generated translation of Latin into Arabic script needs to be corrected, particularly for names that were not originally Arabic. Due to the mark-up in the original master file, these could not be filtered out beforehand. They can be found with some regex, I suppose, because the automatic translation often contains a mix of Arabic and Latin characters within a single word:
<persName xml:lang="ar-Latn-x-ijmes">Isaac Temām</persName>
<persName change="#d2e154" xml:lang="ar">اسc تeمام</persName>
  2. Some <persName> nodes contain multiple names separated by /. These need to be split into individual <persName> children of the containing <person> element.

  3. DONE Quite a few <persName>s do not carry an @xml:lang attribute and have, therefore, not been translated into Arabic script. @xml:lang needs to be added first and then we re-run the translation script.
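
A minimal sketch of the regex check suggested in the first point, written in Python. The function name and the character ranges are assumptions made for illustration: "Arabic" is taken to be the basic Arabic block (U+0600 to U+06FF) and "Latin" to be ASCII letters plus the Latin-1 and Latin Extended-A/B ranges, which roughly cover the IJMES diacritics.

```python
import re

# Assumed character ranges (not taken from the project's own scripts).
ARABIC = r"\u0600-\u06FF"
LATIN = r"A-Za-z\u00C0-\u024F"

# A whitespace-delimited word counts as "mixed" if it contains at least
# one Arabic-script and at least one Latin-script character.
MIXED_WORD = re.compile(rf"(?=\S*[{ARABIC}])(?=\S*[{LATIN}])\S+")


def find_mixed_script_words(text: str) -> list[str]:
    """Return every word in `text` that mixes Arabic and Latin characters."""
    return MIXED_WORD.findall(text)


if __name__ == "__main__":
    for sample in ("اسc تeمام", "Isaac Temām"):
        print(repr(sample), find_mixed_script_words(sample))
```

The check works word by word, so a name in which a stray Latin letter stands alone as its own word would not be flagged and would need a separate test.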

tillgrallert commented 4 years ago

IMPORTANT NOTE: when cleaning the authority file and resolving duplicates, do not delete any IDs. Instead, unified entities should gather all IDs of the former duplicates. This is necessary because the master file links to these IDs and the links will be broken when IDs are deleted.
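
The rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name merge_person_ids is hypothetical, the fragments are namespace-less (the real authority file may use the TEI namespace), and only the <idno> children are carried over, leaving the names for manual review.

```python
import copy
import xml.etree.ElementTree as ET


def merge_person_ids(keep: ET.Element, duplicate: ET.Element) -> ET.Element:
    """Copy every <idno> of `duplicate` into `keep` unless it is already there.

    Hypothetical helper: nothing is deleted from `keep`, so links that
    point to any of the former IDs can still be resolved.
    """
    seen = {(idno.get("type"), idno.text) for idno in keep.findall("idno")}
    for idno in duplicate.findall("idno"):
        key = (idno.get("type"), idno.text)
        if key not in seen:
            keep.append(copy.deepcopy(idno))
            seen.add(key)
    return keep


if __name__ == "__main__":
    # Two made-up entries for the same person; the ID values are examples.
    a = ET.fromstring('<person><persName>A</persName><idno type="jaraid">26</idno></person>')
    b = ET.fromstring('<person><persName>A</persName><idno type="jaraid">31</idno></person>')
    print(ET.tostring(merge_person_ids(a, b), encoding="unicode"))
```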

Mestyan commented 4 years ago

Fantastic, @tillgrallert. I hope that you & family are safe in Beirut. I will turn to this at EST night. Question: if I clean the authority file, is the master automatically repopulated?

tillgrallert commented 4 years ago

Dear @Mestyan, yes, we are safe. Mainly because I travelled to Beirut alone ;-) Concerning your actual question, the master file has to be actively updated from the authority file.

Mestyan commented 4 years ago

Great! So I guess the best is to clean the authority file first?


tillgrallert commented 4 years ago

The good thing is that one can rerun the scripts at any point. Therefore, there is no need to clean the authority file first. We can work incrementally instead.

tillgrallert commented 4 years ago

I fixed point 3 in the above issue description.

Mestyan commented 4 years ago

@tillgrallert I merged to master and I also merged my little additions of holdings from the last two days in gh-pages. There was a conflict but I checked and edited. Please fetch - just to be up-to-date! So I work solely on corrections in the master file (on gh-pages). I will also test the CETEIcean transformation and the JS needed to represent the new Arabic data. Very exciting!

Mestyan commented 3 years ago

Couple of questions:

  1. The authority file is not valid because there are two duplicate xml:id values: "persName_525.d5e2753" and "persName_528.d5e2768". What is the right method to correct this? Can I delete them or overwrite them?

Mestyan commented 3 years ago

  2. Is there a way to identify the journal title to which a name in the authority file belongs?

Mestyan commented 3 years ago

  3. Can you try the Arabization of orgNames? Or I can just start to add them manually, but then how should I generate xml:ids?

tillgrallert commented 3 years ago

Dear Adam,

  1. Duplicate @xml:id values can be changed manually, and this shouldn't break anything. They are automatically generated and apparently the code to do so isn't foolproof.
  2. I am afraid I don't understand the question. The set-up is such that the authority file has no knowledge of any other files (master or not). It only provides (long) lists of people, organisations, and places which other files can link to through their respective IDs (as recorded in the <idno type="jaraid"> children).
  3. <orgName>s have already been Arabized. What still needs to be done is linking the master file to the authority file. So you can edit the <orgList> in the authority file like everything else. Please take note that quite a few <person> elements are indeed organisations and I have marked some of them with comments (<!-- ... -->). I will try and add the code to point from <orgName>s in the master file to the authority file over the coming days.
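
Duplicate @xml:id values like the ones reported above can be listed before they are changed by hand. The following Python sketch is one possible way to do so; the function name is hypothetical, and it relies on the fact that the standard-library parser does not itself enforce ID uniqueness, so an invalid file can still be read.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_xml_ids(root: ET.Element) -> list[str]:
    """Return, in sorted order, every xml:id value used more than once."""
    counts = Counter(
        el.get(XML_ID) for el in root.iter() if el.get(XML_ID) is not None
    )
    return sorted(xml_id for xml_id, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Made-up fragment; only the ID values are taken from the report above.
    doc = ET.fromstring(
        '<listPerson>'
        '<persName xml:id="persName_525.d5e2753">A</persName>'
        '<persName xml:id="persName_525.d5e2753">B</persName>'
        '<persName xml:id="persName_528.d5e2768">C</persName>'
        '</listPerson>'
    )
    print(duplicate_xml_ids(doc))
```
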
Mestyan commented 3 years ago
  1. excellent
  2. OK - for the cleaning purposes I always have to see the full entry in master to which the authority persName etc belongs. This is not a big deal just a tiny bit of changing windows and searching
  3. great, I will edit as well !
Mestyan commented 3 years ago

With the persName I follow our principle for the titles: if the original is an English, French, Italian, etc. name in Latin characters, then I just correct all of its versions into Latin characters and correct the xml:lang attributes as well, OK? @tillgrallert

Mestyan commented 3 years ago

However, I did one and it disappeared from master. Possibly I should not correct that xml:lang? @tillgrallert

Mestyan commented 3 years ago

I will stop working and wait for your answer, so as not to cause a huge problem again.

tillgrallert commented 3 years ago

Dear Adam,

Don't worry. Everything works as intended. If there is no <persName xml:lang="ar"> for a <person> in the authority file, the XSLT cannot add it to the master file. Let's look at the example you provided:

<person xml:id="person_26.d5e406">
    <persName xml:id="persName_45.d5e408" xml:lang="ar-Latn-x-ijmes">Victor Barruland</persName>
    <persName change="#d2e154" xml:id="persName_6112.d5e30980" xml:lang="ar">v برلند</persName>
    <tei:persName change="#d5e194" corresp="#persName_6112.d5e30980" type="flattened" xml:lang="ar">vبرلند</tei:persName>
    <persName change="#d5e126" corresp="#persName_45.d5e408" type="flattened" xml:id="persName_1.d6097e1" xml:lang="en">VictorBarruland</persName>
    <idno type="jaraid">26</idno>
</person>
  1. You can completely ignore all <persName type="flattened"> nodes. These are computationally generated and their only purpose is to aid the computational look-up of names.
  2. Your change of @xml:lang from "ar-Latn-x-ijmes" to "fr" on the first <persName> is absolutely correct.
  3. What to do with Arabic versions of such names?
    1. We can just delete the automatically generated ones.
    2. You can come up with your own transliteration into Arabic and add an appropriate value of @xml:lang, such as "fr-Arab-AR" (meaning common rendering of French into Arabic).

The result would then look like (example)

<person xml:id="person_26.d5e406">
    <persName xml:id="persName_45.d5e408" xml:lang="fr">Victor Barruland</persName>
    <persName change="#d2e154" xml:id="persName_6112.d5e30980" xml:lang="fr-Arab-AR">فكتور بريلان</persName>
    <!-- ... --> 
    <idno type="jaraid">26</idno>
</person>

In order to display something for non-Arabic names in the Arabic columns of the table, we could make the decision to use the original Latin-script name as a fallback option.

Mestyan commented 3 years ago

Hm, this is an interesting problem. In the case of titles in column 10 we decided to keep mixing languages because the idea is that the original product, i.e. the journal itself, had various scripts in its header and we wanted to reproduce it. In this case, we do not really have an idea about whether the French/Italian/British/Syrian editors actually had their names in Latin and Arabic (and Hebrew) scripts, and if so, how they actually transcribed their names from one language to the other. We would not know unless we re-check each and every case, and what we will find, of course, is that they did not conform to the IJMES transliteration into Latin script or the proper transcription into Arabic script. In these cases, we will have unconventional names. I would suggest that we use the original-script name in Latin. What do you think?

tillgrallert commented 3 years ago

I am fine with this fallback to the original Latin script name.

Mestyan commented 3 years ago

I looked into the XSLT that generates a new master from the authority file in order to create the fallback option, but it is beyond my knowledge. Can you do it, please? So that column 11 would use the "en", "fr", "it" or whatever is in column 6 if there is no "ar" there? @tillgrallert

tillgrallert commented 3 years ago

This has been done