erc-dharma / project-documentation

DHARMA Project Documentation
Creative Commons Attribution 4.0 International

Elements not documented in the EGD #299

Open michaelnmmeyer opened 2 months ago

michaelnmmeyer commented 2 months ago

While cleaning up our schema, I found a few elements that are not documented in the EGD but that occur in a significant number of inscriptions. This mainly concerns texts from tfb-eiad-epigraphy and from tfc-campa-epigraphy. Here is the list:

altIdentifier
biblFull
desc
editionStmt
editor
facsimile
graphic
institution
settlement
term

All these elements but term appear in the teiHeader, and most of them are due to the addition of bibliographical data under sourceDesc.
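For reference, a check of this kind can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the schema cleanup: the `DOCUMENTED` set and the `SAMPLE` document are made up, and `undocumented_elements` is a hypothetical helper name. It counts, for each element name that is not in the documented set, how many files use it.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def undocumented_elements(xml_texts, documented):
    """Count, per element name, how many documents use an element
    that is absent from the `documented` set."""
    counts = Counter()
    for text in xml_texts:
        root = ET.fromstring(text)
        # A set, so each name is counted at most once per document.
        names = {local_name(el.tag) for el in root.iter()}
        counts.update(names - documented)
    return counts


# Hypothetical data: a tiny "documented" set and a minimal TEI-like file.
DOCUMENTED = {"TEI", "teiHeader", "fileDesc", "sourceDesc", "text", "body", "p"}
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblFull/></sourceDesc></fileDesc></teiHeader>
  <text><body><p>A <term>word</term></p></body></text>
</TEI>"""

result = undocumented_elements([SAMPLE, SAMPLE], DOCUMENTED)
for name, n_files in sorted(result.items()):
    print(name, n_files)
```

In practice the documented set would presumably be derived from the schema or from the EGD itself, and the texts would be read from the repositories.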

I am not sure what to do with the data, but, in any case, I would prefer not to allow bibliographic entries to be encoded in TEI (with biblFull). Things would be simpler for me if we used <bibl><ptr ref="..."/></bibl> with a Zotero entry everywhere.

danbalogh commented 2 months ago

I believe the contents of those two repositories are all files encoded for previous projects and not (yet) migrated to DHARMA norms. If Arlo confirms that, I think it would be best to just ignore them for the time being, and perhaps create a new list of undocumented elements occurring outside those repos. Some comments:

I don't know if any of the other elements are to be recognised as legitimate and in what contexts.

michaelnmmeyer commented 2 months ago

@danbalogh Thank you. My bad for <term>, I missed it somehow.

arlogriffiths commented 2 months ago

Thanks.

I'd like to know where <term> occurs. It may be a practice from EGC that has crept into a few inscription xml files.

To my mind we never really discussed how to deal with legacy files obtained by conversion to our model from the Campa and EIAD corpora, after Axelle had handled the import. We should probably have that discussion now.

Unless you feel it is a bad idea, I'd be happy to delete from the teiHeaders in our converted INSCIC files all such elements that are inherited from the ancestor files. I can do this manually or perhaps @michaelnmmeyer could automate the process. The percentage of files inherited from the earlier Campa corpus that will eventually be part of tfc-campa-epigraphy will be less than 25%, I think, and we don't need to be slavish about whatever best practices may be for reuse of xml data.

In fact, I had been meaning to discuss the issue of EIAD files with @michaelnmmeyer. These were imported by Axelle at a fairly early stage of the project from my private iksvaku-inscriptions repo. Since it is the latter which has the source code for http://hisoma.huma-num.fr/exist/apps/EIAD/index2.html, and my collaborator Vincent Tournier requested some updates after Axelle imported the xml source files to erc-dharma, a small number of discrepancies have arisen, with better data in iksvaku-inscriptions than what we have in tfb-eiad-epigraphy. I estimate it's a handful of cases, and they can probably be tracked down easily via the record of commits on iksvaku-inscriptions. Would @michaelnmmeyer be willing to track down and list the meaningful differences, if I gave him access to iksvaku-inscriptions, so that we can then freeze that repo, implement the same changes on erc-dharma, and only use the latter versions of the EIAD files henceforward?

michaelnmmeyer commented 2 months ago

@arlogriffiths

<term> occurs exclusively in files from tfb-eiad-epigraphy (about 40 of them in total), e.g. DHARMA_INSEIAD00002.

If you think the extra data in CIC and EIAD files is unnecessary, I can delete them. For the EIAD files, I can produce a diff of the files Axelle processed and the latest revision.
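A minimal sketch of what such a diff could look like, using Python's standard `difflib`. The two sample texts and the `imported/` and `latest/` labels are invented for illustration; in practice the two versions would come from tfb-eiad-epigraphy and from the latest revision in iksvaku-inscriptions. The helper name `revision_diff` is hypothetical.

```python
import difflib


def revision_diff(old_text, new_text, name):
    """Return a unified diff between two revisions of the same file,
    or an empty string if they are identical."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(
            old_lines,
            new_lines,
            fromfile=f"imported/{name}",  # hypothetical label: version imported earlier
            tofile=f"latest/{name}",      # hypothetical label: latest revision
        )
    )


# Made-up contents standing in for two revisions of one file.
old = "<p>line one</p>\n<p>old reading</p>\n"
new = "<p>line one</p>\n<p>new reading</p>\n"

diff = revision_diff(old, new, "DHARMA_INSEIAD00002.xml")
print(diff)
```

Files for which the function returns an empty string can be skipped, which leaves only the handful of files with meaningful differences to review by hand.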

arlogriffiths commented 2 months ago

Thanks @michaelnmmeyer. I will take a look at those cases of <term>. I don't remember now why that element would have been used there. I am now giving you access to iksvaku-inscriptions. Thanks for generating that diff!

About the removal of extra metadata from CIC and EIAD files, I'd like to have @danbalogh's advice. Can you look at a few files, Dan? In tfc-campa-epigraphy, converted files included DHARMA_INSCIC00001.xml, DHARMA_INSCIC00001.xml and DHARMA_INSCIC00064.xml. Thanks!

danbalogh commented 1 month ago

I would recommend against deleting any data that have already been encoded, unless we are very sure we don't need them. We still don't have a definitive setup for encoding roles and responsibilities in our DHARMA editions. We should perhaps try to sort that out in the EGD working group. At any rate, I think that until then the extra TEI header data in CIC and EIAD files should be either just ignored, or - if its presence bothers someone - commented out.

arlogriffiths commented 1 month ago

Thanks Dan. In the CIC files, there is stuff like <facsimile> referring to specific image files that we are not using in DHARMA, so I have trouble imagining any future use.

@michaelnmmeyer: do you think commenting out by machine is an option? Or would you only be able to automate the process in case we opt for deletion?
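To give an idea of what commenting out by machine might involve, here is a rough sketch with Python's standard `xml.etree.ElementTree`. It is not a description of any existing DHARMA tooling: the sample document and the `comment_out` helper are hypothetical. It also has known limitations: this parser drops comments already present in the file and does not preserve the original formatting exactly, so a real run would need a more careful approach.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without an 'ns0:' prefix.
ET.register_namespace("", TEI_NS)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def comment_out(parent, names):
    """Replace each descendant element whose local name is in `names`
    by an XML comment holding its serialization. Returns the number
    of elements commented out."""
    count = 0
    for index, child in enumerate(list(parent)):
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        if local_name(child.tag) in names:
            # tostring() would include the tail text, so detach it first.
            tail, child.tail = child.tail, None
            serialized = ET.tostring(child, encoding="unicode")
            if "--" in serialized:
                # '--' is not allowed inside an XML comment: leave as is.
                child.tail = tail
                continue
            comment = ET.Comment(" " + serialized + " ")
            comment.tail = tail
            parent[index] = comment
            count += 1
        else:
            count += comment_out(child, names)
    return count


# Made-up minimal document for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <facsimile><graphic url="image.jpg"/></facsimile>
</TEI>"""

root = ET.fromstring(SAMPLE)
n = comment_out(root, {"facsimile"})
output = ET.tostring(root, encoding="unicode")
print(n)
print(output)
```

Deletion would be the same loop with `parent.remove(child)` in place of the comment, so the two options are about equally easy to automate; the difference is mainly in what remains visible in the files afterwards.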

The main issue indeed touches on encoding roles and editorial responsibilities. I think it is indeed a high priority to bring that discussion to a conclusion, and I'd be happy if you could take the lead. We have potentially thousands of files to be revised on this matter once the decisions have been taken, so we'd better get our act together.

danbalogh commented 1 month ago

On roles and responsibilities, we need to sort out the missing details in my proposal in the Leftovers. See also #242.