Open michaelnmmeyer opened 6 months ago
I believe the contents of those two repositories are all files encoded for previous projects and not (yet) migrated to DHARMA norms. If Arlo confirms that, I think it would be best to just ignore them for the time being, and perhaps create a new list of undocumented elements occurring outside those repos. Some comments:
<altIdentifier> may have been used in files ingested from earlier projects; its use has never been formalised. See "Preserving the identifier of the text in the earlier project" in EGD Appendix G.
<term> is permitted in our editions, as per EGD §7.2.2. If the term elements have gloss siblings, then I think this should be fine. If you think it is called for, we might want to enforce @xml:lang on <term>; I believe there aren't many occurrences, so updating them manually should not be too much of a load.
I don't know if any of the other elements are to be recognised as legitimate and in what contexts.
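If we do decide to enforce @xml:lang on <term>, the occurrences to update by hand could be listed with something like the sketch below. It is only an illustration: the function name terms_missing_lang, the sample text and the language code in it are made up for the example, and it only checks whether <term> carries @xml:lang itself, not whether a language is inherited from an ancestor.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def terms_missing_lang(xml_text):
    """Return the text of every TEI <term> that has no @xml:lang of its own."""
    root = ET.fromstring(xml_text)
    return [
        "".join(term.itertext()).strip()
        for term in root.iter(f"{{{TEI_NS}}}term")
        if XML_LANG not in term.attrib
    ]


# Made-up sample, not taken from any DHARMA file.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <term xml:lang="san-Latn">dharma</term> <gloss>law</gloss>
    <term>vihara</term> <gloss>monastery</gloss>
  </p></body></text>
</TEI>"""

print(terms_missing_lang(sample))
```

Running the same function over each file in a repository would give the list of places to fix manually.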
@danbalogh Thank you. My bad for <term>, I missed it somehow.
Thanks.
I'd like to know where <term> occurs. It may be a practice from EGC that has crept into a few inscription xml files.
To my mind we never really discussed how to deal with legacy files obtained by conversion to our model from the Campa and EIAD corpora, after Axelle had handled the import. We should probably have that discussion now.
Unless you feel it is a bad idea, I'd be happy to delete from the teiHeaders in our converted INSCIC files all such elements that are inherited from the ancestor files. I can do this manually or perhaps @michaelnmmeyer could automate the process. The percentage of files inherited from the earlier Campa corpus that will eventually be part of tfc-campa-epigraphy will be less than 25%, I think, and we don't need to be slavish about whatever best practices may be for reuse of xml data.
In fact I had been meaning to discuss the issue of EIAD files with @michaelnmmeyer. These were imported by Axelle at a fairly early stage of the project from my private iksvaku-inscriptions repo. Since it is the latter which has the source code for http://hisoma.huma-num.fr/exist/apps/EIAD/index2.html, and my collaborator Vincent Tournier requested some updates after Axelle imported the xml source files to erc-dharma, a small number of discrepancies have arisen, with better data in iksvaku-inscriptions than what we have in tfb-eiad-epigraphy. I estimate it's a handful of cases, and they can probably be tracked down easily via the record of commits on iksvaku-inscriptions. Would @michaelnmmeyer be willing to track down and make a list of the meaningful differences, if I gave him access to iksvaku-inscriptions, so we can next freeze that repo, implement the same changes on erc-dharma, and only use the latter versions of the EIAD files henceforward?
@arlogriffiths <term> occurs exclusively in files from tfb-eiad-epigraphy (about 40 of them in total), e.g. DHARMA_INSEIAD00002.
If you think the extra data in CIC and EIAD files is unnecessary, I can delete them. For the EIAD files, I can produce a diff of the files Axelle processed and the latest revision.
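As a rough idea of what such a diff could look like, here is a minimal sketch using Python's standard difflib module. The function name revision_diff, the labels and the two sample strings are invented for illustration; the real comparison would read the imported file and the latest revision from the two repositories.

```python
import difflib


def revision_diff(old_text, new_text, old_label="imported", new_label="latest"):
    """Return a unified diff between two revisions of one file, as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


# Invented sample content standing in for two revisions of the same file.
imported = '<lb n="1"/>siddham\n<lb n="2"/>namo bhagavato\n'
latest = '<lb n="1"/>siddham\n<lb n="2"/>namo bhagavate\n'

for line in revision_diff(imported, latest):
    print(line)
```

Identical revisions produce an empty list, so only files with real differences would need to be looked at.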
Thanks @michaelnmmeyer. I will take a look at those cases of <term>. I don't remember now why that element would have been used there. I am now giving you access to iksvaku-inscriptions. Thanks for generating that diff!
About the removal of extra metadata from CIC and EIAD files, I'd like to have @danbalogh's advice. Can you look at a few files, Dan? In tfc-campa-epigraphy, converted files included DHARMA_INSCIC00001.xml, DHARMA_INSCIC00001.xml and DHARMA_INSCIC00064.xml. Thanks!
I would recommend against deleting any data that have already been encoded, unless we are very sure we don't need them. We still don't have a definitive setup for encoding roles and responsibilities in our DHARMA editions. We should perhaps try to sort that out in the EGD working group. At any rate, I think that until then the extra TEI header data in CIC and EIAD files should be either just ignored, or - if its presence bothers someone - commented out.
Thanks Dan. In the CIC files, there is stuff like <facsimile> referring to specific image files that we are not using in DHARMA, so I have trouble imagining any future use.
@michaelnmmeyer : do you think commenting out by machine is an option? or would you only be able to automate the process in case we opt for deletion?
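For what it's worth, commenting out can be sketched at the text level, which leaves the rest of the file byte-for-byte unchanged. The following is only an illustration under stated assumptions: the function name comment_out and the sample header are hypothetical, the element is assumed not to be nested inside itself, to carry no namespace prefix, and to have no ">" inside attribute values. A regular expression is a shortcut here, not a full XML parser.

```python
import re
import xml.etree.ElementTree as ET


def comment_out(xml_text, tag):
    """Wrap every <tag>...</tag> (or <tag/>) in an XML comment; leave the rest untouched."""
    name = re.escape(tag)
    pattern = re.compile(
        rf"<{name}(?=[\s/>])[^>]*?(?:/>|>.*?</{name}\s*>)", re.DOTALL
    )

    def wrap(match):
        # "--" is not allowed inside an XML comment, so break up any run of hyphens.
        body = re.sub(r"-(?=-)", "- ", match.group(0))
        return f"<!-- {body} -->"

    return pattern.sub(wrap, xml_text)


# Hypothetical header fragment, not copied from a CIC file.
sample = (
    '<teiHeader><facsimile><graphic url="img-01.jpg"/></facsimile>'
    "<title>X</title></teiHeader>"
)
result = comment_out(sample, "facsimile")
print(result)
ET.fromstring(result)  # raises ParseError if the result is no longer well-formed
```

Deletion would be the same substitution with an empty replacement, so either option could be automated in this way.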
The main issue indeed touches on encoding roles and editorial responsibilities. I think indeed it is a high priority to bring that discussion to a conclusion and I'd be happy if you could take the lead. We have potentially thousands of files to be revised on this matter once the decisions have been taken, so we'd better get our act together.
While cleaning up our schema, I found a few elements that are not documented in the EGD but that occur in a significant number of inscriptions. This mainly concerns texts from tfb-eiad-epigraphy and from tfc-campa-epigraphy. Here is the list:
All these elements but term appear in the teiHeader, and most of them are due to the addition of bibliographical data under sourceDesc.
I am not sure what to do with the data, but, in any case, I would prefer not to allow bibliographic entries to be encoded in TEI (with biblFull). Things would be simpler for me if we used <bibl><ptr ref="..."/></bibl> with a Zotero entry everywhere.