ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Support later version(s) of CBETA TEI XML #21

Closed by ajenhl 8 years ago

ajenhl commented 9 years ago

TACL currently supports converting the old CBETA TEI P4 XML. It would be good to (instead or as well?) support their later offerings. Unfortunately, there appear to be a plethora of these:

At one point at least, some or all of these used different encoding schemes. Once it is clearer what most (potential) users of TACL are using, then a decision can be made as to which format(s) to support. And in the meantime, hopefully things will settle down to a single form of encoding!

ajenhl commented 9 years ago

Now that tacl strip only strips markup from one-file-per-text XML, the new tacl prepare command (which joins multiple-files-per-text XML into a single file per text) can be expanded to handle the different formats of CBETA texts.
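
To illustrate the grouping step that any such joining has to do, here is a minimal sketch. It is not TACL's actual implementation: the function name `group_parts_by_text` and the filename pattern `<text id>_<part number>.xml` (e.g. `T01n0001_002.xml`) are assumptions made for the example, and the real CBETA sources may be laid out differently.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) filename pattern: <text id>_<part number>.xml,
# e.g. "T01n0001_002.xml". Real CBETA layouts may differ.
PART_PATTERN = re.compile(r'^(?P<text>.+)_(?P<part>\d+)\.xml$')


def group_parts_by_text(filenames):
    """Map each text ID to its part filenames, ordered by part number."""
    groups = defaultdict(list)
    for name in filenames:
        match = PART_PATTERN.match(name)
        if match is None:
            # Skip files that do not follow the assumed pattern.
            continue
        part_number = int(match.group('part'))
        groups[match.group('text')].append((part_number, name))
    # Sort numerically by part number so that parts are joined in order.
    return {text: [name for _, name in sorted(parts)]
            for text, parts in groups.items()}
```

Each resulting list could then be read in order and merged into one document per text; how the merge itself should work depends on which CBETA format is being handled.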

ajenhl commented 8 years ago

It appears that the repository at https://github.com/cbeta-git/xml-p5a/ is the better choice, judging by how recent its changes are; it should be the first (and perhaps only) new XML source to be supported.

ajenhl commented 8 years ago

Added support for the repository at https://github.com/cbeta-git/xml-p5/ in 5a87eec19100ab9f240e7e2593bdba551ede350a.

While this is not the repository mentioned in the previous comment, it is the one recommended to me, so hopefully it will suffice!