ajenhl / tacl

Tool for performing basic text analysis on the CBETA corpus
GNU General Public License v3.0

Capturing variants #4

Closed: fudaizhi closed this issue 9 years ago

fudaizhi commented 11 years ago

Include in database (and tacl results) ngrams incorporating variant readings noted in the Taisho critical apparatus.

ajenhl commented 11 years ago

I don't know what an appropriate handling of these variant readings would be. There are two main issues:

  1. Whether and how to incorporate the variant reading into its context. Given "...abcABCabc...", with ABC having a variant DEF, should the n-grams cD, bcD, abcD, cDE, bcDE, abcDE, Ca, Cab, etc. be added, up to the maximum size of n-gram generated? And what if these n-grams overlap another variant?
  2. There is the potential for confusion, since nothing in the results will distinguish between an n-gram including (some of) a variant and one that the variant replaced. This might mask significant distinctions between texts, if one of those texts had a set of variants that brought it more into line (or out of line) with the other.
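The first point can be made concrete with a small sketch. This is not tacl's actual code, and the function name is invented; it simply enumerates the n-grams that a variant reading contributes once it is substituted into its context, as in the "...abcABCabc..." example above.

```python
# Sketch (illustrative only): enumerate the n-grams that a variant
# reading introduces when substituted into its surrounding context.

def variant_ngrams(before, variant, after, max_n):
    """Yield every n-gram (2 <= n <= max_n) that includes at least one
    character of the variant reading, in context."""
    text = before + variant + after
    start = len(before)              # first index of the variant
    end = start + len(variant)      # one past its last index
    seen = set()
    for n in range(2, max_n + 1):
        for i in range(len(text) - n + 1):
            # keep only n-grams that overlap the variant span
            if i < end and i + n > start:
                seen.add(text[i:i + n])
    return seen

# variant_ngrams("abc", "DEF", "abc", 3) contains "cD", "bcD", "cDE",
# "DEF", "EFa", "Fab", ... but not "ab", which never touches the variant.
```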

One possible approach is to generate an entirely new text or texts to capture the variants. This could be transparent to catalogue files (all extra texts would be automatically selected by the name of the base text), but the results could indicate that a particular n-gram came from a variant text.

fudaizhi commented 11 years ago

1a. Yes. N-grams containing variants should be listed, and counted, as separate entities. This means all real n-grams, i.e. all combinations of variants that are each actually attested (in their entirety) in some single witness for a given n-gram length (see next).

1b. Note that the Taisho only lists variants from a finite number of alternate witnesses, and the witness is always identified in the apparatus. It would be necessary to incorporate consideration of the source into the process of accounting for variants, so that variant readings from, say, the Ming version were only combined with other readings from the Ming version when counted as alternate n-grams. We should not, by contrast, generate fictitious n-grams: e.g. aBcD, where B is a variant attested only in the Song and D a variant attested only in the Ming. This constraint should reduce the number of permutations we need to count where a stretch of text short enough to form an n-gram contains more than one variant.

2a. There are genuine problems here, yes. The variant n-gram should not replace the n-gram representing the Taisho reading, but rather should be listed alongside it. The most significant problem I see is how to count variants: abcd and Abcd might each be counted as a type of overlap/difference with some point of comparison, where they both represent "the same" stretch of text; but they do not have the same status, for purposes of comparison, as a 4-gram for which there are no variants (i.e. one attested uniformly in all witnesses).

2b. The current method, which looks only at the Taisho readings, also risks masking significant information about relationships between texts: missing matches, where the match is only with the variant reading; and missing differences, where the difference is only apparent in the variant reading.

3a. The idea of automatically generating (reconstituting) full variant texts (Song, Yuan, Ming etc.), as they are accounted for by the Taisho apparatus (which reportedly contains errors, sometimes plentifully), is quite powerful.

3b. However, we would again confront the problem of how to count matches/differences. One possibility, which could lead to major distortions in our counts, would be to count each variant version of a text (Song, Yuan, Ming etc.) separately. Our total counts for a given string, especially where there were NOT variants (so that all witnesses agreed and multiplied one another's data), would then be massively inflated.

3c. Presuming that the majority of texts contain at least some variants in each witness, this strategy would require generating a separate text for each witness, multiplying the overall corpus several times in size. That would presumably mean a much larger database, with corresponding increases in run times, etc.
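The per-witness constraint in 1b can be sketched as follows. The data layout and function name are invented for illustration: a stretch of text is modelled as a sequence of plain segments and variant sites, and each witness's reading is realised separately, so a Song-only variant is never combined with a Ming-only variant into a fictitious n-gram.

```python
# Sketch of the per-witness constraint: realise a stretch of text once
# per witness rather than taking the cross product of variant sites.

def witness_readings(segments, witnesses):
    """segments: a list of either a plain string (no variants) or a
    dict mapping witness siglum -> reading, with '大' as the base text.
    Returns the attested reading of the stretch for each witness."""
    readings = {}
    for w in witnesses:
        parts = []
        for seg in segments:
            if isinstance(seg, str):
                parts.append(seg)
            else:
                # fall back to the base (Taisho) reading when this
                # witness is not cited at the variant site
                parts.append(seg.get(w, seg["大"]))
        readings[w] = "".join(parts)
    return readings

# B is attested only in the Song, D only in the Ming:
segments = ["a", {"大": "b", "宋": "B"}, "c", {"大": "d", "明": "D"}]
print(witness_readings(segments, ["大", "宋", "明"]))
# {'大': 'abcd', '宋': 'aBcd', '明': 'abcD'} -- the fictitious aBcD
# never appears, since no single witness attests it.
```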

ajenhl commented 11 years ago

3b and 3c need not occur. I am envisaging the n-gram generation process creating the set of base text plus variant texts, creating n-grams for each of them, and then storing all of the n-grams for the base text and only those n-grams whose counts differ for each of the variants. So if the base text had 23 instances of ABC, and the variant had 24, ABC would be stored with a count of 23 for the base text and with a count of 1 for the variant text; while if they each had 12 instances of DEF, the base text would have a record of those 12 instances and the variant would not list DEF at all. The counts would therefore not be distorted.

A query would specify, as usual, the source of the n-gram, but with an extra field stating the variant, if the result does not occur in the base text.

I'm not sure what the best way is of handling an n-gram that occurs fewer times in a variant than in the base text; I guess a negative count is in keeping with the approach above for the opposite situation.
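The difference-only storage scheme described above, including negative counts for n-grams that a variant has fewer of, can be sketched like this (an illustrative sketch, not tacl's implementation; the function name is invented):

```python
# Sketch of difference-only storage: full counts for the base text,
# and for each variant witness only the signed difference from the
# base, so identical counts add no rows and fewer occurrences yield
# a negative count.

from collections import Counter

def diff_counts(base_counts, variant_counts):
    """Return {ngram: variant_count - base_count} for every n-gram
    whose count differs between the base and variant texts."""
    diffs = {}
    for ngram in set(base_counts) | set(variant_counts):
        d = variant_counts.get(ngram, 0) - base_counts.get(ngram, 0)
        if d != 0:
            diffs[ngram] = d
    return diffs

base = Counter({"ABC": 23, "DEF": 12, "XYZ": 3})
variant = Counter({"ABC": 24, "DEF": 12, "XYZ": 2})
diffs = diff_counts(base, variant)
# ABC: +1 (one extra instance), XYZ: -1 (one fewer); DEF, whose count
# is identical in both, is not stored for the variant at all.
```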

2a is, at least at this stage, a problem that tacl is not going to solve. Since tacl is dealing with n-grams and their counts, and has no information about specific instances of an n-gram, it cannot make it clear that when there are three matches on XYZ, one of them has a variant ABC. It would, with negative variant counts, at least be able to show that one of those instances is not present in a variant.

Does that seem a reasonable and correct approach?

fudaizhi commented 11 years ago

Yes. It would be great to have tacl do any of this, and this is a wishlist, so I have no expectation it will do it all. I would never have thought of some of these economical ways of handling parts of the problem.

ajenhl commented 11 years ago

Given the following XML:

<app n="0216003"><lem wit="【大】">若</lem><rdg resp="Taisho" wit="【宋】【元】【明】">苦</rdg></app>

we are currently taking 若 as the main/accepted reading. Presumably this will hold true when support for variants is added, leading to the following n-gram counts:

main: 若: x
宋: 若: x-1
元: 若: x-1
明: 若: x-1
main: 苦: y
宋: 苦: y+1
元: 苦: y+1
明: 苦: y+1

Are 宋, 元, and 明 suitable as identifiers for the different sources of various readings? Not that I can see any other option, given what the XML provides.
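Extracting those witness sigla from an `<app>` entry like the one above can be done with the Python standard library alone. This is a minimal sketch, not tacl's actual parsing code, and it assumes the `wit` attribute always uses the 【…】 bracket format shown in the example:

```python
# Sketch: read one <app> entry of the Taisho apparatus, extracting the
# base (lem) reading and each witness's variant (rdg) reading.

import re
import xml.etree.ElementTree as ET

xml = ('<app n="0216003"><lem wit="【大】">若</lem>'
       '<rdg resp="Taisho" wit="【宋】【元】【明】">苦</rdg></app>')

app = ET.fromstring(xml)
base_reading = app.find("lem").text          # 若
readings = {}
for rdg in app.findall("rdg"):
    # split 【宋】【元】【明】 into individual witness sigla
    for wit in re.findall(r"【(.+?)】", rdg.get("wit")):
        readings[wit] = rdg.text

print(base_reading, readings)
# 若 {'宋': '苦', '元': '苦', '明': '苦'}
```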

fudaizhi commented 11 years ago

Yes, it is correct to take 若, i.e. the reading marked 大, as the main reading.

Yes, 宋, 元, and 明 are suitable identifiers for the alternate sources. There are also others, the main one being 宮; several more occur less frequently.

ajenhl commented 9 years ago

I don't think my scheme above is suitable as it stands. Specifically, I think that given the reduce and extend functionality of tacl report, the output of a query must include the full data for each variant. Now, this could be generated automatically as part of a query from the '+1, -2, no difference' data that I proposed, but that would require a whole stack of additional queries and calculations.

On the other hand, as you point out, multiplying the size of the database several-fold is not ideal either. It is, however, the simplest and most elegant approach, leaving the summarising of variant differences to a report function. I'm hopeful that the increase in database size won't make queries much slower; if it does, I may need to look into using a different database system. My main concern is just the space requirement: users will need a few hundred gigabytes free to fit all this!

ajenhl commented 9 years ago

All of the changes in the variants branch have been merged into master, supporting multiple witnesses per text.