distantreading / WG1

Discussion documents and working papers from WG1

canonical criterion #6

Closed: lb42 closed this issue 6 years ago

lb42 commented 6 years ago

"at least 30% are highly canonized novels, at least 30% should be non-canonized novels, based on reprints two groups: 1 Group: canon. reprinted more than once 2. Group: non-canon. not or once reprinted, period: 1980-2000"

I think we could make this a bit more objective and also improve its granularity:

canon0 - not reprinted during the author's lifetime
canon1 - reprinted up to three times after the author's death
canon2 - reprinted more than three times after the author's death

where "reprinted" means "published as a new edition", e.g. by a different publisher or in a different form. I feel that we shouldn't count the digitization of a text (e.g. in Google Books) as a reprint, but maybe that is mistaken. The texts that appear in Gutenberg or other digital collections are in some sense chosen for their canonicity, or acquire it by virtue of appearing there.

lb42 commented 6 years ago

The sampling document also talks about "reprint count":

Reprint count: We propose to use the number of times a work is reprinted as an objective measure of its reception, using categories like the following:
low: reprinted less than 10 times
medium: reprinted 10 to 100 times
high: reprinted more than 100 times

Is this the same criterion or a different one? Which one do we want to use?
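To make the comparison concrete, here is a minimal sketch of both schemes side by side. The function names `canon_group` and `reprint_category` are hypothetical, and the sketch assumes the reprint count has already been restricted to whatever period applies; neither assumption comes from the sampling document.

```python
def canon_group(reprints: int) -> str:
    """Two-group rule: reprinted more than once = canon, otherwise non-canon."""
    if reprints < 0:
        raise ValueError("reprint count cannot be negative")
    return "canon" if reprints > 1 else "non-canon"


def reprint_category(reprints: int) -> str:
    """Three-way rule: less than 10 = low, 10 to 100 = medium, more than 100 = high."""
    if reprints < 0:
        raise ValueError("reprint count cannot be negative")
    if reprints < 10:
        return "low"
    if reprints <= 100:
        return "medium"
    return "high"


# Print both labels for a few illustrative (made-up) reprint counts.
for n in (0, 1, 5, 10, 100, 101):
    print(n, canon_group(n), reprint_category(n))
```

Under these readings the two schemes do not line up: a novel reprinted 5 times would count as "canon" in the two-group rule but as "low" in the three-way reprint count.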

CarolinOdebrecht commented 6 years ago

Thank you very much! We adjusted this criterion during the WG meeting in Prague (similar to the length criterion). We decided to use the following: at least 30% are highly canonized novels, at least 30% should be non-canonized novels, based on reprints. Two groups: (1) canon: reprinted more than once; (2) non-canon: not reprinted or reprinted once. Period: 1980-2000. I updated the sampling document accordingly.

As for the granularity of this criterion: this might make things much more complex regarding the metadata (are we able to find this out?) and the selection of texts. What if a text is not reprinted during the author's lifetime but is reprinted up to three times after the author's death? That text would fall into both canon0 and canon1. As for not including the digitization of a text in the reprint count: I think this is uncontroversial. I added a note in the sampling document to be more explicit.
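The overlap can be shown with a small sketch that applies the three proposed definitions literally. The function `matching_labels` and its parameters are hypothetical, and reading "up to three times" as 1 to 3 posthumous reprints is an assumption, since the proposal does not say whether zero is included.

```python
def matching_labels(lifetime_reprints: int, posthumous_reprints: int) -> list:
    """Return every proposed label whose definition a text satisfies."""
    labels = []
    if lifetime_reprints == 0:
        labels.append("canon0")  # not reprinted during the author's lifetime
    if 1 <= posthumous_reprints <= 3:
        labels.append("canon1")  # assumption: "up to three" means 1 to 3
    if posthumous_reprints > 3:
        labels.append("canon2")  # more than three times after the author's death
    return labels


# No lifetime reprints, two posthumous reprints: two labels apply at once.
print(matching_labels(lifetime_reprints=0, posthumous_reprints=2))
# → ['canon0', 'canon1']
```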

lb42 commented 6 years ago

Sorry, but this is still not clear to me. If 30% are "highly canonized" and 30% are non-canonized, what are the remaining 40%?

christofs commented 6 years ago

Each of the two groups must make up at least 30% of the collection, but either one could be larger. So 40/60 is in keeping with this criterion, as is 30/70 or 50/50. We just don't want to have less than 30% of either type.
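As a rough sketch of how such a minimum could be checked against a candidate selection (the function `meets_minimum_shares` and the label strings are hypothetical, not part of the sampling document):

```python
from collections import Counter


def meets_minimum_shares(labels, groups, min_percent=30):
    """True if every group in `groups` makes up at least `min_percent` % of `labels`."""
    total = len(labels)
    if total == 0:
        return False
    counts = Counter(labels)
    # Integer arithmetic avoids floating-point surprises at exactly 30%.
    return all(counts[g] * 100 >= min_percent * total for g in groups)


GROUPS = ("canon", "non-canon")
for canon, non_canon in ((40, 60), (30, 70), (50, 50), (20, 80)):
    sample = ["canon"] * canon + ["non-canon"] * non_canon
    print(f"{canon}/{non_canon}:", meets_minimum_shares(sample, GROUPS))
```

The first three splits pass and 20/80 fails, which matches the reading that the percentages are lower bounds rather than targets.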

lb42 commented 6 years ago

Thanks for the clarification. So the percentages are only intended as minima: e.g. a balance of 30% non-canonical, 70% canonical would be OK, as would the reverse. Sorry to be dim.

Presumably the same principle should apply to the length criterion, i.e. there should be a minimum of 20% for each of the three length categories identified (short, medium, long) though the document doesn't say this.

christofs commented 6 years ago

Right! I also agree that we should clarify how this principle of "at least x %" applies to the length criterion. I like your suggestion of "at least 20% from each of the three length categories".

CarolinOdebrecht commented 6 years ago

The length criterion is described as: "at least 20% are short novels (10-50k word tokens), at least 20% are long novels (>200k word tokens)"
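For illustration only, the length bands could be sketched as below. The function `length_category` is hypothetical. Two points are assumptions, since the quoted text only defines short and long: the "medium" band is taken to be everything between 50k and 200k word tokens, and texts under 10k word tokens are treated as outside the scheme.

```python
from typing import Optional


def length_category(word_tokens: int) -> Optional[str]:
    """Map a word-token count to a length band, or None if below the short range."""
    if word_tokens < 10_000:
        return None  # assumption: below the stated 10-50k range, so not classified
    if word_tokens <= 50_000:
        return "short"  # 10-50k word tokens
    if word_tokens <= 200_000:
        return "medium"  # assumption: the band between short and long
    return "long"  # >200k word tokens


# Illustrative (made-up) token counts.
for tokens in (8_000, 10_000, 50_000, 120_000, 200_000, 200_001):
    print(tokens, length_category(tokens))
```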