distantreading / WG1

Discussion documents and working papers from WG1

reprint count for medium and high #27

Closed CarolinOdebrecht closed 4 years ago

CarolinOdebrecht commented 5 years ago

We have different regulations on the reprint count:

The ODD says:

                                <valItem ident="high">
                                    <desc>text has been republished very frequently since its
                                        original appearance</desc>
                                </valItem>
                                <valItem ident="medium">
                                    <desc>text has been republished occasionally since its
                                        original appearance</desc>
                                </valItem>
                                <valItem ident="low">
                                    <desc>text has not been reprinted since its original
                                        appearance</desc>
                                </valItem>

The sampling document says:

                <item>low: no reprints at all,</item>
                <item>medium: reprinted once,</item>
                <item>high: reprinted more than once,</item>
                <item>We will not include digitizations of texts in the reprint
                    count.</item>

This is crucial for "medium" and "high". The sampling document's version is more restrictive but clearer. It might be easier to just approximate reprint counts. We cannot assume that "occasionally" means the same thing for every language.
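
For comparison, the sampling document's rule quoted above can be written down as a tiny function. This is only an illustrative sketch: the function name `reprint_category` is hypothetical, and it assumes the caller has already excluded digitizations from the count, as the sampling document requires.

```python
def reprint_category(reprints: int) -> str:
    """Map a number of reprints (digitizations excluded) to a category,
    following the sampling document's wording quoted above."""
    if reprints < 0:
        raise ValueError("reprint count cannot be negative")
    if reprints == 0:
        return "low"      # no reprints at all
    if reprints == 1:
        return "medium"   # reprinted once
    return "high"         # reprinted more than once
```

The ODD wording ("occasionally", "very frequently") cannot be reduced to fixed thresholds in this way, which is where the two documents diverge.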

CarolinOdebrecht commented 5 years ago

This is linked with https://github.com/distantreading/WG1/issues/21

lb42 commented 5 years ago

Certainly the ODD and the documentation should be in step! We don't seem to have yet reached any clear consensus on what counts as "high" or "low", and the numbers are likely to be different in different contexts anyway. We do however agree that we need those two values at least.

I propose to

At present the schema also allows for "medium". Should we keep that, or change it to "unmarked" if used?

CarolinOdebrecht commented 5 years ago

A binary decision is easier to handle. Introducing a category "unmarked" is also a good idea.

lb42 commented 5 years ago

So, at present, we allow high, low, unmarked, and unspecified. But if the value is unmarked it is ipso facto unspecified. And if after doing their best an encoder can only say something is unspecified, the effect for the user is just the same as if it was unmarked. The two are effectively synonymous. Since we use "unspecified" elsewhere, I propose to remove "unmarked" from the list of possible values and make the headChecker script convert any "medium" or "unmarked" values into "unspecified".
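
The proposed conversion could look roughly like the following. This is a minimal sketch in Python, not the actual headChecker script: the element name `reprintCount`, the attribute name `key`, and the function name `normalize_reprint_count` are assumptions made for illustration and should be checked against the schema.

```python
import xml.etree.ElementTree as ET

# Values that the proposal folds into "unspecified".
LEGACY_VALUES = {"medium", "unmarked"}


def normalize_reprint_count(root, tag="reprintCount", attr="key"):
    """Rewrite legacy values to "unspecified" in place.

    `tag` and `attr` are assumed names, not taken from the ODD.
    Returns the number of elements that were changed.
    """
    changed = 0
    for el in root.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        # Compare on the local name so a namespace prefix does not matter.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name == tag and el.get(attr) in LEGACY_VALUES:
            el.set(attr, "unspecified")
            changed += 1
    return changed


sample = ET.fromstring(
    '<textDesc><reprintCount key="medium"/><reprintCount key="high"/></textDesc>'
)
normalize_reprint_count(sample)
```

After the call, only the first element in `sample` is rewritten; "high" and "low" are left untouched.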

CarolinOdebrecht commented 5 years ago

Ok.

lb42 commented 4 years ago

Closing this, as we are in agreement!