Open richardofsussex opened 2 years ago
I think this approach is very interesting. It would bring more structure to our Projects. Some of the ideas are already incorporated into FreeComETT, and it would be worth mapping what's already there against Richard's outline above.
A properly abstract model of the transcription activity would help us produce a framework which is as generic as possible. I suggest that this model should go beyond the immediate online transcription task. We need not implement the full model, but having it may help us to integrate our transcription work into our systems more efficiently.

For example, starting in the middle with the page we want to transcribe, we can define a 'page' object. For a transcribable page, there will be one or more images of that page, available online. We can define the physical 'work' of which the page forms a part (in the case of a book format). This will have properties such as its pages (and/or just the number of pages), and the location and ownership of physical copies of the work. Each 'work' can be part of a larger 'resource', being a set of works with a common theme/purpose. Resources will have properties such as availability, licensing situation, etc.

These higher-level objects are all things we have an active interest in, and which we try to record information about. We want to know what resources are available for transcription, now or in the future, and on what terms. For individual works, we want to manage the process of scanning, and know which pages have scans available. We need to allocate pages (or batches of pages - a new object type) to transcribers. This approach means that we create 'page' records which are placed within a context.

Within each page, we need a generalized model which can cope with the variety of formats we expect to find. The commonest pattern is a table/grid, with headings across the top and one or more data rows, but other patterns occur too. For example, there may be a textual introduction, footnotes, or images. I would expect page 'templates' to be useful in situations where pages have a regular format.
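To make the resource / work / page / batch hierarchy above a little more concrete, here is a minimal plain-Ruby sketch. Every class, field and method name in it (`Resource`, `Work`, `Page`, `Batch`, `allocate_batch`, `holding_location`, and so on) is hypothetical and chosen only for illustration; none of it is taken from FreeComETT or any existing codebase, and a real version would presumably live in the Rails models rather than in bare Structs.

```ruby
# Hypothetical sketch only: names and fields are illustrative assumptions.
Resource = Struct.new(:title, :licence, :works, keyword_init: true)
Work     = Struct.new(:title, :holding_location, :pages, keyword_init: true)
Page     = Struct.new(:number, :image_urls, :assigned_to, keyword_init: true) do
  # A page is transcribable only once at least one image of it exists.
  def scanned?
    !image_urls.empty?
  end
end
Batch = Struct.new(:pages, :transcriber, keyword_init: true)

# Allocate every scanned, not-yet-assigned page of a work to one transcriber
# and return the resulting batch.
def allocate_batch(work, transcriber)
  pages = work.pages.select { |p| p.scanned? && p.assigned_to.nil? }
  pages.each { |p| p.assigned_to = transcriber }
  Batch.new(pages: pages, transcriber: transcriber)
end

work = Work.new(
  title: "Example register",
  holding_location: "Example record office",
  pages: [
    Page.new(number: 1, image_urls: ["https://example.org/p1.jpg"]),
    Page.new(number: 2, image_urls: []), # not scanned yet
    Page.new(number: 3, image_urls: ["https://example.org/p3.jpg"])
  ]
)
resource = Resource.new(title: "Example collection", licence: "unknown", works: [work])
batch = allocate_batch(work, "transcriber_a")
puts batch.pages.map(&:number).inspect
```

Because each page sits inside a work, and each work inside a resource, questions such as "which pages of this work still need scanning?" become simple queries over the hierarchy rather than separate bookkeeping.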
There needs to be a mechanism for 'marking' areas of the page image and associating them with a logical description. The most obvious example is to mark an image that appears on the page, but it might also be useful to mark a table of entries (or even individual entries) so that the relevant source image can easily be shown alongside the transcribed data.

Logical data units are not always neatly contained within physical pages. There needs to be some way of [virtually] joining together the parts of a logical data unit which span one or more page boundaries.

Once we get down to the data we are really interested in, there should be a close mapping to our data structures. This will allow us to provide online validation, suggestions, etc. Access to this support will be provided via the Rails app/model rules. Conversely, the flexible, schema-less nature of MongoDB means that transcribers can be allowed to add non-standard data freely; it will simply not be subject to the same controls and indexing as 'core' data.
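As a rough illustration of the last three points (marked regions, logical units that cross page boundaries, and 'core' versus non-standard data), here is a small plain-Ruby sketch. All names (`Region`, `LogicalUnit`, `CORE_RULES`) are hypothetical, and the two validation rules, including the year range, are arbitrary assumptions made for the example; in practice the rules would come from the Rails app/model layer and the records would be stored in MongoDB.

```ruby
# Hypothetical sketch only: names, fields and rules are illustrative assumptions.

# A rectangular area marked on the image of one page.
Region = Struct.new(:page_number, :x, :y, :width, :height, keyword_init: true)

# A logical data unit (for example one table entry) tied to one or more
# regions, which may lie on different pages.
class LogicalUnit
  # Assumed 'core' fields and their checks; anything else is free-form.
  CORE_RULES = {
    "surname" => ->(v) { v.is_a?(String) && !v.strip.empty? },
    "year"    => ->(v) { v.is_a?(Integer) && v.between?(1500, 2100) }
  }.freeze

  attr_reader :regions, :data

  def initialize(regions:, data:)
    @regions = regions
    @data = data
  end

  # True when the unit has been joined together from more than one page.
  def spans_pages?
    regions.map(&:page_number).uniq.size > 1
  end

  # Names of core fields that are present but fail their rule.
  def invalid_core_fields
    CORE_RULES.select { |field, rule| data.key?(field) && !rule.call(data[field]) }.keys
  end

  # Non-standard fields: kept as entered, never validated.
  def extra_fields
    data.keys - CORE_RULES.keys
  end
end

entry = LogicalUnit.new(
  regions: [
    Region.new(page_number: 12, x: 40, y: 900, width: 700, height: 60),
    Region.new(page_number: 13, x: 40, y: 80, width: 700, height: 45)
  ],
  data: { "surname" => "Smith", "year" => 18, "marginal_note" => "ink blot" }
)
puts entry.spans_pages?
puts entry.invalid_core_fields.inspect
puts entry.extra_fields.inspect
```

Keeping the regions alongside the data is what would let the interface show the relevant part of the source image next to each transcribed entry, while the split between `CORE_RULES` and everything else mirrors the idea that only core data is validated and indexed.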