funderburkjim / testing

For testing various features of github. Nothing important here.

comment on abstract #4

Open funderburkjim opened 9 years ago

funderburkjim commented 9 years ago

There are many difficulties in commenting on this abstract:

With these caveats out of the way, I'll give my general impression of the work, followed by several comments. Most of these remarks pertain to the part of the abstract discussing the Monier-Williams English-Sanskrit Dictionary (MWE), since this dictionary is available among the Cologne digitizations.

general impression

I view very favorably the idea of developing a digitization with markup of key features. I hope this markup of these two dictionaries (or at least of MWE) is carried to completion and made publicly available. My impression is that most of the interesting work done at Hyderabad is closely guarded, and I hope this work will prove an exception by being publicly available.

I would rate the abstract as average (3 out of 5) since there are many issues which I think need to be addressed for the work to fulfill its goal of providing the basis for 'an effective word-search tool.'

What is the input?

The abstract does not tell us what the input is. Of course it is the MWE (and Apte), but in what form?

The answer to this question is of critical importance in evaluating the 'project design and development aspects of the electron lexical resource' which the abstract aims to summarize. If the authors input the data directly, at what point of the process was the markup added?
I think the ideal approach would be a two-step process:

  1. Develop a digitization with minimal markup. The markup at this point would be only what is needed to provide a digital imitation of the printed document. This is what the Cologne digitization mwe.txt currently provides.
  2. Add markup to the digitization. For experimental purposes, this could be done manually for a few representative words. However, for the corpus as a whole (e.g., the sample of 'm' words for MWE), this markup should be done by one or more computer programs.
    There are several advantages to programmatically applied markup:
    • The end result (digitization plus markup) is reproducible. In the 'hard sciences' (physics, geology, astronomy, etc.), there has been a recent emphasis on developing reproducible results.
      I think computational linguistics would benefit by adopting this emphasis.
    • The end result is extensible. That is, once a program is developed which is deemed to accurately apply markup to a sample (e.g., words beginning in 'm'), then the same program applied to the entire corpus (the whole dictionary) would likely supply markup to the entire dictionary. Indeed, one small but excellent idea of the authors is to restrict their initial phase of work to the m's.
    • The end result is modifiable. There are innumerable ways to add markup to a text. Dictionaries in particular are text documents with extremely complex structures. If the markup is done by a program, then future researchers may have a leg up in providing an alternative to, or an extension of, the markup chosen by the authors.
    • Using a program to apply markup requires that the program be able to identify salient features in the unmarked text. This discipline is superb at bringing the inconsistencies in the text to attention.
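
To make the two-step idea concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the entry shape "HEADWORD, pos. body", the tag names entry/hw/pos/body, and the sample lines are all invented, and none of them is taken from mwe.txt or from the authors' tag set. The point is only that one deterministic function applies the markup and that lines it cannot parse are collected rather than silently skipped.

```python
import re
from xml.sax.saxutils import escape

# ASSUMPTION: a toy entry shape "HEADWORD, pos. body".
# The real layout of mwe.txt is not described here and will differ.
ENTRY_RE = re.compile(
    r"^(?P<hw>[A-Z][A-Z'-]*), (?P<pos>[a-z]+\.) (?P<body>.+)$"
)


def mark_up(line):
    """Return hypothetical XML markup for one line, or None if it does not fit."""
    m = ENTRY_RE.match(line)
    if m is None:
        return None
    return "<entry><hw>{}</hw><pos>{}</pos><body>{}</body></entry>".format(
        escape(m.group("hw")), escape(m.group("pos")), escape(m.group("body"))
    )


def mark_up_corpus(lines):
    """Apply mark_up to every non-blank line.

    Returns (marked, rejects), where rejects lists (line number, text) for
    each line that did not match the expected shape.
    """
    marked, rejects = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        xml = mark_up(line)
        if xml is None:
            rejects.append((lineno, line))
        else:
            marked.append(xml)
    return marked, rejects


# Invented sample lines with placeholder bodies (not real dictionary text).
SAMPLE = [
    "MACE, s. equivalent-1, equivalent-2",
    "MAD, a. equivalent-3",
    "Madness equivalent-4",  # deliberately irregular: no comma, no pos
]

if __name__ == "__main__":
    marked, rejects = mark_up_corpus(SAMPLE)
    for xml in marked:
        print(xml)
    for lineno, text in rejects:
        print("line {} does not fit the pattern: {}".format(lineno, text))
```

Running the same function on the 'm' sample and later on the whole dictionary is what would make the result reproducible and extensible, and the list of rejected lines is where inconsistencies in the text would surface.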

A DTD for markup is needed

In the 'Tags used in e-lexicon of Monier Williams', the authors provide several examples of the markup. In general, the examples provide implementation details relating to the five 'conventions' mentioned in the Preface of MWE. It is a strong point of this abstract that the suggested markup relates to the dictionary author's stated intentions. However, the nature of this relation becomes fuzzy in the abstract. As one instance, there appears to be no markup corresponding to the 5th convention: 'Sanskrit meaning of an English synonym that has got a special meaning, is mentioned in single quotes.'

The relation between markup and conventions, and also the overall scope (vocabulary) of the markup, would be made much clearer by the presence of a DTD (or one of the other XML conventions for defining document structure). Some advantages of having a DTD are:
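
One thing an explicit document definition makes possible is mechanical checking of the markup vocabulary. The Python sketch below is a much weaker stand-in for a real DTD: it checks only which child elements may appear inside which parent, not their order or number. The tag names (entry, hw, pos, sense, quote) are hypothetical and are not the authors' tags; 'quote' is imagined here as a tag for the single-quoted special meanings of the 5th convention.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: a hypothetical content model, parent tag -> allowed child tags.
# A real DTD would also constrain order, repetition and attributes.
ALLOWED_CHILDREN = {
    "entry": {"hw", "pos", "sense"},
    "hw": set(),
    "pos": set(),
    "sense": {"quote"},
    "quote": set(),
}


def vocabulary_errors(xml_text):
    """Return a list of messages for elements outside the declared vocabulary."""
    root = ET.fromstring(xml_text)
    errors = []

    def visit(elem):
        allowed = ALLOWED_CHILDREN.get(elem.tag)
        if allowed is None:
            errors.append("unknown element <{}>".format(elem.tag))
            return
        for child in elem:
            if child.tag not in allowed:
                errors.append(
                    "<{}> not allowed inside <{}>".format(child.tag, elem.tag)
                )
            else:
                visit(child)

    visit(root)
    return errors


if __name__ == "__main__":
    good = ("<entry><hw>MACE</hw><pos>s.</pos>"
            "<sense>text <quote>special</quote></sense></entry>")
    bad = "<entry><hw>MACE</hw><gloss>text</gloss></entry>"
    print(vocabulary_errors(good))
    print(vocabulary_errors(bad))
```

With a real DTD, a validating XML parser would do this job and more; the sketch only illustrates that once the vocabulary is written down, every marked-up entry can be checked against it automatically.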

I hope this may prove useful to you, Dhaval. And I hope that the many criticisms are written so that they sound constructive, which was my intent.