There are many difficulties in commenting on this abstract:
The work which this abstract describes is unavailable, so it is hard to know how representative the
abstract's examples are of the digitization as a whole.
I have no experience in reviewing abstracts such as this, so have no preformed context for framing
comments.
With these caveats out of the way, I'll give my general impression of the work, followed by several comments. Most of these remarks pertain to the part of the abstract discussing the Monier-Williams English-Sanskrit Dictionary, since this dictionary is available among the Cologne digitizations.
General impression
I view very favorably the idea of developing a digitization with markup of key features. I hope this markup of these two dictionaries (or at least of MWE) is carried to completion and made publicly available. My impression is that most of the interesting work done at Hyderabad is closely guarded, and I hope this work will prove an exception by being publicly available.
I would rate the abstract as average (3 out of 5) since there are many issues which I think need to be addressed for the work to fulfill its goal of providing the basis for 'an effective word-search tool.'
What is the input?
The abstract does not tell us what the input is. Of course it is the MWE (and Apte), but in what form?
Did the authors input all the 'm' words directly from the printed manuscript?
Or, did they add markup to the existing Cologne digitization of MWE?
The answer to this question is of critical importance in evaluating the 'project design and development aspects of the electron lexical resource' which the abstract aims to summarize.
If the authors input the data directly, at what point of the process was the markup added?
I think the ideal approach would be a two-step process:
1. Develop a digitization with minimal markup; the markup at this point would only be that which
   is needed to provide a digital imitation of the printed document. This is currently what is
   provided by the Cologne digitization mwe.txt.
2. Add markup to the digitization. For experimental purposes, this could be done manually for
   a few representative words. However, for the corpus as a whole (e.g., the sample of 'm' words
   for MWE), this markup should be done by one or more computer programs.
There are several advantages to programmatically applied markup:
The end result (digitization plus markup) is reproducible. In the 'hard sciences' (physics, geology,
astronomy, etc.), there has been a recent emphasis on developing reproducible results.
I think computational linguistics would benefit by adopting this emphasis.
The end result is extensible. That is, once a program is developed which is deemed to
accurately apply markup to a sample (e.g., words beginning in 'm'), then the same program
applied to the entire corpus (the whole dictionary) would likely supply markup to the entire
dictionary. Indeed, one small but excellent idea of the authors is to restrict their initial phase of
work to the m's.
The end result is modifiable. There are innumerable ways to add markup to a text. Dictionaries
in particular are text documents with extremely complex structures. If the markup is done by
a program, then future researchers may have a leg up in providing an alternative to or extension
of the markup chosen by the authors.
Using a program to apply markup requires that the program be able to identify salient features
in the unmarked text. This discipline is superb in bringing to attention the inconsistencies in the
text.
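As a minimal sketch of the second step, the following Python code adds markup to one line of a minimally marked digitization. The sample line, the `<gram>` tag name, the list of case abbreviations, and the function names are all hypothetical; none of them is taken from mwe.txt or from the abstract. The sketch only illustrates that programmatic markup is reproducible, that the original text can be recovered by stripping the tags, and that notes the pattern fails to recognize surface as candidate inconsistencies.

```python
import re

# Hypothetical pattern: a parenthesized case note such as "(acc. c.)".
# The abbreviations listed here are assumptions for illustration only.
GRAM_NOTE = re.compile(r"\((?:nom|acc|inst|dat|abl|gen|loc)\. c\.\)")


def add_markup(line: str) -> str:
    """Wrap each recognized case note in a <gram> tag, keeping its text."""
    return GRAM_NOTE.sub(lambda m: f"<gram>{m.group(0)}</gram>", line)


def strip_markup(marked: str) -> str:
    """Remove the tags added by add_markup, recovering the original line."""
    return re.sub(r"</?gram>", "", marked)


def unmatched_notes(line: str) -> list[str]:
    """List parenthesized notes the pattern did not recognize."""
    notes = re.findall(r"\([^)]*\)", line)
    return [n for n in notes if not GRAM_NOTE.fullmatch(n)]


# Hypothetical line; the second note lacks a period after "acc".
sample = "ME, pron. (acc. c.) mAM ; (acc c.) mA"
marked = add_markup(sample)
print(marked)
print(unmatched_notes(sample))
```

Here the second note is left unmarked and is reported by `unmatched_notes`, which is the kind of inconsistency a program brings to attention.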
A DTD for markup is needed
In the 'Tags used in e-lexicon of Monier Williams', the authors provide several examples of
the markup. In general, the examples provide implementation details relating to the five
'conventions' mentioned in the Preface of MWE. It is a strong point of this abstract that
the suggested markup relates to the dictionary author's stated intentions.
However, the nature of this relation becomes fuzzy in the abstract. As one instance, there
appears to be no markup corresponding to the 5th convention:
Sanskrit meaning of an English synonym that has got a special meaning, is mentioned in single quotes.
This relation between markup and conventions, and also the overall scope (vocabulary) of the markup, would be much clearer with a DTD (or one of the other conventions in
XML for defining document structure). Some advantages of having a DTD are:
Validation that the end document conforms to the standard represented in the DTD. Otherwise,
it is harder for another user (such as one writing a search program for the marked-up
lexicon) to know how to process the document.
Exposure of the breaks with convention. Based upon work on the MW Sanskrit-English dictionary,
I have the expectation that the English-Sanskrit dictionary will have various features that do not
conform to the conventions mentioned in the Preface. The validation process will bring out such
instances.
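The Python standard library does not validate against a DTD, so the sketch below swaps in a much simpler check: it parses an entry and reports any element name outside a declared vocabulary. The tags `skt` and `case` come from the abstract's example; the `entry` wrapper, the `ALLOWED_TAGS` set, and the `check_vocabulary` name are assumptions made for illustration. A real DTD would also constrain nesting and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary; a DTD would declare this and the allowed nesting.
ALLOWED_TAGS = {"entry", "skt", "case"}


def check_vocabulary(xml_text: str) -> list[str]:
    """Return the sorted names of elements that are not in ALLOWED_TAGS."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    return sorted({el.tag for el in root.iter() if el.tag not in ALLOWED_TAGS})


good = '<entry><skt><case value="accusative">mAM</case></skt></entry>'
bad = '<entry><skt><kase value="accusative">mAM</kase></skt></entry>'

print(check_vocabulary(good))
print(check_vocabulary(bad))
```

Run over the whole lexicon, a check of this kind lists every entry that departs from the declared structure.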
Various small comments
&c. should be &amp;c. to represent '&c.' in an XML document.
In the example under 'Me', the original text (acc. c.) is lost.
<skt><case value="accusative"> मरक ; </case>
Side note: the PDF correctly displays the Devanagari for slp1 mAM. However,
when I copy and paste it into this note, the representation is (as you see) 'maraka'. Is the
Devanagari in the PDF encoded in something other than UTF-8?
It would be better to maintain the original text (outside of tags and attributes), such as
In the example under 'To mean', there are several issues:
Here is the representation from the abstract (again, I have used slp1 since Devanagari doesn't copy/paste properly from this PDF).
The scope of the <skt> tag includes non-Sanskrit material; also, the textual 'c.2.' is
absent, and its replacement, <VergConjugation gaNa="2">, is outside the parentheses.
The following is suggested as a better markup:
The 3rd-singular form of the verb (which is mentioned in the
4th convention) is unmarked; it would be more complete (and conformant to the conventions)
to mark this with something like
Keep original headword form in Apte
The authors state:
It was decided to change this pattern and give only प्रातिपदिकम् of a given word, so that suitable links
can be given from the dictionary to other dictionaries.
I think the original form should be maintained (again, this can be viewed as a corollary of the
principle that markup should be done so that the original text may be retrieved).
This could easily be accomplished by the following markup, which would retain the
useful property (facilitation of suitable links to other dictionaries):
<lexhead no="5">maRikaH<pratipadikam value="maRika" /> <cat>Noun s</cat>
(I assume the original headword was the 1s form maRikaH)
(Of course, both would (or could) be in Devanagari.)
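A short Python sketch of how a program could use markup of this shape: the element text still yields the original headword, while the attribute supplies the key for linking to other dictionaries. The closing `</lexhead>` tag and the function name are assumptions added so that the fragment parses; the full entry structure used by the authors is not known here.

```python
import xml.etree.ElementTree as ET

# The suggested markup, closed with </lexhead> (an assumption) so that it parses.
fragment = (
    '<lexhead no="5">maRikaH<pratipadikam value="maRika" /> '
    '<cat>Noun s</cat></lexhead>'
)


def headword_and_stem(xml_text: str) -> tuple[str, str]:
    """Return (original headword text, stem stored for cross-dictionary links)."""
    head = ET.fromstring(xml_text)
    original = (head.text or "").strip()  # text before the first child element
    stem = head.find("pratipadikam").get("value")  # stem kept in an attribute
    return original, stem


print(headword_and_stem(fragment))
```

Because the stem lives in an attribute, the printed headword form stays in the text and nothing from the original is lost.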
Be sure Devanagari is in UTF-8 Unicode form
This comment arises from the observed difficulty in cutting/pasting the Devanagari from the PDF.
In the implementation of the marked-up lexicon, the coding of Devanagari should be in the
most common form (so it may be recognized by other software). The Devanagari could be
coded in wx (which Hyderabad uses elsewhere). If left in Devanagari, the coding should be in
UTF-8 Unicode form so that standard transcodings to other forms (e.g., wx, slp1) may be done by
existing transcoding software.
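To illustrate why standard Unicode matters, here is a deliberately partial Python sketch of a table-driven Devanagari-to-slp1 transcoder. It covers only the handful of characters needed for the two words mentioned in this note (mAM and maRikaH); the function name and tables are assumptions for illustration, and in practice existing transcoding software should be used. The point is only that such a table works when the input uses the standard Devanagari code points.

```python
# Partial tables: only the characters needed for the two sample words.
CONSONANTS = {
    "\u0915": "k",  # DEVANAGARI LETTER KA
    "\u0923": "R",  # DEVANAGARI LETTER NNA
    "\u092e": "m",  # DEVANAGARI LETTER MA
}
VOWEL_SIGNS = {
    "\u093e": "A",  # DEVANAGARI VOWEL SIGN AA
    "\u093f": "i",  # DEVANAGARI VOWEL SIGN I
}
MARKS = {
    "\u0902": "M",  # DEVANAGARI SIGN ANUSVARA
    "\u0903": "H",  # DEVANAGARI SIGN VISARGA
}
VIRAMA = "\u094d"  # suppresses the inherent vowel 'a'


def to_slp1(text: str) -> str:
    """Transcode Unicode Devanagari to slp1 for the characters in the tables."""
    out = []
    pending_a = False  # True while the last consonant still owes its inherent 'a'
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False
        else:
            if pending_a:
                out.append("a")
                pending_a = False
            out.append(MARKS.get(ch, ch))  # unknown characters pass through
    if pending_a:
        out.append("a")
    return "".join(out)


# Round trip through UTF-8 bytes, as the text would be stored in a file.
raw = "\u092e\u093e\u0902".encode("utf-8")
print(to_slp1(raw.decode("utf-8")))
print(to_slp1("\u092e\u0923\u093f\u0915\u0903"))
```

Text stored in a legacy font encoding would not contain these code points in the expected order, so a table like this would produce wrong output for it.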
Final word
I hope this may prove useful to you, Dhaval. And I hope that the many criticisms are written so
that they sound constructive, which was my intent.