delph-in / erg

English Resource Grammar
MIT License

Release notes for stable version "ERG 2023"

Highlights:

  • Improved overall syntactic coverage on Redwoods profiles to 93.77% on 100K items.
  • Improved parse selection by about 1% using the new redwoods.mem model.
  • Improved overall parsing efficiency by about 20%.

2021-12-14 - Added files for Singlish dialect, authored by Siew Yeng Chow based on her Master's thesis at NTU.

2022-07 - Incorporated changes to enable chart-mapping in LKB-FOS, thanks to John Carroll.

2022-10 - Adopted Emerson-Turing construction types for appending SLASH, with thanks to Guy Emerson and John Carroll.

2022-11 - Improved Version.lsp, METADATA, and grammar-loading files for better interface with LTDB, thanks to Francis Bond.

Discarded erg/etc/rules.hds, because the erg.hds file is now generated each time the grammar is loaded into the LKB.

Release notes for stable version "ERG 2020"

Punctuation marks now separate tokens

Full Redwoods treebank update

Documentation strings throughout the grammar

In trunk, as an interim update,

Release notes for stable version "ERG 2018"


Token mapping



Platforms and applications

Release notes for trunk version 2016-09-27

Full-forest treebanking of the Redwoods profiles (and eventually WSJ as well) is now underway, with minor grammar corrections being made along the way.

Release notes for trunk version 2015-06-19

[After a long hiatus, returning to commenting on trunk version changes.]

Tuned paraphrase rules both for educ and for openproof. The educ set is mostly for generating variant correct answers for the new Reading composition exercises in the Redbird Language Arts course. The openproof modifications are aimed at reducing the remaining ambiguity in the generated English outputs.

Release notes for trunk version 2013-03-19

Added two constructions motivated by the Sherlock Holmes corpus:

  1. adverbial clauses with gaps and verbs of saying, as in |You have, I presume, considered this.|
  2. adverbial indefinite NPs as VP modifiers, as in |He arrived a hero and departed a villain|

Also improved treatment of present participles as adjectives, employing verb predications for semantics.

Inflectional rules: instances made one-to-one with types

Release notes for version "ERG (1212)"

Stable tagged release, including updates of all tsdb/gold profiles. This release is also used for the treebanked profiles of DeepBank 1.0, the Wall Street Journal corpus included in the Penn Treebank.

Details and an online demo can be found at

Release notes for version "ERG (1111)"

Stable tagged release, including updates of all tsdb/gold profiles, plus the addition of two new profiles from the Tanaka corpus: rtc000 and rtc001. Details on ERG coverage of all gold profiles can be found on the Redwoods web page:

Update of `trunk' version as of August 2011:

Added coverage for the following phenomena:

Also made minor improvements for generation, including corrected trigger rules.

Gold profile updates are included only for csli, mrs, hike, cb, and jh1

Release notes for version "ERG (1010)"

Stable tagged release with full (manual) updates of all gold profiles including LOGON, WeScience, and (after a long hiatus) the Verbmobil and ecommerce treebanks, along with the newly added SemCor (semantically tagged portion of the Brown corpus - the first 3100 items so far). Details on current ERG coverage of these profiles can be found on the Redwoods web page:

Release notes for version "ERG (1007)"

Minor improvements for better coverage of WSJ corpus and of the education and speech application corpora.

Release notes for version "ERG (1004)"

This is intended as a 'stable' release, accompanied by a full manual update of the 'gold' treebanked profiles, and parse-ranking models trained on them.

Release notes for version "ERG (1003)"

Release notes for version "ERG (1002)"

Release notes for version "ERG (0909)"

Release notes for version "ERG (0907)" (the Barcelona release)

Release notes for version "ERG (0902)"

Release notes for version "LinGO (July-08)"

Release notes for version "LinGO (Apr-08)"

Release notes for version "LinGO (26-Jan-08)"

Final tuning for SciBorg's first treebank of six abstracts.

Final tuning for LOGON/HandOn treebank update.

Release notes for version "LinGO (24-Jan-08)"

A few corrections to lexical entries based on most recent HandOn fan-outs

Release notes for version "LinGO (23-Jan-08)"

Added a few missing lexical entries for degree specifiers

Release notes for version "LinGO (21-Jan-08)"

And still more tuning - maybe the final round - for HandOn

Release notes for version "LinGO (20-Jan-08)"

  1. More tuning for HandOn driven by 'sti' and 'vei' fan-out logs

Release notes for version "LinGO (17-Jan-08)"

  1. Minor adjustments to lexicon, grammar, and trigger rules for fine-tuning of HandOn system.

Release notes for version "LinGO (15-Jan-08)"

  1. Added vocabulary for HandOn based on missing predicates from NoEn
  2. Completed tuning of lexicon and preprocessing for HandOn English data
  3. One recent change that may affect transfer: Decomposition of N-V compounds like "snow-covered" and "T-marked"
    • used to be multi-words with single predicate, but are now constructed via compound rule, with the two component EPs and an additional linking EP with PRED |argument_rel| similar to |compound_rel|

Release notes for version "LinGO (Nov-07)"

  1. Treebanks
    • Updated all treebanks in erg/gold, but have not yet rebuilt jhpstg.mem file
  2. MRS quality improvements / harmonization
    • Added type constraints on ARG1s for several classes of modifiers
    • Corrected missing semantic link in P-PP construction "from behind the hill"
    • Removed spurious pron_rel from infinitival subordinate constructions like "Kim sang to impress Sandy."
    • Made minor changes to title construction:
    • changed pred name for post-head titles to be consistent with pre-head one
    • corrected rule for number-headed phrases like "page 3"

Release notes for version "LinGO (Oct-07)"

Added lexical coverage for vocabulary in the English data for the HandOn project, in this case keeping the large number of domain-specific proper names in a separate file 'handon-propers.tdl'. Also made some repairs to remaining inconsistencies in MRSs in the message-free universe.

In addition, did several bits of minor tuning of syntactic constructions in support of the DFKI Checkpoint project, and added first version of the token-mapping rules for PET's emerging support for this functionality. This release also includes an additional settings file for PET, 'mrs.set', to support development of generation capability for PET.

Note that only three of the 'gold' profiles (csli, hike, and mrs) have been updated in this release; the rest will follow shortly.

Release notes for version "LinGO (Jul-07)"

Added lexical coverage for several additional treebanked data sets, including Senseval 2-4, FraCaS, SciBorg, and Acrolinx (though the latter two data sets are not distributable). Also updated the full set of 'gold' profiles for the existing data sets.

PLEASE NOTE that this version requires an up-to-date version of the LKB to get correct behavior with the treebanked data in 'gold', since the derivation trees are now augmented with a specification of which root constraint was used to admit each tree.

Release notes for version "LinGO (21-Mar-07)"

The most significant change in this version of the ERG is the complete removal of messages, as announced at the Fefor DELPH-IN meeting to follow the completion of the LOGON demonstrator. This version is a nearly exact non-msg equivalent of the final LOGON version "LinGO (17-Mar-07)", so it should be straightforward to compare and contrast the two variants. In brief, the distinction among propositions, questions, and commands is now made via the value of the attribute SF ('sentence force' i.e., illocutionary force), a property of events. This attribute and its values are also used in the most recent release of the Grammar Matrix.
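The move from message predicates to SF can be pictured with a small lookup, shown here only as an illustrative sketch. Of the names used, only "prpstn_m_rel" appears in these notes; the other two message predicate names and the three SF value strings are assumptions made for the example and should be checked against the grammar itself.

```python
# Message predicate -> SF value.
# Only "prpstn_m_rel" is attested in these notes; the other predicate names
# and all three SF value strings are assumptions made for illustration.
MESSAGE_TO_SF = {
    "prpstn_m_rel": "prop",  # proposition
    "int_m_rel": "ques",     # question
    "imp_m_rel": "comm",     # command
}


def sentence_force(message_pred):
    """Return the assumed SF value for a message predicate, or None."""
    return MESSAGE_TO_SF.get(message_pred)
```

A consumer of older, message-based MRSs could use a table like this to compare its output with the SF value on the corresponding event in the message-free version.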

In addition, this release contains the following modifications/improvements, the first of which is also included in the final LOGON version:

Release notes for version "LinGO (17-Mar-07)" (Final version with messages)

Added missing lexical entries for the known-vocabulary held-out portion of the LOGON corpus (43 proper names and 5 common nouns)

Release notes for version "LinGO (20-Dec-06)" (Final LOGON version)

Release notes for version "LinGO (19-Dec-06)"

Release notes for version "LinGO (15-Dec-06)"

Release notes for version "LinGO (14-Dec-06)"

Release notes for version "LinGO (13-Dec-06)"

More adjustments for final LOGON integration:

Release notes for version "LinGO (01-Dec-06)"

Minor additions for final LOGON integration:

Release notes for version "LinGO (Nov-06)"

NOTE: Users of this version of the ERG are strongly encouraged to also obtain a current version of the LKB and [incr tsdb()], in order to benefit fully from recent enhancements.

Release notes for version "LinGO (13-Oct-06)"

Added entries for digit-orthography cardinal adjectives to help generator.

Release notes for version "LinGO (12-Oct-06)"

Maybe the final round of tuning for this integration:

  1. Merged falsely ambiguous lexical predicates:

    NEW                       OLD
    "_fine_a_for_rel"         "_fine_a_1_rel"
    "_good_a_at-for_rel"      "_good_a_for_rel"
    "_good_a_at-for_rel"      "_good_a_at_rel"
    "_good_a_at-for_rel"      "_good_a_1_rel"
    "_understand_v_by_rel"    "_understand_v_1_rel"

  2. Added missing trigger rules for it-cleft construction.

  3. Corrected a few minor errors in grammar rules.
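For downstream code that still refers to the old predicate names, the NEW/OLD pairs listed in item 1 can be applied as a simple rename table. The sketch below is one possible way to do that; the function names and the flat list-of-strings view of an MRS are hypothetical and not part of the grammar or its tools.

```python
# Old-to-new predicate names, copied from the NEW/OLD pairs in item 1.
MERGED_PREDICATES = {
    "_fine_a_1_rel": "_fine_a_for_rel",
    "_good_a_for_rel": "_good_a_at-for_rel",
    "_good_a_at_rel": "_good_a_at-for_rel",
    "_good_a_1_rel": "_good_a_at-for_rel",
    "_understand_v_1_rel": "_understand_v_by_rel",
}


def update_predicate(pred):
    """Return the post-merge name for pred, or pred itself if unaffected."""
    return MERGED_PREDICATES.get(pred, pred)


def update_predicates(preds):
    """Rewrite a list of predicate names (a hypothetical flat view of an MRS)."""
    return [update_predicate(p) for p in preds]
```

Because three old names collapse into "_good_a_at-for_rel", the mapping is not reversible: the old sense distinction cannot be recovered from the new name.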

Release notes for version "LinGO (10-Oct-06)"

Still more minor tuning

  1. Corrected entry for "guess" unknown-noun lex entry to work in compounds
  2. Corrected NP fragment rules to allow fragments that are conjoined NPs
  3. Enabled entry for prep "to" to also modify proper names.

Release notes for version "LinGO (09-Oct-06)"

More minor tuning for impending LOGON release:

  1. Fixed spelling of 'considerred' for dative passive form
  2. Enabled generation of implicit NP coordination
  3. Corrected lexical entry's PRED name for 'edge'
  4. Corrected modification of imperatives
  5. Added missing topmost message for sentence-initial conjunction
  6. Allowed Adj-N as title
  7. Added missing entries for 'follow' and 'transport': NP+PP-dir
  8. Corrected multiple SEM-I entries for 'choose' verb
  9. Added missing analysis of NP's + PP construction
  10. Added lexical entry for 'mountain pasture' as title
  11. Renamed inconsistent degree adverb preds: "_a+little_x_deg_rel" "_steeply_x_deg_rel" "_directly_x_deg_rel" "_shortly_x_deg_rel"
  12. Added entry for adj 'so' ('true') with expl-it subj: "It is so that ..."

Note that only the gold profiles for 'csli', 'mrs', and 'hike' have been updated for this release.

Release notes for version "LinGO (05-Oct-06)"

Minor tuning for upcoming LOGON release:

  1. Corrected PRED name for "downstairs". Old SEM-I entry: "_downstairs_a_1_rel" : ARG0 e, ARG1 u. New: _downstairs_p_rel : ARG0 e, ARG1 u.
  2. Corrected nbar-fragment rule to also analyze measure-nouns like "centimeter"

Release notes for version "LinGO (27-Sept-06)"

Release notes for version "LinGO (11-Sept-06)"

Release notes for version "LinGO (18-Jul-06)"

Release notes for version "LinGO (Jul-06)"

Internal release notes for version "LinGO (08-Jun-06)"

Small corrections to semantics of title nouns, both alone and in compounds. Note that only the 'gold' profiles for csli, mrs, and hike have been updated.

Internal release notes for version "LinGO (24-May-06)"

  1. Added lexical entries needed for remaining LOGON development corpus (Turglede and Preikestolen texts).
  2. Made semantics for comparatives, superlatives, and much/many more consistent.
  3. Reduced generation output of variants with commas for modification & coord.
  4. NP-coord - Corrected semantics, adding qeq (more consistent, and more scopes).
  5. Free rels - Made embedded message be prpstn_m_rel, not underspecified.
  6. Corrected semantics errors throughout, using Utool.
  7. Added treebank profiles for ps (Preikestolen) and tg (Turglede) data.

Internal release notes for version "LinGO (13-Feb-06)"

Corrected semantics for quantifiers 'most' and 'the most', dropping the predicate 'most_q_rel' in favor of decomposed semantics using the usual "many-much_a_rel".

Internal release notes for version "LinGO (09-Feb-06)"

Minor improvements in SEM-I content, and correction of an item in gold MRS.

Internal release notes for version "LinGO (06-Feb-06)"

More harmony for depictives, now with same semantics as other subordinate clauses. Also corrections to SEM-I for directional PP verbs.

Internal release notes for version "LinGO (03-Feb-06)"

Improved harmony:


Release notes for version "LinGO (Jan-06)"

PLEASE NOTE: This version of the ERG requires up-to-date versions of both the LKB and PET, since it takes advantage of improvements in the treatment of morphology in the LKB, and also depends on a consistent treatment of special characters like \?, (, and \".

This version includes minor tuning adjustments to the lexicon and grammar, to improve overall precision and coverage on the data sets included in the Redwoods 6 (Norwegian Growth) treebank, which has been expanded to include about 5000 items from the LOGON development corpus on Norwegian back-country tourism. The single-best-parse profiles for this additional data appear as usual in the subdirectory 'gold', in the six directories jh0 - jh5.

In addition, the grammar now includes a semantic interface file 'erg.smi' which currently specifies the minimal properties of each lexical predicate, including its name and its arguments, their types, and their optionality. This file should soon also include the grammar predicates (those introduced by rules rather than by lexical entries), as well as the set of abstract predicates which are intended as part of the external interface to the grammar.

Release notes for version "LinGO (05-Dec-05)"

  1. Punctuation - Eliminated the duplication in files that was formerly needed for minor differences between the LKB and PET, now resolved.
  2. Lexicon - Added vocabulary needed for the LOGON development corpus on tourism in the Norwegian mountains.
  3. Generation - Tuned the trigger rules for introducing semantically empty lexical entries, for improved efficiency.
  4. Treebanks - There are now additional profiles jh* in the directory gold, for several segments of the LOGON development corpus for the Jotenheimen region. In this release, only jh1 is updated; the other five sections will follow soon. The other (non-LOGON) profiles are all up to date.

Release notes for version "LinGO (23-Nov-05)"

  1. Corrected lexical entries for "write" and "unevaluated", as well as the preprocessor-related "twodigitdomersatz". Also added entry for "untrafficked".
  2. Repaired error in comma punctuation which was causing overgeneration.
  3. Corrected error in lexical types for day-of-month entries which was producing ill-formed MRSs.

Release notes for version "LinGO (15-Nov-05)"

  1. Added and corrected lexical entries and SEM-I

    • Most interestingly, added some entries for 'kind' readings, as for the noun "bear" in "they hunted bear." The predicate names are distinct, since presumably these would be derived from some lexical rule producing a distinct sense, and take the form "__n_kind_rel"
    • Changed the single entry for the adjective "born" so it is treated semantically more like the passive participle it once was, and now introduces the predicate "_bear_v_2_rel" with a distinct sense of the verb "bear" from that in "Kim can't bear to lose"
    • Made changes in response to requests from JTL for transfer.
  2. Tuned grammar in minor respects to improve consistency in treebanking the JH corpus.

Release notes for version "LinGO (10-Nov-05)"

  1. Corrected SEM-I and lexicon errors noted by JTL, and improved constraints on lexical types with handle arguments so the SEM-I reflects these (introducing e.g. [ ARG3 h ] instead of formerly [ ARG3 u ]).
  2. Added a few more lexical entries needed for JH, and some minor syntactic additions for constructions like "Try it yourself" and "Kvame became sole owner".

Release notes for version "LinGO (05-Nov-05)"

Quick additional release to make improvements for treebanking Jotenheim

  1. Punctuation - Cleaned out a few more temporary patches in preprocessor and lexicon, especially for |"|, |(|, |)| which had had substitutions.
  2. Preprocessing - Added a few more cases revealed by Jotenheim data.
  3. Lexicon - Added a few missing multi-words that emerged from initial treebanking, and changed a few more formerly relational nouns to just ordinary nouns, to avoid spurious ambiguity: 'top, bottom, side, front, back'. Also (finally) corrected the pred names for "anybody", "someone", etc. to now use _any_q_rel rather than any_q_rel, and same for _some_q_rel.
  4. Fixed TPC assignments in relative clauses and for 'wonder'.
  5. Corrected nominalization, which became too constrained in an attempt to avoid spurious ambiguity.

Release notes for version "LinGO (01-Nov-05)"

  1. Tuned generation trigger rules to reduce overgeneration and improve efficiency. Also attempted to make more consistent use of TPC, PSV, allowing underspec.
  2. Revised morphology to benefit from improvements in LKB and later in PET, now that irregularly inflected words can co-exist with punctuation suffixes (so eliminated files inflr-pet.tdl, inflr-pnct-pet.tdl, robust.tdl, and robust-pnct.tdl).
  3. Reduced inventory of scopal adverbs, and improved consistency for adverbs. Note in particular that most so-called discourse adverbs have been converted to scopal adverbs, and the conjunctions 'and, or, but' are now treated as such even when they are sentence-initial.
  4. Corrected some errors in lexical types and in syntactic rules; in particular fixed type for mass_ppcomp, which was broken, and improved nbar-coordination whose semantics was not ideal.
  5. Some other lexical changes:
    • 'both' determiner is now logically equivalent to "the two".
    • 'respect (for)' wasn't entered as a mass noun, now is.
    • 'cross_over_v1, _v2' removed from lexicon (now done compositionally)
    • various entries for cardinal "one" had CARG "01", now just CARG "1".

Release notes for version "LinGO (09-Sep-05)"

  1. Repaired punctuation overgeneration for non-WH topicalization, by removing a licensing for constructions like "Who won? asked Kim." (not frequent in our data set, though seen in Rondane).
  2. Removed STATIVE from grammar, since no longer used
  3. Removed spurious fragment rules only used for parsing dictionary definitions
  4. Corrected lexical predicates in SEM-I: _have_v_to_rel => "_have_v_to_rel" (from type to string); "_fail1_v_1_rel" => "_fail_v_1_rel" (misspelling).
  5. Added missing lexical entry for unaccusative (intransitive) "weaken"
  6. Added lexical entries for "move" and "drive" analogous to "put", still using the same inventory of predicates in the SEM-I.
  7. Split the lexical rule for prenominal verbal modifiers into two rules, one for present participles and one for passives, to avoid spurious verb-particle entries which should be disallowed as modifiers (since the particle can't be present).
  8. Modified the types for raising verbs taking an infinitival VP complement so they uniformly combine with the infinitival "to" which introduces a message.
  9. Added reentrancies for TPC and PSV so the appropriate values appear on messages in embedded clauses.
  10. Improved generator efficiency by adding the grammar-internal feature --TPC, to which new generator compliance rules assign a value based on the public feature TPC.
  11. Also further refined trigger rules, and exploited the newly invented compliance rules which adjust the input MRS to comply with grammar-internal constraints (so far restricted to assigning a value for --TPC based on TPC).
  12. Again for efficiency, added constraints on events introduced by adverbs and degree specifiers so they will not trigger lexical entries in generation.
  13. Once again corrected the reported failure to generate some examples like "Abrams could." which made use of ellipsis_rel as underspecification of ellipsis_ref_rel.
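The effect of the compliance rules in items 10 and 11 can be pictured with the following sketch. It is only an illustration: the dictionary layout is a hypothetical stand-in for the generator's input MRS, the "+" value in the usage note is invented, and copying the value of TPC unchanged is an assumption, since these notes say only that --TPC is assigned "based on" TPC.

```python
def apply_tpc_compliance(variables):
    """Return a copy of a variable-property table with --TPC filled in from TPC.

    `variables` maps a variable name to a dict of its properties. This layout
    is hypothetical and is not the representation used by the LKB generator.
    """
    result = {}
    for name, props in variables.items():
        new_props = dict(props)
        # Assumption: the internal feature simply mirrors the public one.
        if "TPC" in new_props and "--TPC" not in new_props:
            new_props["--TPC"] = new_props["TPC"]
        result[name] = new_props
    return result
```

For example, `apply_tpc_compliance({"e2": {"TPC": "+"}})` yields a table in which `e2` carries both `TPC` and `--TPC`, while variables without `TPC` are left unchanged.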

Release notes for version "LinGO (05-Sep-05)"

Improved generation with punctuation and fragments. Updated Verbmobil section of Redwoods treebank, and filled in missing gold profiles.

Release notes for version "LinGO (02-Sep-05)"

Minor update: Modified trigger rules to use unification rather than subsumption, and added some abstractions over trigger rules, in mtr.tdl. Further reduced spurious commas preceding modifiers in generation. Punctuation rules now compatible with current LKB morphology. Infinitival subjects no longer introduce nominalization (as in "To err is human.").

Release notes for version "LinGO (15-Aug-05)"

Minor update: The usual normalizing of predicate names, this time mostly for expletive-it-taking predicates. Also some further tuning of trigger rules, and change to verb_synsem to make sure uninflected lexical entries already identify their INDEX and KEYREL.ARG0, for better generator initialization.

Release notes for version "LinGO (09-Aug-05)"

Minor update for yet more consistency in predicate names, especially for relational nouns and adjectives, respectively, to get their related entries to match in predicate names. Also corrected ordering error in prp_infl_rule and added a few additional lexical entries for the LOGON development corpus.

Release notes for version "LinGO (05-Aug-05)"

Minor update to improve consistency in predicate naming conventions, and to restore the 'chunking' roots in roots.tdl which are used experimentally in trying to generate from fragmented MRSs.

Note that in this release, only the 'gold' profiles for 'csli', 'mrs', and 'hike' have been updated.

Release notes for version "LinGO (Jul-05)"

This release incorporates several significant changes to the previous release, but at long last also includes a first step at documenting an external semantic interface for the grammar. The changes will soon be described in a little more detail on the ERG Wiki, but in summary:

  1. Punctuation as affixation

    Previous versions of the grammar implemented a treatment of punctuation adopting a standard but linguistically dubious strategy of using a preprocessor to make all punctuation marks distinct tokens, adding spaces around each one. This version implements an analysis which leaves the input string unchanged with respect to punctuation (except for apostrophes), and treats the punctuation marks as spell-changing affixes. This change creates backward incompatibilities with earlier treebanks because the tokenization for each sentence is now different. A few infelicities remain from making this change, including

    • minor inconsistencies in the readers of affixation rules for the LKB and PET (and even for previous and current versions of the LKB)
    • imperfect interaction of irregular inflected forms and punctuation
    • imperfect interaction of multi-words and punctuation

    There are work-arounds for some of these, awaiting better resolution.
  2. Semantics

    a. Semantically empty prepositions no longer introduce an EP (they used to add an EP whose predicate name ended in "_sel_rel", for lexically 'selected'). So the generator trigger rules have been augmented to automatically introduce the necessary lexical entries for generation, currently based on predicate-naming conventions for the lexical entries that select empty prepositions.

    b. Messages now introduce an additional attribute, ARG0, whose value is the event of the highest-scoping verbal EP within the scope of the message. The main motivation is to make it simpler for applications to identify the relevant event properties of a clause's semantics without looking 'inside' the clause's MRS.

    c. All lexical predicates now have some value in the 'sense' field of the predicate name. (Background: by convention in the ERG, each lexical predicate name has the following form: _ORTH_POS_SENSE_rel where ORTH is the lexeme's orthography, POS is a coarse-grained sense distinction drawing from the vocabulary [v n a p x q c], and SENSE is an arbitrary sequence of characters (excluding |_|), and where each of the fields is separated by an underscore. Earlier, the sense field could have been left empty.) The default value for the sense field is now '1'.

    d. Relational nouns now specify in their sense field the orthography of the preposition marking their oblique complement (usually 'of').

    e. Tag questions previously discarded the semantics of the tag phrase, contrary to the monotonicity assumption in the ERG. This is now corrected, with the result that the semantics of sentences with tag questions is now rather more baroque. The main benefit of the reanalysis is that lexical rules now properly always preserve the semantics of their input lexemes.

    f. Sentential subjects were previously analyzed via a nominalization rule. This simplified the syntactic analysis of "That Abrams arrived annoyed Browne" since the "annoy" lexeme could always unify its ARG1 value with the semantic index of its subject. But the resulting asymmetry for the 'extraposed' and non-extraposed variants of lexemes like 'annoy' was annoying. This version of the grammar now provides the same MRS for both variants ('It annoyed Browne that Abrams arrived' and the above example), via a syntactic variant of an 'it-extraposition' lexical rule, with thanks to Ann Copestake for the suggested implementation. One consequence is that the earlier treatment of examples like "The problem was that Abrams arrived" no longer works, since the identity copula was being used, and requires its complement to supply a referential index. So there is also yet another entry for the verb 'be', which supplies an EP similar to the identity 'be'.

    g. Verbal modifiers of nouns were being given an inconsistent semantics, with postnominal modifiers as in 'people singing arias' supplying a message for the modifier phrase, but with prenominal modifiers as in 'the singing people' not contributing a message. In this version of the grammar, verbal projections now always supply a message, making the world a little more consistent, but leaving a sharper contrast now between "the singing children" and "the interesting children" where 'interesting' is analyzed as an adjective and hence does not supply a message.

  3. Lexicon

    New lexical entries have been added drawn from the Norwegian tourism domain of the LOGON development corpus, bringing the current number of lexemes to 22,750 for this release, of which about 2700 are proper names.

  4. SEM-I

    A first draft of the semantic interface for the grammar is now presented in the file erg-full.smi, including the predicate names and semantic arguments of all predicates introduced either by lexical entries or by the grammar (either via lexical/syntactic rules or via abstractions over more specific predicates). Documentation of this file is under active development.

  5. Naming conventions

    The feature name DIVISIBLE on referential indices has been shortened to DIV for better readability of MRSs.

  6. LKB warnings on grammar loading

    The LKB's new and improved treatment of morphology offers several advantages, and the current version of the grammar benefits from these, but still results in some warning messages when loading.
    Users can ignore these messages for now, while the developers resolve the underlying causes. The first is about the 'punct_bang_rule', and the others warn of lexical rules that can feed themselves.
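The predicate-naming convention described under Semantics (item 2c), with a leading underscore, the fields ORTH, POS, and SENSE separated by underscores, and a final "_rel", lends itself to mechanical checking. The following is a minimal sketch of such a check, assuming that ORTH itself contains no underscore; the class and function names are hypothetical and are not part of the grammar or its tools.

```python
from typing import NamedTuple, Optional

# The POS vocabulary [v n a p x q c] given in the naming convention.
POS_TAGS = set("vnapxqc")


class LexPred(NamedTuple):
    orth: str
    pos: str
    sense: Optional[str]  # None when the name has no sense field


def parse_lexical_predicate(name):
    """Split a lexical predicate name of the form _ORTH_POS_SENSE_rel."""
    if not (name.startswith("_") and name.endswith("_rel")):
        raise ValueError(f"not a lexical predicate name: {name!r}")
    # Drop the leading underscore and the trailing "_rel", then split fields.
    fields = name[1:-len("_rel")].split("_")
    if len(fields) not in (2, 3) or fields[1] not in POS_TAGS:
        raise ValueError(f"unexpected field layout: {name!r}")
    sense = fields[2] if len(fields) == 3 else None
    return LexPred(fields[0], fields[1], sense)
```

Names without a sense field are accepted and returned with `sense` set to `None`, which makes it easy to list the entries that still need the new default value '1'. Grammar predicates without the leading underscore, such as poss_rel, are rejected.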

Release notes for version "LinGO (30-Apr-05)"

This is a minor update to the Apr-05 version, including some lexical additions, adjustments to the semantic predicate hierarchy, and tuning of syntactic analyses, all designed to improve end-to-end translation for LOGON. The only substantive difference is in the analysis of possessive constructions, where the grammar now produces nearly identical MRSs for the two noun phrases "our book" and "a book of ours", using a new lexical entry for "ours" distinct from the ordinary "ours" of "ours are not ready". One consequence of this reanalysis, which unifies the treatment of the two possessive constructions, is that the two arguments in the old 'poss_rel' EP have been reversed: what was the ARG1 is now ARG2, and vice versa.
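Code that consumes MRSs from earlier versions may need to account for the reversed arguments of 'poss_rel'. The sketch below shows one way to convert an old-style EP; the dictionary layout and the variable names in the usage note are hypothetical stand-ins, not a format defined by the ERG.

```python
def swap_poss_args(ep):
    """Return a copy of an EP with ARG1 and ARG2 exchanged if it is a poss_rel.

    `ep` is a plain dict such as
    {"pred": "poss_rel", "ARG1": "x5", "ARG2": "x9"} (hypothetical layout).
    EPs with any other predicate are copied unchanged.
    """
    swapped = dict(ep)
    if ep.get("pred") == "poss_rel":
        swapped["ARG1"], swapped["ARG2"] = ep["ARG2"], ep["ARG1"]
    return swapped
```

Applying the function twice restores the original EP, so the same helper serves for converting in either direction.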

Release notes for version "LinGO (Apr-05)"

Overview of changes:


BNC - Based on months of hard labor by former Stanford students Hansook Lee and Mike Orme (with help from Ara Kim), the lexicon now contains all verb subcat entries for the 2000 most frequent verb stems in the British National Corpus. This should enable some interesting experimentation in automated lexical acquisition, since there are fewer lexical types that need to be hypothesized for non-verbs.

GCIDE - The lexicon now also contains entries for all words observed in the first 10,000 definition 'sentences' in the GNU Contemporary International Dictionary of English (GCIDE), to enable more precise evaluation of syntactic coverage of these definitions.

Shanghai - Based on some 1500 entries constructed by Yi Zhang at CoLI in Saarbruecken, the lexicon now also contains entries for most of the words found in a Web-derived corpus on tourism in Shanghai, analogous to the Rondane corpus built by Becky Neil for the LOGON project in Norway.

--MRS quality--

Based on a substantial implementation effort by Stefan Thater and colleagues at CoLi, Saarbruecken, to check for well-formedness of MRSs produced by the grammar for the Redwoods and Rondane corpora, many errors were identified, enabling improvements in MRS construction in the ERG. Further improvements were enabled by the systematic use of existing capabilities in the LKB for diagnosing MRS errors in ERG analyses. While the current release still produces some flawed MRSs for these data sets, they are largely confined to a small inventory of known and somewhat problematic minor phenomena.


Drawing on the combined expertise of Stephan Oepen and Francis Bond, the ERG is now fully Unicode-compliant, including the PSQL database. This enables proper representation in the lexicon for orthography of non-English proper names such as "østerbø", and archaic English spellings such as "coöperation". The necessary infrastructure for Unicode is admirably and demonstrably in place in the LKB, PET, [incr tsdb()], and PostgreSQL.


Fragments - Further work on the treatment of fragments has been motivated largely by the effort to parse the definition sentences in GCIDE, and to give them a consistent semantic representation. New fragment types now licensed include VPs and PPs with NP gaps, as in "To devour." or "Relying on.".

Locative inversion - The grammar now analyzes some locative inversion phenomena, currently restricted to sentences headed by the finite copula 'be' as in "Near the park is a large dog" but not (yet) "Near the park stood a large tree". These appear with some frequency in the Rondane data, and have also been waiting patiently for twenty years in the CSLI test suite.

'Free' parentheticals - Sentences containing some classes of parenthetical material (which would not survive in situ without the parentheses) will now be analyzed, though further work will be needed in designing the target semantics. Example now covered: "That dog (you should see its owner!) barked."

--Changed analyses--

Modification - Based on more systematic analysis of phenomena found in the Rondane corpus, and corroborated in the Shanghai corpus, the ERG now permits more interesting modification structures. Prepositional phrases, formerly restricted to modifying only VPs and nominal phrases, can now also modify adjective phrases and other PPs. Similarly, adverbs can now also modify adjective phrases, as in "the wildly happy dog barked", freeing the grammar from its former requirement that duplicate degree-specifier lexical entries be added for many adverbs.

--New domains--

The GCIDE corpus has been taken from the GCIDE web site, and carefully prepared by Eric Nichols at NTT in collaboration with Francis Bond, including identification of sentence breaks, normalization, and formatting, all of which are now automated via Perl scripts converting the original GCIDE data into, among other things, an 'item' file format for use with the fine system.

The Shanghai corpus is being collected by Yi Zhang in Saarbruecken as part of his thesis work, and consists of text on tourism in Shanghai, written in English and mostly but not entirely by native English speakers. The corpus may still be revised, so a profile of this data is not (yet) being distributed with the ERG.