olzama opened this issue 2 years ago
Here's the list of the files in the current release. I filled out the mapping where I could, but some things I cannot easily map to anything described in Flickinger 2011:
| Profile | Mapping to Flickinger 2011 |
| --- | --- |
| csli | Constructed |
| ccs | Constructed |
| control | Constructed |
| esd | Constructed |
| fracas | Constructed |
| handp12 | Handpicked? From where? |
| mrs | Constructed |
| pest | ??? |
| sh-spec | Sherlock Holmes |
| sh-spec-r | Sherlock Holmes |
| trec | Constructed |
| cb | The Cathedral and the Bazaar |
| ec* | E-commerce |
| hike | LOGON |
| jh* | LOGON |
| tg* | LOGON |
| ps* | LOGON |
| rondane | LOGON |
| rtc* | ??? |
| bcs | ??? |
| scm | SemCor?.. |
| vm* | Verbmobil |
| ws* | Wikipedia |
| wlb03 | ??? |
| wnb03 | ??? |
| peted | ??? |
| petet | ??? |
| ntucle | ??? |
| omw | ??? |
| wsj* | Wall Street Journal |
Can anyone help complete this table?
OMW stands for Open Multilingual Wordnet. This is a sample of 2000 sentences from the English synset definitions in @fcbond's OMW (http://compling.hss.ntu.edu.sg/omw/). As far as I know, this is mostly Princeton WordNet 3.0 with small fixes.
I agree that we do need this documentation in the wiki. BTW, it is not always clear that WSJ is also part of the OntoNotes and PropBank datasets (see https://github.com/propbank/propbank-release/issues/14).
Is `ws*` all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?
> Can anyone help complete this table?
>
> Is `ws*` all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?
This just means all corpora that start with "ws".
Thank you, @oepen !
Here's the same table updated with the info from `index.lisp`. For some items, I am still missing an adequate description though...
csli | "CSLI testsuite" | Constructed examples | is there a citation?.. | and what exactly does it mean?.. |
ccs | "Collab Compuational Semantics" | Constructed examples | citation? | and I don't know what this means either... |
control | "Control examples from literature" | Constructed examples | clear enough I guess | though some provenance would be nice |
esd | "ERG Semantic Documentation Test Suite" | Constructed | https://github.com/delph-in/docs/wiki/ErgSemantics | |
fracas | "FraCaS Semantics Test Suite" | textual inference problem set? | Cooper et al. 1996 | https://gu-clasp.github.io/multifracas/D16.pdf |
handp12 | The Cambridge grammar of the English language, Ch12 | ??? | Huddleston and Pullum 2005 | Not available online; What is the relationship of the chapter and the test suite? |
mrs | MRS test suite | Constructed examples | https://github.com/delph-in/docs/wiki/MatrixMrsTestSuite | |
pest | ??? | ??? | ??? | ??? |
sh-spec | Sherlock Holmes | late 19th century fiction | Conan Doyle, 1892 | https://www.gutenberg.org/files/1661/1661-h/1661-h.htm#chap08 |
sh-spec-r | what's this second one? | |||
trec | "TREC QA Questions (Ninth conference" | Constructed examples? | Can't find this specific event | |
cb | The Cathedral and the Bazaar | technical essay | Raymond, 1999 | http://www.catb.org/~esr/writings/cathedral-bazaar/ |
ec* | E-commerce email (YY) | email (customer service etc) | ||
hike | LOGON | travel brochures | ||
jh* | LOGON | travel brochures | ||
tg* | LOGON | travel brochures | ||
ps* | LOGON | travel brochures | ||
rondane | LOGON | travel brochures | ||
rtc* | ??? | ??? | ??? | ??? |
bcs | "Brown Corpus Sampler (SDP 2015 Task)" | Oepen et al. 2015 | https://aclanthology.org/S15-2153.pdf | |
scm | "SemCor Melbourne Sampler (Disjoint from BCS)" | same as above?.. | ||
vm* | Verbmobil | scheduling dialogues | Is Wahlster 1993 the citation?.. | http://verbmobil.dfki.de/ww.html |
ws* | Wikipedia | Encyclopaedic texts about computational linguistics?.. | anything more we know about them? | |
wlb03 | ??? | ??? | ??? | ??? |
wnb03 | ??? | ??? | ??? | ??? |
peted | "Evaluation By Textual Entailment (Development)" | what does it mean? | ||
petet | "Evaluation By Textual Entailment (Test)" | what does it mean? | ||
ntucle | Something to do with NTU but what? | |||
omw | Open Multilingual Wordnet | ? | http://compling.hss.ntu.edu.sg/omw/ | |
wsj* | Wall Street Journal | News articles | https://catalog.ldc.upenn.edu/LDC93S6A |
CSLI is a rebranding of the legendary HP test suite: https://www.let.rug.nl/nerbonne/papers/Old-Scans/Toward-Eval-NLP-1987.pdf
for WNB and WLB: http://www.lrec-conf.org/proceedings/lrec2012/pdf/774_Paper.pdf
PETE: https://link.springer.com/article/10.1007/s10579-012-9200-5
Many thanks, @oepen . Let me know if you think that this table could go directly into the wiki e.g. into RedwoodsTop.
What about adding some extra information about the size of each treebank? I am particularly interested to know how many sentences we have with golden MRS. Does anyone have this number? Is there any other ERG gold-analyzed treebank besides the data inside the ERG repository under `tsdb/gold`?
In `tsdb/gold` we have 131,401 sentences:
```
% for f in `find . -type f -name item.gz`; do echo $f, `gzcat $f | wc -l`; done | awk 'BEGIN {s=0} {s = s + $2} END {print s}'
131401
```
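For anyone who prefers to do this outside the shell, here is a minimal Python sketch of the same count (my own, not part of any release); it assumes the gold profiles live under `tsdb/gold` in an ERG checkout and simply counts lines in every `item.gz`:

```python
import gzip
from pathlib import Path

# Assumed location of the gold profiles inside an ERG checkout.
GOLD = Path("tsdb/gold")

total = 0
for item_gz in sorted(GOLD.rglob("item.gz")):
    with gzip.open(item_gz, "rt", encoding="utf-8", errors="replace") as f:
        # Every line of the item relation is one input sentence.
        total += sum(1 for _ in f)

print(total)  # should match the 131,401 reported above (modulo ERG version)
```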
Two profiles are 'virtual': wescience and redwoods. But redwoods mentions profiles that do not exist in the `tsdb/gold` folder:
Questions:
@oepen, is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?
Finally, there are sentences duplicated in the profiles:
```
% for f in */item.gz; do gzcat $f | awk -F "@" '{print $7}' >> sentences; done
% sort sentences | sort | uniq | wc -l
105820
```
some examples:
```
% sort sentences | sort | uniq -c | sort -nr | head -20
3288 MIME-Version: 1.0
3288 Content-Type: text/plain; charset=iso-8859-1
3288 Content-Transfer-Encoding: 8bit
303 Message-ID: <1043735849\smv.stanford.edu>
301 Message-ID: <1043735850\smv.stanford.edu>
300 Message-ID: <1043735851\smv.stanford.edu>
295 Message-ID: <1043735854\smv.stanford.edu>
295 Message-ID: <1043735852\smv.stanford.edu>
294 Message-ID: <1043735855\smv.stanford.edu>
292 Message-ID: <1043735853\smv.stanford.edu>
290 Message-ID: <1043735857\smv.stanford.edu>
289 Message-ID: <1043735858\smv.stanford.edu>
289 Message-ID: <1043735856\smv.stanford.edu>
275 Message-ID: <1043735848\smv.stanford.edu>
268 okay.
227 From: stefan\syy.com
204 From: dan\syy.com
202 From: monique\syy.com
200 From: remy\syy.com
200 From: brian\syy.com
```
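The same duplicate check can be sketched in Python; this mirrors the awk pipeline above, naively splitting each line of `item.gz` on the "@" field separator and taking field 7 (which I assume is `i-input` in the standard item relation), with the `tsdb/gold` path assumed as before:

```python
import gzip
from collections import Counter
from pathlib import Path

# Assumed location of the gold profiles, as above.
GOLD = Path("tsdb/gold")

sentences = Counter()
for item_gz in sorted(GOLD.glob("*/item.gz")):
    with gzip.open(item_gz, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("@")
            if len(fields) > 6:
                # Field 7 of the item relation (index 6) is assumed to be i-input.
                sentences[fields[6]] += 1

print(len(sentences))  # number of distinct sentences (105,820 in the run above)
for text, count in sentences.most_common(20):
    print(count, text)  # the most frequently repeated inputs
```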
> What about adding some extra information about the size of each treebank? I am particularly interested to know how many sentences we have with golden MRS. Does anyone have this number?
Alex, the redwoods.xlsx file (which you can find in the release) has the sentence numbers!
I found a link to the redwoods.xls file at https://github.com/delph-in/docs/wiki/RedwoodsTop, but that page points to http://svn.delph-in.net/erg/tags/1214/etc/redwoods.xls. In the etc folder of the ERG trunk in the repository, I found a newer version of this file.
If I am reading it right, we have 97,286 sentences fully disambiguated in the redwoods collection, right? Still more than the 59,255 AMR sentences, but a less impressive number. Is this the actual number of sentences with gold MRS that we have available? And what is the status of the sentences in the profiles not included in redwoods?
I noticed that the `sh-spband-r` profile is not listed in the redwoods.xls spreadsheet. What is this?
> is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?
broadly speaking, i guess one could say that CCS (and a series of additional meetings in a similar spirit) was part of the build-up for the MRP shared tasks. but one could just as well say that the desire to compare different frameworks and specific analyses has been a motivating force for dan, emily, myself, and others for at least the past decade. sitting down to compare individual sentences in great depth (in the CCS spirit) is one technique we have used; the SDP and MRP shared tasks series was a different approach with some of the same underlying motivation.
regarding the EDS data in MRP 2019 and 2020, it comes from the 1214 ERG release, aka DeepBank 1.1.
- Instead of "jh0", "jh1", "jh2", "jh3", "jh4" and "jh5" we have only the profiles "jh", "jhk" and "jku"
- Instead of "tg1" and "tg2" we have "tg", "tgk" and "tgu"
- Instead of "sc01", "sc02" and "sc03" we have only "scm"
yes, with the transition from the original [incr tsdb()]-based treebanking environment to FFTB, profiles became a lot smaller, seeing as only the packed forest is recorded rather than a 500-best list of full derivations for each input. that meant that dan could undo some sub-divisions of collections that logically belonged together (JH, TG, and SC). post-1214, he concatenated these profiles back together.
So we also have DeepBank in addition to the wescience and redwoods "virtual" profiles? According to https://github.com/delph-in/docs/wiki/DeepBank it is the `wsj*` profiles. These are released at http://metashare.dfki.de/repository/browse/deepbank/d550713c0bd211e38e2e003048d082a41c57b04b11e146f1887ceb7158e2038c/ and sum up to 43,541 sentences. But I suppose the `wsj*` profiles in the ERG repository are updated with the ERG 2020 release, so the META-SHARE data is outdated.
I suggest that we expand the section about the datasets that constitute the ERG treebanks: https://github.com/delph-in/docs/wiki/RedwoodsTop
Currently, the wiki page refers the reader to Flickinger 2011, but that work is not easily available online (I don't think?). Furthermore, even if one has it, it is still not fully obvious how to map the datasets described there to the files in the ERG release (for some it is obvious, for others it is not).