olzama opened this issue 2 years ago
Here's the list of the files in the current release. I filled out the mapping where I could, but some things I cannot easily map to anything described in Flickinger 2011:
| Profile | Mapping to Flickinger 2011 |
| --- | --- |
| csli | Constructed |
| ccs | Constructed |
| control | Constructed |
| esd | Constructed |
| fracas | Constructed |
| handp12 | Handpicked? From where? |
| mrs | Constructed |
| pest | ??? |
| sh-spec | Sherlock Holmes |
| sh-spec-r | Sherlock Holmes |
| trec | Constructed |
| cb | The Cathedral and the Bazaar |
| ec* | E-commerce |
| hike | LOGON |
| jh* | LOGON |
| tg* | LOGON |
| ps* | LOGON |
| rondane | LOGON |
| rtc* | ??? |
| bcs | ??? |
| scm | SemCor?.. |
| vm* | Verbmobil |
| ws* | Wikipedia |
| wlb03 | ??? |
| wnb03 | ??? |
| peted | ??? |
| petet | ??? |
| ntucle | ??? |
| omw | ??? |
| wsj* | Wall Street Journal |
Can anyone help complete this table?
OMW stands for Open Multilingual Wordnet. This is a sample of 2000 sentences from the English synset definitions in @fcbond's OMW (http://compling.hss.ntu.edu.sg/omw/). As far as I know, this is mostly Princeton WordNet 3.0 with small fixes.
I agree that we do need this documentation in the wiki. BTW, it is not always clear that WSJ is also part of the OntoNotes and PropBank datasets (see https://github.com/propbank/propbank-release/issues/14).
Is `ws*` all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?
> Can anyone help complete this table?
>
> Is `ws*` all the https://github.com/delph-in/docs/wiki/WikiWoods? What does the star mean?
This just means all corpora that start with "ws".
Thank you, @oepen !
Here's the same table updated with the info from `index.lisp`. For some items, I am still missing an adequate description though...
csli | "CSLI testsuite" | Constructed examples | is there a citation?.. | and what exactly does it mean?.. |
ccs | "Collab Compuational Semantics" | Constructed examples | citation? | and I don't know what this means either... |
control | "Control examples from literature" | Constructed examples | clear enough I guess | though some provenance would be nice |
esd | "ERG Semantic Documentation Test Suite" | Constructed | https://github.com/delph-in/docs/wiki/ErgSemantics | |
fracas | "FraCaS Semantics Test Suite" | textual inference problem set? | Cooper et al. 1996 | https://gu-clasp.github.io/multifracas/D16.pdf |
handp12 | The Cambridge grammar of the English language, Ch12 | ??? | Huddleston and Pullum 2005 | Not available online; What is the relationship of the chapter and the test suite? |
mrs | MRS test suite | Constructed examples | https://github.com/delph-in/docs/wiki/MatrixMrsTestSuite | |
pest | ??? | ??? | ??? | ??? |
sh-spec | Sherlock Holmes | late 19th century fiction | Conan Doyle, 1892 | https://www.gutenberg.org/files/1661/1661-h/1661-h.htm#chap08 |
sh-spec-r | what's this second one? | |||
trec | "TREC QA Questions (Ninth conference" | Constructed examples? | Can't find this specific event | |
cb | The Cathedral and the Bazaar | technical essay | Raymond, 1999 | http://www.catb.org/~esr/writings/cathedral-bazaar/ |
ec* | E-commerce email (YY) | email (customer service etc) | ||
hike | LOGON | travel brochures | ||
jh* | LOGON | travel brochures | ||
tg* | LOGON | travel brochures | ||
ps* | LOGON | travel brochures | ||
rondane | LOGON | travel brochures | ||
rtc* | ??? | ??? | ??? | ??? |
bcs | "Brown Corpus Sampler (SDP 2015 Task)" | Oepen et al. 2015 | https://aclanthology.org/S15-2153.pdf | |
scm | "SemCor Melbourne Sampler (Disjoint from BCS)" | same as above?.. | ||
vm* | Verbmobil | scheduling dialogues | Is Wahlster 1993 the citation?.. | http://verbmobil.dfki.de/ww.html |
ws* | Wikipedia | Encyclopaedic texts about computational linguistics?.. | anything more we know about them? | |
wlb03 | ??? | ??? | ??? | ??? |
wnb03 | ??? | ??? | ??? | ??? |
peted | "Evaluation By Textual Entailment (Development)" | what does it mean? | ||
petet | "Evaluation By Textual Entailment (Test)" | what does it mean? | ||
ntucle | Something to do with NTU but what? | |||
omw | Open Multilingual Wordnet | ? | http://compling.hss.ntu.edu.sg/omw/ | |
wsj* | Wall Street Journal | News articles | https://catalog.ldc.upenn.edu/LDC93S6A |
CSLI is a rebranding of the legendary HP test suite: https://www.let.rug.nl/nerbonne/papers/Old-Scans/Toward-Eval-NLP-1987.pdf
for WNB and WLB: http://www.lrec-conf.org/proceedings/lrec2012/pdf/774_Paper.pdf
PETE: https://link.springer.com/article/10.1007/s10579-012-9200-5
Many thanks, @oepen . Let me know if you think that this table could go directly into the wiki e.g. into RedwoodsTop.
What about adding some extra information about the size of each treebank? I am particularly interested to know how many sentences we have with golden MRS. Does anyone have this number? Is there any other ERG gold-analyzed treebank besides the data inside the ERG repository under `tsdb/gold`?
In `tsdb/gold` we have 131,401 sentences:
```
% for f in `find . -type f -name item.gz`; do echo $f, `gzcat $f | wc -l`; done | awk 'BEGIN {s=0} {s = s + $2} END {print s}'
131401
```
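For anyone who prefers to do this outside the shell, here is a minimal Python sketch of the same count (my own, not part of any release); it assumes the gold profiles live under `tsdb/gold` in an ERG checkout and simply counts lines in every `item.gz`:

```python
import gzip
from pathlib import Path

# Assumed location of the gold profiles inside an ERG checkout.
GOLD = Path("tsdb/gold")

total = 0
for item_gz in sorted(GOLD.rglob("item.gz")):
    with gzip.open(item_gz, "rt", encoding="utf-8", errors="replace") as f:
        # Every line of the item relation is one input sentence.
        total += sum(1 for _ in f)

print(total)  # should match the 131,401 reported above (modulo ERG version)
```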
Two profiles are 'virtual': wescience and redwoods. But redwoods mentions profiles that do not exist in the `tsdb/gold` folder:
Questions:
@oepen, is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?
Finally, there are sentences duplicated in the profiles:
```
% for f in */item.gz; do gzcat $f | awk -F "@" '{print $7}' >> sentences; done
% sort sentences | sort | uniq | wc -l
105820
```
some examples:
```
% sort sentences | sort | uniq -c | sort -nr | head -20
3288 MIME-Version: 1.0
3288 Content-Type: text/plain; charset=iso-8859-1
3288 Content-Transfer-Encoding: 8bit
303 Message-ID: <1043735849\smv.stanford.edu>
301 Message-ID: <1043735850\smv.stanford.edu>
300 Message-ID: <1043735851\smv.stanford.edu>
295 Message-ID: <1043735854\smv.stanford.edu>
295 Message-ID: <1043735852\smv.stanford.edu>
294 Message-ID: <1043735855\smv.stanford.edu>
292 Message-ID: <1043735853\smv.stanford.edu>
290 Message-ID: <1043735857\smv.stanford.edu>
289 Message-ID: <1043735858\smv.stanford.edu>
289 Message-ID: <1043735856\smv.stanford.edu>
275 Message-ID: <1043735848\smv.stanford.edu>
268 okay.
227 From: stefan\syy.com
204 From: dan\syy.com
202 From: monique\syy.com
200 From: remy\syy.com
200 From: brian\syy.com
```
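The same duplicate check can be sketched in Python; this mirrors the awk pipeline above, naively splitting each line of `item.gz` on the "@" field separator and taking field 7 (which I assume is `i-input` in the standard item relation), with the `tsdb/gold` path assumed as before:

```python
import gzip
from collections import Counter
from pathlib import Path

# Assumed location of the gold profiles, as above.
GOLD = Path("tsdb/gold")

sentences = Counter()
for item_gz in sorted(GOLD.glob("*/item.gz")):
    with gzip.open(item_gz, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("@")
            if len(fields) > 6:
                # Field 7 of the item relation (index 6) is assumed to be i-input.
                sentences[fields[6]] += 1

print(len(sentences))  # number of distinct sentences (105,820 in the run above)
for text, count in sentences.most_common(20):
    print(count, text)  # the most frequently repeated inputs
```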
> What about adding some extra information about the size of each treebank? I am particularly interested to know how many sentences we have with golden MRS. Does anyone have this number?
Alex, the redwoods.xlsx file (which you can find in the release) has the sentence numbers!
I found a link to the redwoods.xls file at https://github.com/delph-in/docs/wiki/RedwoodsTop, but that page points to http://svn.delph-in.net/erg/tags/1214/etc/redwoods.xls. In the etc folder of the ERG trunk in the repository, I found a newer version of this file.
If I am reading it right, we have 97,286 sentences fully disambiguated in the redwoods collection, right? Still more than the 59,255 AMR sentences, but a less impressive number. Is this the actual number of sentences with gold MRS that we have available? And what is the status of the sentences in the profiles not included in redwoods?
I noticed that the `sh-spband-r` profile is not listed in the redwoods.xls spreadsheet. What is this?
> is the CCS event the precursor of http://mrp.nlpl.eu/2020/index.php?page=14#companion? If so, what is the origin of the EDS data in the MRP datasets?
broadly speaking, i guess one could say that CCS (and a series of additional meetings in a similar spirit) was part of the build-up for the MRP shared tasks. but one could just as well say that the desire to compare different frameworks and specific analyses has been a motivating force for dan, emily, myself, and others for at least the past decade. sitting down to compare individual sentences in great depth (in the CCS spirit) is one technique we have used; the SDP and MRP shared tasks series was a different approach with some of the same underlying motivation.
regarding the EDS data in MRP 2019 and 2020, it comes from the 1214 ERG release, aka DeepBank 1.1.
- Instead of "jh0", "jh1", "jh2", "jh3", "jh4" and "jh5" we have only the profiles "jh", "jhk" and "jku"
- Instead of "tg1" and "tg2" we have "tg", "tgk" and "tgu"
- Instead of "sc01", "sc02" and "sc03" we have only "scm"
yes, with the transition from the original [incr tsdb()]-based treebanking environment to FFTB, profiles became a lot smaller, seeing as only the packed forest is recorded rather than a 500-best list of full derivations for each input. that meant that dan could undo some sub-divisions of collections that logically belonged together (JH, TG, and SC). post-1214, he concatenated these profiles back together.
So we also have DeepBank in addition to the wescience and redwoods "virtual" profiles? According to https://github.com/delph-in/docs/wiki/DeepBank it is the `wsj*` profiles. These are released at http://metashare.dfki.de/repository/browse/deepbank/d550713c0bd211e38e2e003048d082a41c57b04b11e146f1887ceb7158e2038c/ and sum up to 43,541 sentences. But I suppose the `wsj*` profiles in the ERG repository are updated with the ERG 2020 release, so the META-SHARE data is outdated.
I suggest that we expand the section about the datasets that constitute the ERG treebanks: https://github.com/delph-in/docs/wiki/RedwoodsTop
Currently, the wiki page refers the reader to Flickinger 2011, but that work is not easily available online (I don't think?). Furthermore, even if one has it, it is still not fully obvious how to map the datasets described there to the files in the ERG release (for some it is obvious, for others it is not).