acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Paper labeling in C69.xml #147

Closed mjpost closed 5 years ago

mjpost commented 5 years ago

The C69 issue is a consequence of faulty XML, in my opinion. In the database import script, IDs of the form "x000" are interpreted as proceedings volumes, except for workshops, where IDs of the form "xx00" play that role. In C69.xml, the volume IDs follow the workshop format, so the first nine papers are ignored until a "proper" volume ID (1000) is found.
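
To make the ID rule concrete, here is a minimal sketch of the check described above. It is not the actual import script: the function name `is_volume_id` is hypothetical, and it assumes that "x000" means a four-digit ID ending in three zeros, that "xx00" means one ending in two zeros, and that workshop collections are the ones whose ID starts with "W".

```python
def is_volume_id(collection_id: str, paper_id: str) -> bool:
    """Return True if a four-digit paper ID would be read as a proceedings volume.

    Hypothetical helper, not the real import script.
    Assumption: workshop collections are those whose ID starts with "W".
    """
    if len(paper_id) != 4 or not paper_id.isdigit():
        raise ValueError(f"expected a four-digit ID, got {paper_id!r}")
    if collection_id.startswith("W"):
        # Workshop rule: IDs of the form "xx00" mark a volume.
        return paper_id.endswith("00")
    # General rule: IDs of the form "x000" mark a volume.
    return paper_id.endswith("000")


# C69 is not a workshop collection, so its "0100"-style entries are not
# recognized as volumes; only an ID such as "1000" is.
print(is_volume_id("C69", "0100"))
print(is_volume_id("C69", "1000"))
```

Under these assumptions, every `0N00` entry in C69.xml fails the general rule, which matches the behaviour described: nothing is treated as a volume until an ID like 1000 appears.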

However, IMHO, the actual issue is that each paper has its own proceedings entry, which doesn't seem correct or useful to me. I believe the file should have a single proceedings entry with ID 1000, and the individual papers should be renumbered 0101 -> 1001, 0201 -> 1002, 0301 -> 1003, and so on.

Originally posted by @mbollmann in https://github.com/acl-org/acl-anthology/issues/107#issuecomment-465190070

davidweichiang commented 5 years ago

The index page for C69 (https://www.aclweb.org/anthology/events/coling-1969/) is displayed weirdly, with many papers repeated and many omitted. I know that the numbering is wrong, but even so, should they be displayed like this?

mjpost commented 5 years ago

C69 just needs to be reorganized. Each paper has two entries: one for the preprint (an abstract), and the other for the paper (e.g., https://aclweb.org/anthology/C69-0100.pdf and https://aclweb.org/anthology/C69-0101.pdf).

I suggest that we

So for example:

<volume id="C69">
  <paper id="0100">
    <title>INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS COLING 1969: Preprint No. 1</title>
  </paper>

  <paper id="0101">
    <title>TREE GRAMMARS (= Δ-GRAMMARS)</title>
    <author><last>Mel’čuk</last><first>I. A.</first></author>
    <author><last>Gladky</last><first>A. V.</first></author>
  </paper>

would become

<volume id="C69">
  <paper id="1001">
    <title>TREE GRAMMARS (= Δ-GRAMMARS)</title>
    <author><last>Mel’čuk</last><first>I. A.</first></author>
    <author><last>Gladky</last><first>A. V.</first></author>
  </paper>
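
A rough sketch of how this reorganization could be scripted, assuming the structure in the example: each preprint contributes a cover entry with an ID ending in "00" plus exactly one paper with an ID ending in "01". The helper names `renumber` and `reorganize` are hypothetical, the sample titles are abbreviated placeholders, and the real C69.xml may contain cases this does not cover.

```python
import xml.etree.ElementTree as ET


def renumber(old_id: str) -> str:
    """Map an ID of the form NN01 to "1" plus NN padded to three digits (0101 -> 1001)."""
    if len(old_id) != 4 or not old_id.isdigit() or old_id[2:] != "01":
        # Anything outside the assumed NN01 pattern is flagged, not guessed at.
        raise ValueError(f"unexpected paper ID: {old_id!r}")
    return f"1{int(old_id[:2]):03d}"


def reorganize(volume: ET.Element) -> None:
    """Drop the per-preprint cover entries and renumber the remaining papers in place."""
    # Iterate over a copy of the child list, since entries are removed along the way.
    for paper in list(volume.findall("paper")):
        paper_id = paper.get("id", "")
        if paper_id.endswith("00"):
            volume.remove(paper)  # cover entry such as 0100
        else:
            paper.set("id", renumber(paper_id))


# Placeholder data modelled on the example above, not the real file contents.
SAMPLE = """<volume id="C69">
  <paper id="0100"><title>Preprint No. 1 (cover entry)</title></paper>
  <paper id="0101"><title>TREE GRAMMARS</title></paper>
  <paper id="0200"><title>Preprint No. 2 (cover entry)</title></paper>
  <paper id="0201"><title>(placeholder title)</title></paper>
</volume>"""

volume = ET.fromstring(SAMPLE)
reorganize(volume)
print([paper.get("id") for paper in volume.findall("paper")])
```

Raising on unexpected IDs is deliberate in this sketch: entries that do not fit the NN01 pattern would need a manual decision.
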

Thoughts? CC: @villalbamartin @danielgildea @mbollmann @davidweichiang

mjpost commented 5 years ago

Update: there are also a number of "post-prints", which appear to be commentary written after the conference, a fascinating idea that should be revived. There is also this document ("Die Mälarinseln und ihre Sehenswürdigkeiten Allgemeines über die Gegend") for @mbollmann. I would retain entries for them and also fold them into the full proceedings.