acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

<von> part of name not rendered by website #70

Closed andreasvc closed 6 years ago

andreasvc commented 6 years ago

Compare: http://aclweb.org/anthology/C18-2009.bib With: https://aclanthology.coli.uni-saarland.de/papers/C18-2009/c18-2009.bib

The second one lists my name as "Cranenburgh, Andreas" instead of the intended "van Cranenburgh, Andreas". This error also appears on the website at https://aclanthology.coli.uni-saarland.de/events/coling-2018#C18-2

I grep'ed the repo and discovered that in import/C18.xml my name is listed as:

<author><first>Andreas</first><von>van</von><last>Cranenburgh</last></author>

So the problem is that the material in the <von> tag is not being rendered.
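To make the symptom concrete, here is a minimal sketch of a renderer that does keep the `<von>` material when formatting a name for a .bib entry. It is only an illustration: the function name `bibtex_name` is hypothetical, and the site's actual rendering code is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def bibtex_name(author_xml: str) -> str:
    """Format one <author> element as 'von Last, First'.

    Hypothetical helper for illustration; not the Anthology's own code.
    """
    author = ET.fromstring(author_xml)
    first = (author.findtext("first") or "").strip()
    von = (author.findtext("von") or "").strip()
    last = (author.findtext("last") or "").strip()
    # Keep the <von> particle in front of the last name instead of dropping it.
    surname = " ".join(part for part in (von, last) if part)
    return f"{surname}, {first}" if first else surname


entry = "<author><first>Andreas</first><von>van</von><last>Cranenburgh</last></author>"
print(bibtex_name(entry))  # van Cranenburgh, Andreas
```

A renderer that reads only `<first>` and `<last>` would print "Cranenburgh, Andreas" for the same input, which matches the behaviour reported here.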

IMHO it is better to avoid the complexity of special-casing <von>: it raises many questions about how it should be handled, and the answers may differ by locale. Since I don't expect everyone to know my native language's name sorting rules, I like to keep the "van" as a fixed part of my last name, and have my name sorted under "v", not "C".

Where does the <von> tag come from? I don't see it as a separate field in my SoftConf profile. Is there a way I can ensure my name is listed as <last>van Cranenburgh</last>? Curiously, this is how my name is listed in https://github.com/acl-org/acl-anthology/blob/master/import/W18.xml

CTNLP commented 6 years ago

The xml used to generate the entries in the anthology is in turn generated by a script that uses softconf data as a starting point. As far as I can tell, either there is an error in the script or C18.xml was created in some other way. There should not be any <von> tags in the xml. There are only 6 occurrences of the <von> tag in the file. I will fix the file in the next few days and then we can use the corrected version to update our database. As a larger issue, we should maybe include some more xml validation before ingestion in the future; that should make it easier to track these issues to their root.
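A pre-ingestion check along these lines could be sketched as follows. This is an assumption about what such validation might look like, not an existing script: the allowed set `{"first", "last"}` and the function name `unexpected_author_tags` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: only <first> and <last> are expected inside <author>.
ALLOWED_AUTHOR_CHILDREN = {"first", "last"}


def unexpected_author_tags(xml_text: str) -> Counter:
    """Count child tags of <author> elements that are not in the allowed set."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for author in root.iter("author"):
        for child in author:
            if child.tag not in ALLOWED_AUTHOR_CHILDREN:
                counts[child.tag] += 1
    return counts


sample = """<volume>
  <paper>
    <author><first>Andreas</first><von>van</von><last>Cranenburgh</last></author>
    <author><first>Jane</first><last>Doe</last></author>
  </paper>
</volume>"""

print(dict(unexpected_author_tags(sample)))  # {'von': 1}
```

Running something like this over every import file before ingestion would flag stray tags such as `<von>` per file, rather than leaving them to be discovered on the rendered site.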

CTNLP commented 6 years ago

So it seems this issue is actually worse than I thought and a large number of files are affected: it seems that there are a lot of people who have their name recorded incorrectly but never complained. I will still try to fix this asap, but it will probably require writing some code.

Thanks for bringing this to our attention.

CTNLP commented 6 years ago

Produced a simple fix by just moving the <von> part into the last name; the associated pull request is #75. @villalbamartin and @knmnyn can you verify that the changes are correct and then approve the request? Then @villalbamartin could re-do the ingestion with the changed files and the problem should disappear.
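For readers who want to see the shape of such a fix, here is a rough sketch of folding `<von>` into `<last>`. It is not the code from the pull request; the function name `merge_von_into_last` is hypothetical, and entries without a `<last>` are deliberately left untouched for manual review.

```python
import xml.etree.ElementTree as ET


def merge_von_into_last(root: ET.Element) -> int:
    """Prepend each author's <von> text to <last>, drop <von>, return the number changed."""
    changed = 0
    # Materialize the list first so the tree is not modified while being traversed.
    for author in list(root.iter("author")):
        von = author.find("von")
        last = author.find("last")
        if von is None or last is None:
            continue  # nothing to merge, or an unusual entry to inspect by hand
        von_text = (von.text or "").strip()
        last_text = (last.text or "").strip()
        last.text = f"{von_text} {last_text}".strip()
        author.remove(von)
        changed += 1
    return changed


sample = (
    "<volume><paper><author><first>Andreas</first>"
    "<von>van</von><last>Cranenburgh</last></author></paper></volume>"
)
root = ET.fromstring(sample)
print(merge_von_into_last(root))  # 1
print(ET.tostring(root, encoding="unicode"))
```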

knmnyn commented 6 years ago

Hi all:

I will try to check it next week. I have no problem if Martin can do the PR review and merge it if it doesn't cause any problems. The <von> field definitely was supposed to be a feature when it was originally introduced by Steven Bird, but it hasn't been widely used and has caused misunderstanding and uneven usage before. I agree it should probably be deprecated.

Cheers,

Min

-- Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) :: kanmy@comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)

CTNLP commented 6 years ago

Martín just agreed to do the review and test things on our test server. If everything works then he will approve the pull request and try to incorporate things on the production server.

danielgildea commented 6 years ago

I think everything is fixed in the scripts and the .xml files now, and we just need to rebuild the database on the production server. Martin, could you do that please?

villalbamartin commented 6 years ago

Sorry, this one slipped under my radar. I'll get to it now.

villalbamartin commented 6 years ago

After a couple days of database crunching, I've now rebuilt a copy of the database and I'm ready to switch the server from one to the other. I double checked the data to make sure everything was right, and found the following bug.

Taking "Andreas van Cranenburgh" as an example, we unified most entries like so:

<author><first>Andreas</first><last>van Cranenburgh</last></author>

But there is a conference entry (L12.xml) that reads

<author>Andreas van Cranenburgh</author>

As a result, the first type of entry created a record in the people table that reads

5415 | Andreas | van Cranenburgh | Andreas van Cranenburgh

While the second one created a record that reads

19861 | Andreas van | Cranenburgh | Andreas van Cranenburgh

A quick SQL query reveals 262 names with the same issue. You can run the following query to get all names:

select full_name, count(full_name) from people group by full_name having count(full_name) > 1 order by full_name asc;
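The query can be tried on a toy table as below. This is a sketch using an in-memory SQLite database: only the table name `people` and the column `full_name` come from the query above, while the other column names and the "Jane Doe" row are made up for illustration, and the production database may well use a different engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only `people` and `full_name` appear in the original query.
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, full_name TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (5415, "Andreas", "van Cranenburgh", "Andreas van Cranenburgh"),
        (19861, "Andreas van", "Cranenburgh", "Andreas van Cranenburgh"),
        (1, "Jane", "Doe", "Jane Doe"),  # made-up row with no duplicate
    ],
)

duplicates = conn.execute(
    "SELECT full_name, COUNT(full_name) FROM people "
    "GROUP BY full_name HAVING COUNT(full_name) > 1 "
    "ORDER BY full_name ASC"
).fetchall()
print(duplicates)  # [('Andreas van Cranenburgh', 2)]
```

The two records share a `full_name` but split it differently into first and last name, so grouping on `full_name` is what exposes them as the same person.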

This issue is already present in the production database, so we could in theory switch to the "less wrong" version and fix the bug later, or we can wait until we solve it once and for all.

CTNLP commented 6 years ago

I just talked to @villalbamartin and unless we get some input to the contrary we will proceed as follows:

@villalbamartin will put the version of the anthology with the improvements that we have made to the xml so far online (fixing the problem with most of the names). In the next few weeks I will then take a closer look at the problem that @villalbamartin just discovered and suggest a fix.

danielgildea commented 6 years ago

I committed a perl script that addresses the problem of normalizing names. It adds <first> and <last> tags to the xml files that don't have them by processing the names according to bibtex rules. I think running this script and then rebuilding the database should fix the problem.
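For illustration, the kind of splitting involved can be sketched in Python. This is a simplified approximation of BibTeX's "First von Last" and "Last, First" conventions, not a port of the perl script: it ignores braces and "Jr" parts, and it keeps the lowercase particle inside the last name, as agreed earlier in this thread. The function name `split_name` is hypothetical.

```python
def split_name(full_name: str) -> tuple[str, str]:
    """Split a name into (first, last), keeping lowercase particles with the last name."""
    if "," in full_name:
        # "Last, First" form: everything before the first comma is the last name.
        last, _, first = full_name.partition(",")
        return first.strip(), last.strip()
    tokens = full_name.split()
    if not tokens:
        return "", ""
    # The final token always belongs to the last name. The last name starts
    # at the first lowercase-initial token before it, if there is one.
    lowercase = [i for i, tok in enumerate(tokens[:-1]) if tok[0].islower()]
    start = lowercase[0] if lowercase else len(tokens) - 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


print(split_name("Andreas van Cranenburgh"))   # ('Andreas', 'van Cranenburgh')
print(split_name("van Cranenburgh, Andreas"))  # ('Andreas', 'van Cranenburgh')
print(split_name("Min-Yen Kan"))               # ('Min-Yen', 'Kan')
```

With a rule like this, an untagged `<author>Andreas van Cranenburgh</author>` entry would yield the same first and last name as the tagged entries, so both would map to one record.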

villalbamartin commented 6 years ago

I have now run the script, regenerated the testing database, and inspected the results.

The script removed 43 duplicates (which is awesome) and introduced 4 new ones (which is slightly less awesome). The new ones are

The original issue, however, seems to be fixed. I have now run the script on the testing server and everything seems fine with respect to the original report, so we can soon close this issue and move discussion about what to do with duplicates to #86.

danielgildea commented 6 years ago

The file referenced in the original bug report still needs to be regenerated.

https://aclanthology.coli.uni-saarland.de/papers/C18-2009/c18-2009.bib

The page for Coling 2018 also does not include the "van", though it seems to be correct in the .xml import file.

villalbamartin commented 6 years ago

The updated database is up and the search has been updated. The .bib files have been regenerated, and all other XML file formats will follow eventually (a couple hours).

This should be the end of this bug. Once the process is done, I'll close the bug.