Closed: andreasvc closed this issue 6 years ago
The XML used to generate the entries in the anthology is in turn generated by a script that uses SoftConf data as a starting point. As far as I can tell, either there is an error in the script or C18.xml was created in some other way. There should not be any
So it seems this issue is actually worse than I thought and a large number of files are affected: a lot of people have their name recorded incorrectly but never complained. I will still try to fix this as soon as possible, but it will probably require writing some code.
Thanks for bringing this to our attention.
Produced a simple fix by just moving the
Hi all:
I will try to check it next week. I have no problem if Martin can do the
PR review and merge it if it doesn't cause any problems. The
Cheers,
Min
-- Min-Yen KAN (Dr) :: Associate Professor :: National University of Singapore :: NUS School of Computing, AS6 05-12, 13 Computing Drive Singapore 117417 :: +65 6516 1885(DID) :: +65 6779 4580 (Fax) :: kanmy@comp.nus.edu.sg (E) :: www.comp.nus.edu.sg/~kanmy (W)
On Mon, Aug 27, 2018 at 10:54 AM Christoph Teichmann <notifications@github.com> wrote:
Produced a simple fix by just moving the <von> part into the last name; the associated pull request is #75 (https://github.com/acl-org/acl-anthology/pull/75). @villalbamartin and @knmnyn, can you verify that the changes are correct and then approve the request? Then @villalbamartin could re-do the ingestion with the changed files and the problem should disappear.
Martín just agreed to do the review and test things on our test server. If everything works then he will approve the pull request and try to incorporate things on the production server.
I think everything is fixed in the scripts and the .xml files now, and we just need to rebuild the database on the production server. Martin, could you do that please?
Sorry, this one slipped under my radar. I'll get to it now.
After a couple days of database crunching, I've now rebuilt a copy of the database and I'm ready to switch the server from one to the other. I double checked the data to make sure everything was right, and found the following bug.
Taking "Andreas van Cranenburgh" as an example, we unified most entries like so:
<author><first>Andreas</first><last>van Cranenburgh</last></author>
But there is a conference entry (L12.xml) that reads
<author>Andreas van Cranenburgh</author>
As a result, the first type of entry created a record in the people table that reads
5415 | Andreas | van Cranenburgh | Andreas van Cranenburgh
While the second one created a record that reads
19861 | Andreas van | Cranenburgh | Andreas van Cranenburgh
A quick SQL query reveals 262 names with the same issue. You can run the following query to get all names:
select full_name, count(full_name) from people group by full_name having count(full_name) > 1 order by full_name asc;
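For anyone who wants to try the duplicate check without access to the server, here is a minimal sketch that runs the same query against a toy in-memory SQLite table. Only the people table and its full_name column come from the query above; the other column names, the helper name duplicated_full_names, and the "Jane Doe" row are assumptions made up for illustration, and the real database may well use a different engine and schema.

```python
import sqlite3


def duplicated_full_names(conn):
    """Return (full_name, count) rows for every name stored more than once."""
    query = (
        "SELECT full_name, COUNT(full_name) FROM people "
        "GROUP BY full_name HAVING COUNT(full_name) > 1 "
        "ORDER BY full_name ASC"
    )
    return conn.execute(query).fetchall()


# Toy table: column names other than full_name are assumed, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, full_name TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        # The two conflicting records quoted in this thread.
        (5415, "Andreas", "van Cranenburgh", "Andreas van Cranenburgh"),
        (19861, "Andreas van", "Cranenburgh", "Andreas van Cranenburgh"),
        # Hypothetical non-duplicated row.
        (1, "Jane", "Doe", "Jane Doe"),
    ],
)

duplicates = duplicated_full_names(conn)
print(duplicates)  # [('Andreas van Cranenburgh', 2)]
```

Grouping on full_name catches this particular bug because both records share the same full name and differ only in where the first/last split falls.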
This issue is already present in the production database, so we could in theory switch to the "less wrong" version and fix the bug later, or we can wait until we solve it once and for all.
I just talked to @villalbamartin and unless we get some input to the contrary we will proceed as follows:
@villalbamartin will put the version of the anthology with the improvements that we have made to the xml so far online (fixing the problem with most of the
I committed a Perl script that addresses the problem of normalizing names. It adds <first> and <last> tags to the XML files that don't have them by processing the names according to BibTeX rules. I think running this script and then rebuilding the database should fix the problem.
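To illustrate the idea (this is not the committed Perl script), here is a rough Python sketch of a simplified BibTeX-style split for names written without a comma: the last name starts at the first lowercase-initial token, otherwise it is the final token. The function names split_name and naive_split are hypothetical, and naive_split is only an assumption about how a record like "Andreas van | Cranenburgh" could arise. Full BibTeX rules also cover "Last, First" forms, braces, and "Jr" parts, which are left out here.

```python
def naive_split(full_name):
    """Split on the last space: everything before it is treated as first name."""
    first, _, last = full_name.strip().rpartition(" ")
    return (first, last)


def split_name(full_name):
    """Simplified BibTeX-style split for 'First von Last' names without a comma.

    The last name starts at the first token that begins with a lowercase
    letter (the "von" part); if there is none, it is the final token.
    """
    tokens = full_name.split()
    if not tokens:
        return ("", "")
    # The final token always belongs to the last name, so only scan the others.
    for i, token in enumerate(tokens[:-1]):
        if token[0].islower():
            return (" ".join(tokens[:i]), " ".join(tokens[i:]))
    return (" ".join(tokens[:-1]), tokens[-1])


print(naive_split("Andreas van Cranenburgh"))  # ('Andreas van', 'Cranenburgh')
print(split_name("Andreas van Cranenburgh"))   # ('Andreas', 'van Cranenburgh')
```

Keeping the "von" part inside the last name matches the unified <first>/<last> form shown earlier in the thread, so no separate <von> tag is needed.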
I have now run the script, regenerated the testing database, and inspected the results.
The script removed 43 duplicates (which is awesome) and introduced 4 new ones (which is slightly less awesome). The new ones are
The original issue, however, seems to be fixed. I have now run the script on the testing server and everything seems fine with respect to the original report, so we can soon close this issue and move the discussion about what to do with duplicates to #86.
The file referenced in the original bug report still needs to be regenerated.
https://aclanthology.coli.uni-saarland.de/papers/C18-2009/c18-2009.bib
The page for Coling 2018 also does not include the "van", though it seems to be correct in the .xml import file.
The updated database is up and the search has been updated. The .bib files have been regenerated, and all other XML file formats will follow eventually (a couple hours).
This should be the end of this bug. Once the process is done, I'll close the bug.
Compare: http://aclweb.org/anthology/C18-2009.bib
With: https://aclanthology.coli.uni-saarland.de/papers/C18-2009/c18-2009.bib
The second one lists my name as "Cranenburgh, Andreas" instead of what I intended, "van Cranenburgh, Andreas". This error also appears on the website at https://aclanthology.coli.uni-saarland.de/events/coling-2018#C18-2
I grep'ed the repo and discovered that in import/C18.xml my name is listed as:
<author><first>Andreas</first><von>van</von><last>Cranenburgh</last></author>
So the problem is that the material in the <von> tag is not being rendered.
IMHO it is better to avoid the complexity of making a special case of <von>. It raises many questions of how it should be handled, and these may differ by locale. Since I don't expect everyone to know my native language's name sorting rules, I like to keep the "van" as a fixed part of my last name, and have my name sorted under "v", not "C".
Where does the <von> tag come from? I don't see it as a separate field in my SoftConf profile. Is there a way I can ensure my name is listed as <last>van Cranenburgh</last>? Curiously, this is how my name is listed in https://github.com/acl-org/acl-anthology/blob/master/import/W18.xml