acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

BibTeX export lacks a month field #94

Closed: logological closed this issue 5 years ago

logological commented 5 years ago

The HTML publication records in the Anthology include a month field, but this field does not appear in the BibTeX export. Please consider including this field in the BibTeX record. Note that, wherever possible, the value of this field should be a standard BibTeX month macro (jan, feb, mar, etc.) and not a string ({January}, "February", etc.) as this allows bibliographies to be localized by the BibTeX style.

mjpost commented 5 years ago

@villalbamartin, I can confirm this is correct. The original bib file for D18-1350 (https://aclweb.org/anthology/D18-1350.bib), for example, has the month field, and this is preserved in the XML (https://github.com/acl-org/acl-anthology/blob/b6eea2b2f1a00f046ee9ae20aa2cc245cd5abfd6/import/D18.xml#L5992), but it is not present in the regenerated bib file (https://aclanthology.coli.uni-saarland.de/papers/D18-1350/d18-1350.bib).

How hard would it be to add the month field to the DB export?

knmnyn commented 5 years ago

Hi Matt, all:

It'd be pretty easy to do, and an #enhancement that you might want to prioritize at some point, either through GitHub Projects, or issue tags or milestones. Softconf also now can export which we are encoding in the XML, but this field is also not ingested in the Anthology DB format yet.

Cheers,

Min

villalbamartin commented 5 years ago

I've looked at the code. The file responsible for the BibTeX export is lib/tasks/bib_export.rake, which first creates an XML file and then uses xml2bib to generate the .bib file (see https://github.com/acl-org/acl-anthology/blob/b6eea2b2f1a00f046ee9ae20aa2cc245cd5abfd6/lib/tasks/bib_export.rake#L288).

I'm not familiar with rexml (which I believe is what the code uses to generate the XML structure to begin with), but I'm willing to believe that it shouldn't be that difficult.

The trickiest part is actually pulling the month from the database. I've run the query SELECT DISTINCT month FROM papers ORDER BY month ASC, and got the following results:

So that part may need to be hand-coded.

mjpost commented 5 years ago

Thanks for this information! It seems the month format is not consistent across entries. But why does this matter? When generating the .bib file for each paper, can't you just use whatever is listed for that paper in the database?

logological commented 5 years ago

> Thanks for this information! It seems the month format is not consistent across entries. But why does this matter? When generating the .bib file for each paper, can't you just use whatever is listed for that paper in the database?

You can, but failure to use the standard BibTeX month macros will be a perpetual annoyance to those who want to use our BibTeX entries in non-English documents. (Or even in English ones if the bibliography style formats months differently than whatever fixed string the database uses. For example, some bibliography styles mandate abbreviated month names like "Sept." instead of "September". BibTeX will handle this automatically when the value of the month field is sep, but not when it is {September} or "September".)
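To make the localization point concrete, here is a minimal Python sketch of what a bibliography style effectively does with the month field. The two lookup tables and the `render_month` helper are hypothetical and only stand in for a style's macro definitions; they are not taken from any real .bst file.

```python
# Hypothetical month tables, standing in for the macro definitions that a
# bibliography style would provide. Not copied from any actual .bst file.
MACROS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

ENGLISH_ABBREV = dict(zip(MACROS, (
    "Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
    "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec.")))

GERMAN = dict(zip(MACROS, (
    "Januar", "Februar", "März", "April", "Mai", "Juni",
    "Juli", "August", "September", "Oktober", "November", "Dezember")))


def render_month(value, is_macro, table):
    """Return the text that would be printed for a month field.

    A macro (e.g. sep) is looked up in the style's table, so the style
    decides how it is printed. A literal string (e.g. {September}) is
    passed through unchanged, whatever the style or language.
    """
    return table[value] if is_macro else value


print(render_month("sep", True, ENGLISH_ABBREV))   # style-controlled
print(render_month("oct", True, GERMAN))           # localized
print(render_month("September", False, GERMAN))    # stuck in English
```

The literal string is the only case in which the style has no say, which is the annoyance described above.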

mjpost commented 5 years ago

Agreed, but these are two separate issues: (a) properly generating the BibTeX files from our authoritative repository and (b) ensuring that the authoritative repository is correct. For (a), we can use a quick fix of just exporting the data that is already there. Going into the future, we should both (b.1) fix the ingest procedures to enforce correctness (#90; https://github.com/acl-org/acl-pub/issues/9) and, longer term, (b.2) fix the current data.

So the question is one of correctness versus completeness: whether it is worthwhile to output the existing month field as-is from the database as a short-term fix, or whether we should wait until the data is fixed. I'm inclined towards the former.

There are also other scattered issues, e.g., hyphens where there should be en dashes in ranges.

villalbamartin commented 5 years ago

I wrote a patch for this. I'm testing it now on the test server, but it will take ~12 hours to finish. Here's my proposed solution:

  1. If the month is not available, we simply output the year.
  2. If the month is available, we take the first 3 lowercase letters and turn this into a month. In this way, a case like "September/December" is assumed to be September. Given that the month field can only contain one month, the only other alternative is to not include a month for publications with a month in this format.
  3. The special case "7-8 June" is handled separately, since it's the only value for which the heuristic from the previous point does not do anything reasonable.
  4. For any other kind of improper input, we only output the year.

Once I've checked that the current code works, I'll publish a pull request.

knmnyn commented 5 years ago

Hi Martín, all:

For one month cases, it should be ok.

For those with more than one month, probably better to stay with just having the current string, as-is.

For the special case "7-8 June", where a date has presumably ended up in the month field, it sounds like that was a coding error. We can remove the date and code it as "June".

What do you think?
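For later readers, the heuristic Martín proposes in this exchange (first three lowercase letters, plus a hand-coded special case) can be sketched in Python roughly as follows. This is not the actual patch, which lives in the Ruby rake task; the names `month_macro` and `SPECIAL_CASES` are made up for illustration.

```python
MONTH_MACROS = ("jan", "feb", "mar", "apr", "may", "jun",
                "jul", "aug", "sep", "oct", "nov", "dec")

# Hand-coded exceptions. "7-8 June" is the only one named in this thread.
SPECIAL_CASES = {"7-8 june": "jun"}


def month_macro(raw):
    """Map a raw database month value to a BibTeX month macro.

    Returns None when no month can be derived, in which case the caller
    would output only the year.
    """
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    if not cleaned:
        return None
    if cleaned in SPECIAL_CASES:
        return SPECIAL_CASES[cleaned]
    # The first three lowercase letters decide the month, so a value such
    # as "September/December" is treated as September.
    prefix = cleaned[:3]
    return prefix if prefix in MONTH_MACROS else None


for value in ("September/December", "7-8 June", "Spring", None):
    print(repr(value), "->", month_macro(value))
```

Returning None rather than guessing keeps the "only output the year" fallback explicit for any input the prefix rule cannot handle.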

villalbamartin commented 5 years ago

Hi Min,

Here's a long description of the challenges, mostly for future reference. Spoiler: we can do it, but it's not as straightforward as simply passing a string.

There are two issues to consider if we want to use a multi-month format, one of "pipeline" and one of what BibTeX parsers expect.

From a pipeline point of view, the current story goes like this:

  1. The year and month fields are retrieved as-is from the database
  2. They are combined in a field dateIssued, which expects a field with format yyyy[-mm]. This information is dumped to a MODS XML file.
  3. All other exported formats (BIB, Word, ENDF, etc.) are obtained from the XML file. The .bib file, for instance, is the result of running the xml2bib utility with the XML file as input.
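Steps 1 and 2 of this pipeline can be sketched in Python as below. The real code is a Ruby rake task using REXML, so this is only an approximation: the helper names are hypothetical, and the MODS namespace and surrounding record structure are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def date_issued(year, month=None):
    """Format a year and an optional month number (1-12) as yyyy[-mm]."""
    if month is None:
        return f"{year:04d}"
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{year:04d}-{month:02d}"


def origin_info(year, month=None):
    """Build a bare <originInfo><dateIssued> fragment (no namespace)."""
    origin = ET.Element("originInfo")
    issued = ET.SubElement(origin, "dateIssued")
    issued.text = date_issued(year, month)
    return origin


print(ET.tostring(origin_info(2018, 10), encoding="unicode"))
print(ET.tostring(origin_info(2018), encoding="unicode"))
```

Because every downstream format is derived from this one field, anything that does not fit yyyy[-mm], such as a month range, has to be handled after the conversion step.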

Therefore, if we want a multi-month entry we would have to change the format of dateIssued. According to the specification we can extend the dateIssued field with start and end attributes. In that case, we delegate to the utilities how that field is handled, but my tests reveal that xml2bib ignores this attribute and simply chooses the first date in the file. We would therefore have to write code to tweak the output file, generating a standard file first and modifying its month field later. What to put in there is the other point that we need to consider.

According to "LaTeX: A document preparation system, User’s guide and reference manual", Appendix B1.3, the month field should contain a three-letter abbreviation of a month which is later translated to the user's language, so our implementation would not be up to standard. This is not a big issue, since I've noted that most .bib files choose to ignore the standard anyway. If we want to be nice for people using LaTeX in languages other than English, we should use this workaround:

month = jun # "\slash " # jul,

Doing the same for the other formats (Word, ENDF, etc.) would require modifying files according to their own formats, but I'm not really sure how many people actually use those.

logological commented 5 years ago

For those with more than one month, probably better to stay with just having the current string, as-is.

For month ranges, the "official" way of doing this is still to use the three-letter macros. In this case, string concatenation can be used. So month = {January--February} should actually be month = jan # "--" # feb.
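A small Python sketch of generating such concatenated values is shown below. The function name `bibtex_month` and its `separator` parameter are assumptions for illustration and are not part of the Anthology code.

```python
MONTH_MACROS = {
    "january": "jan", "february": "feb", "march": "mar", "april": "apr",
    "may": "may", "june": "jun", "july": "jul", "august": "aug",
    "september": "sep", "october": "oct", "november": "nov",
    "december": "dec",
}


def bibtex_month(names, separator="--"):
    """Build a BibTeX month value from one or more English month names.

    A single month yields a bare macro. Several months are joined with
    BibTeX's # concatenation operator and a quoted separator string.
    Unknown month names raise KeyError.
    """
    if not names:
        raise ValueError("at least one month name is required")
    macros = [MONTH_MACROS[name.strip().lower()] for name in names]
    glue = f' # "{separator}" # '
    return glue.join(macros)


print("month = " + bibtex_month(["January", "February"]) + ",")
print("month = " + bibtex_month(["June", "July"], separator="\\slash ") + ",")
```

The value must be written without surrounding braces or quotes in the .bib file, since quoting the whole expression would turn the macros back into literal text.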

mjpost commented 5 years ago

Thanks, all. @villalbamartin, a PR will be a great way to continue the conversation.

villalbamartin commented 5 years ago

@mjpost I've done that now. As a side note, issue #107 was created because I was worried that the function date_formatter I wrote would misbehave for strange input. Luckily, it seems all of our input is well-behaved.

mbollmann commented 5 years ago

My recent commit introduced the three-letter macros for the month field. Can someone double-check if there's anything left to be done on this, or if it can be closed?

mjpost commented 5 years ago

This looks good to me. I parsed the whole anthology.bib.gz file with Pybtex and it loaded fine, and entries for conferences that span the month boundary also look good (e.g., N15-4002).

Thanks again to @logological for providing detailed information on the BibTeX spec!
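As a lighter-weight complement to a full parse, a month value can also be checked against the macro form with the standard library alone. The sketch below is not Pybtex and does not parse BibTeX; it is a rough regular-expression check, with a hypothetical helper name, that accepts a month macro optionally followed by # concatenations of macros or double-quoted strings.

```python
import re

MACRO = r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)"
QUOTED = r'"[^"]*"'

# A macro, then zero or more "# macro" or "# "string"" continuations.
MONTH_VALUE = re.compile(rf"{MACRO}(?:\s*#\s*(?:{MACRO}|{QUOTED}))*")


def is_macro_month(value):
    """Return True if value looks like a macro-based month expression."""
    return MONTH_VALUE.fullmatch(value.strip()) is not None


for candidate in ('jun', 'may # "--" # jun', '{June}', '"June"'):
    print(candidate, "->", is_macro_month(candidate))
```

Values that begin with a quoted string are rejected by this check even though BibTeX would accept them, so it is deliberately stricter than the format itself.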