Closed mjpost closed 10 months ago
I prepared a minified version which drops some fields that I don't think are important in a bibliography, such as editor, month, address, and the duplicated url + doi. https://github.com/zouharvi/anthology-bib-small
Maybe this could be included in your minification so that instead of having the same issue in 2026, we have it in 2028, if LaTeX/Overleaf is still around by then?
That said, this is a very opinionated change to remove all these fields. However, except for the address, the output in ACL Natbib style would be the same, I believe.
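For readers who want to try this kind of minification locally, here is a minimal sketch of the field-dropping idea. It is not the code from the linked repository; the function name `drop_fields` is hypothetical, and it assumes the layout used in the Anthology exports, where every field starts on its own line as `name = value` and the entry ends with a lone `}`.

```python
import re

# Assumption: each field begins on its own line as `name = ...`;
# continuation lines of a multi-line value do not look like that.
FIELD_START = re.compile(r"^\s*([A-Za-z]+)\s*=")


def drop_fields(entry, drop=("editor", "month", "address")):
    """Return a copy of one BibTeX entry without the fields named in `drop`."""
    kept = []
    skipping = False
    for line in entry.splitlines():
        match = FIELD_START.match(line)
        if match:
            # A new field starts here: decide whether to skip it.
            skipping = match.group(1).lower() in drop
        elif line.strip() == "}":
            # The closing brace of the entry is always kept.
            skipping = False
        if not skipping:
            kept.append(line)
    return "\n".join(kept)
```

A continuation line that itself happens to look like `word = ...` would confuse this sketch, so a real implementation should use a proper BibTeX parser.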
Thanks for the effort and the initiative! As you suspected, though, we can't officially sanction this because it removes information that is important for complete and correct citations. This issue comes up, for example, in tenure and promotion considerations. We recently added the editor field (#2706) which probably contributes to the increased file size.
The approach I think we should take would be to use the string substitutions suggested here.
Another possible solution is to split the ever-growing anthology.bib into multiple 50MB-each files and have permalinks anthology-1.bib, anthology-2.bib and so on.
It would be quick to implement, but I dislike the idea of producing multiple files when the simplicity of a single one is in reach. It would also be hard to undo this once people became dependent on it.
I wonder if I could interest anyone in rolling up their sleeves and implementing the string-based approach suggested above. Here is a sketch of what it would look like:
Run python3 bin/create_bibtex.py --clean. This will create a directory build/ filled with BibTeX files for every paper in the Anthology, as well as the consolidated variants, including anthology.bib.
Define an @string for every volume, since every paper in a volume has the same booktitle. What's easy about this is that each volume ID is composed of an ASCII venue ID, a four-digit year, and a volume name, which are all permissible characters. For example,
@proceedings{yrrsds-2023-young,
    title = "Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems",
    editor = "Hudecek, Vojtech and
      Schmidtova, Patricia and
      Dinkar, Tanvi and
      Chiyah-Garcia, Javier and
      Sieinska, Weronika",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.yrrsds-1.0",
}
will become
@string{YRRSDS:2023:1 = {Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems}}
@proceedings{yrrsds-2023-young,
    title = YRRSDS:2023:1,
    editor = "Hudecek, Vojtech and
      Schmidtova, Patricia and
      Dinkar, Tanvi and
      Chiyah-Garcia, Javier and
      Sieinska, Weronika",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.yrrsds-1.0",
}
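The transformation above could be sketched roughly as follows. This is not the implementation in bin/create_bibtex.py; `macro_name` and `substitute_title` are hypothetical helpers, and the sketch assumes new-style volume IDs of the form `2023.yrrsds-1` and a field value written on a single line in double quotes.

```python
import re


def macro_name(volume_id):
    """Turn a new-style volume ID such as '2023.yrrsds-1' into 'YRRSDS:2023:1'."""
    year, rest = volume_id.split(".", 1)
    venue, volume = rest.rsplit("-", 1)
    return venue.upper() + ":" + year + ":" + volume


def substitute_title(entry, volume_id, field="title"):
    """Return (string_definition, rewritten_entry) for one BibTeX entry.

    Assumption: the field looks like `title = "...",` on one line.
    """
    pattern = re.compile(
        r'^([ \t]*' + re.escape(field) + r'[ \t]*=[ \t]*)"([^"]*)",[ \t]*$',
        re.MULTILINE,
    )
    match = pattern.search(entry)
    if match is None:
        raise ValueError("no single-line quoted field named " + field)
    name = macro_name(volume_id)
    # The macro carries the long title once; entries refer to it by name.
    definition = "@string{" + name + " = {" + match.group(2) + "}}"
    rewritten = pattern.sub(lambda m: m.group(1) + name + ",", entry, count=1)
    return definition, rewritten
```

For the papers of a volume, the same call with `field="booktitle"` would replace the repeated booktitle, which is presumably where most of the size reduction would come from.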
Okay I went and did this: #3045.
> Another possible solution is to split the ever-growing anthology.bib into multiple 50MB-each files and have permalinks anthology-1.bib, anthology-2.bib and so on.
Wouldn't anthology-2022.bib (for <=2022) and anthology-2030.bib (for <=2030) be more logical units? Or something else based on years. I usually know in which year the paper I am looking for was published. Now it would be a matter of just making sure that the particular year is included in \bibliography{anthology-2022,anthology-2030}.
This would indeed split the files (in logical units), but the entries would still be self-contained.
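A minimal sketch of such year-based binning, under the assumption that each entry carries a four-digit year field and that each file holds the years after the previous cutoff up to its own. The function name `bin_by_year` and the default cutoffs are illustrative only.

```python
import bisect
import re
from collections import defaultdict

# Assumption: every entry has a field like `year = "2023",`.
YEAR = re.compile(r'^\s*year\s*=\s*"?(\d{4})"?', re.MULTILINE)


def bin_by_year(entries, cutoffs=(2022, 2030)):
    """Group BibTeX entry strings into files named after the first cutoff >= year."""
    cutoffs = sorted(cutoffs)
    bins = defaultdict(list)
    for entry in entries:
        match = YEAR.search(entry)
        if match is None:
            raise ValueError("entry has no year field")
        year = int(match.group(1))
        # Index of the first cutoff that is >= year.
        index = bisect.bisect_left(cutoffs, year)
        if index == len(cutoffs):
            raise ValueError("year %d is beyond the last cutoff" % year)
        bins["anthology-%d.bib" % cutoffs[index]].append(entry)
    return dict(bins)
```

Entries stay self-contained in this scheme, so a paper can be cited by listing only the file whose range covers its year.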
Agree with @zouharvi that splitting by a certain year threshold is a better fix.
I am not opposing @mjpost's string substitution approach, but I think the splitting approach is a complement and more future-proof. After all, there will be one day when the string substitution approach can no longer compress the bib file to 50MB.
Considering that the compressed bib file @mjpost shared in #3045 is still 43MB, I think that day will come sooner rather than later, especially since ~40% of that is from papers published in the last five years. I agree that considering a split by year of publication makes a lot of sense!
All right, let's keep this open and pinned, and move to a binned approach in the near future.
Overleaf has a 50 MB file size limit, and anthology.bib is now larger than this. We should create a compact BibTeX export using string substitution as suggested here. I'm not sure if this should just replace the current Anthology bib file, or become a new export, say anthology-compact.bib.
I'm therefore inclined to simply replace anthology.bib.