acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Compact bibfile #3016

Closed mjpost closed 10 months ago

mjpost commented 10 months ago

Overleaf has a 50 MB file size limit, and anthology.bib is now larger than this. We should create a compact BibTeX export using string substitution as suggested here. I'm not sure if this should just replace the current Anthology bib file, or become a new export, say anthology-compact.bib:

I'm therefore inclined to simply replace anthology.bib.

zouharvi commented 10 months ago

I prepared a minified version which drops some fields that I don't think are important in a bibliography, such as editor, month, address, and the duplicate url + doi. https://github.com/zouharvi/anthology-bib-small

Maybe this could be included in your minification so that instead of having the same issue in 2026, we have it in 2028, if LaTeX/Overleaf is still around by then?

That said, this is a very opinionated change to remove all these fields. However, except for the address, the output in ACL Natbib style would be the same, I believe.

mjpost commented 10 months ago

Thanks for the effort and the initiative! As you suspected, though, we can't officially sanction this because it removes information that is important for complete and correct citations. This issue comes up, for example, in tenure and promotion considerations. We recently added the editor field (#2706) which probably contributes to the increased file size.

The approach I think we should take would be to use the string substitutions suggested here.

chikiulo commented 10 months ago

Another possible solution is to split the ever-growing anthology.bib into multiple files of at most 50 MB each, with permalinks anthology-1.bib, anthology-2.bib, and so on.
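To make the size-based split concrete, here is a minimal sketch of greedy packing. It assumes every BibTeX entry starts with `@` at the beginning of a line; the function names `split_entries` and `pack_by_size` are hypothetical and not part of the Anthology codebase.

```python
import re


def split_entries(bib_text: str) -> list[str]:
    """Split a .bib file into entries.

    Assumption: every entry starts with "@" at the beginning of a line.
    """
    parts = re.split(r"(?m)^(?=@)", bib_text)
    return [p for p in parts if p.strip()]


def pack_by_size(entries: list[str], limit_bytes: int) -> list[str]:
    """Greedily pack entries, in order, into chunks of at most limit_bytes.

    A single entry larger than the limit still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for entry in entries:
        n = len(entry.encode("utf-8"))  # the limit applies to bytes on disk
        if current and size + n > limit_bytes:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(entry)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


if __name__ == "__main__":
    text = "@a{x,\n}\n@b{y,\n}\n@c{z,\n}\n"
    for i, chunk in enumerate(pack_by_size(split_entries(text), 20), start=1):
        print(f"anthology-{i}.bib: {len(chunk.encode('utf-8'))} bytes")
```

Because entries are packed in file order, an entry's chunk can change whenever earlier entries are added or edited, so a citation key is not guaranteed to stay in the same numbered file.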

mjpost commented 10 months ago

It would be quick to implement, but I dislike the idea of producing multiple files when the simplicity of a single one is within reach. It would also be hard to undo this once people became dependent on it.

I wonder if I could interest anyone in rolling up their sleeves and implementing the string-based approach suggested above. Here is a sketch of what it would look like:

For example,

@proceedings{yrrsds-2023-young,
    title = "Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems",
    editor = "Hudecek, Vojtech  and
      Schmidtova, Patricia  and
      Dinkar, Tanvi  and
      Chiyah-Garcia, Javier  and
      Sieinska, Weronika",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.yrrsds-1.0",
}

will become

@string{YRRSDS:2023:1 = {Proceedings of the 19th Annual Meeting of the Young Reseachers' Roundtable on Spoken Dialogue Systems}}
@proceedings{yrrsds-2023-young,
    title = YRRSDS:2023:1,
    editor = "Hudecek, Vojtech  and
      Schmidtova, Patricia  and
      Dinkar, Tanvi  and
      Chiyah-Garcia, Javier  and
      Sieinska, Weronika",
    month = sep,
    year = "2023",
    address = "Prague, Czechia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.yrrsds-1.0",
}
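
The transformation above could be automated roughly as follows. This is a minimal sketch, not the implementation in #3045: it only handles `title` and `booktitle` values that fit on one line, it only introduces a macro when a value occurs at least twice, and the macro names (`T1`, `T2`, ...) are a placeholder scheme rather than the volume-based names shown above.

```python
import re
from collections import Counter

# Matches single-line fields such as:    booktitle = "Some Proceedings",
# Multi-line values are left untouched by this sketch.
FIELD = re.compile(r'^(\s*)(booktitle|title)\s*=\s*"(.*)",$')


def compact(bib_text: str, min_count: int = 2) -> str:
    """Replace repeated title/booktitle values with @string macros."""
    lines = bib_text.splitlines()

    # First pass: count how often each value occurs.
    counts = Counter(m.group(3) for m in map(FIELD.match, lines) if m)

    # Assign a macro name to every value that repeats (hypothetical naming).
    macros = {}
    for value, n in counts.items():
        if n >= min_count:
            macros[value] = f"T{len(macros) + 1}"

    # Emit the macro definitions first, then the rewritten entries.
    out = [f"@string{{{name} = {{{value}}}}}" for value, name in macros.items()]
    for line in lines:
        m = FIELD.match(line)
        if m and m.group(3) in macros:
            indent, field, value = m.groups()
            line = f"{indent}{field} = {macros[value]},"
        out.append(line)
    return "\n".join(out) + "\n"
```

The savings come from paper entries: each paper repeats its volume title in `booktitle`, so a volume with many papers stores the long title once instead of once per paper.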

mjpost commented 10 months ago

Okay, I went ahead and did this: #3045.

zouharvi commented 10 months ago

> Another possible solution is to split the ever-growing anthology.bib into multiple files of at most 50 MB each, with permalinks anthology-1.bib, anthology-2.bib, and so on.

Wouldn't anthology-2022.bib (for <=2022) and anthology-2030.bib (for <=2030) be more logical units? Or something else based on years. I usually know in which year the paper I am looking for was published. Now it would be a matter of just making sure that the particular year is included in \bibliography{anthology-2022,anthology-2030}.

This would indeed split the files (in logical units), but the entries would still be self-contained.
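
A year-based split along these lines might look like the sketch below. It assumes each entry starts with `@` at the beginning of a line and carries a four-digit `year` field, and it reads "anthology-2030.bib (for <=2030)" as covering only the years after the previous threshold, so the bins do not overlap. The function name `bin_by_year` is hypothetical.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Matches e.g.:    year = "2023",
YEAR = re.compile(r'^\s*year\s*=\s*"?(\d{4})"?', re.MULTILINE)


def bin_by_year(bib_text: str, thresholds: list[int]) -> dict[str, str]:
    """Assign each entry to the first threshold year that covers it.

    With thresholds [2022, 2030], entries up to 2022 go to
    anthology-2022.bib and entries from 2023 to 2030 go to anthology-2030.bib.
    Entries without a year field (for example @string definitions) are skipped.
    """
    thresholds = sorted(thresholds)
    bins = defaultdict(list)
    for entry in re.split(r"(?m)^(?=@)", bib_text):
        m = YEAR.search(entry)
        if not m:
            continue
        year = int(m.group(1))
        idx = bisect_left(thresholds, year)
        if idx == len(thresholds):
            raise ValueError(f"no bin covers year {year}")
        bins[f"anthology-{thresholds[idx]}.bib"].append(entry)
    return {name: "".join(entries) for name, entries in bins.items()}
```

Unlike a split by file size, the bin for an entry depends only on its own year, so older files stay stable as new papers are added.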

chikiulo commented 10 months ago

>> Another possible solution is to split the ever-growing anthology.bib into multiple files of at most 50 MB each, with permalinks anthology-1.bib, anthology-2.bib, and so on.

> Wouldn't anthology-2022.bib (for <=2022) and anthology-2030.bib (for <=2030) be more logical units? Or something else based on years. I usually know in which year the paper I am looking for was published. Now it would be a matter of just making sure that the particular year is included in \bibliography{anthology-2022,anthology-2030}.

> This would indeed split the files (in logical units), but the entries would still be self-contained.

Agree with @zouharvi that splitting by a certain year threshold is a better fix.

I am not opposing @mjpost's string substitution approach, but I think the splitting approach is complementary and more future-proof. After all, there will be a day when the string substitution approach can no longer compress the bib file to under 50 MB.

mbollmann commented 10 months ago

> I am not opposing @mjpost's string substitution approach, but I think the splitting approach is complementary and more future-proof. After all, there will be a day when the string substitution approach can no longer compress the bib file to under 50 MB.

Considering that the compressed bib file @mjpost shared in #3045 is still 43MB, I think that day will come sooner rather than later, especially since ~40% of that is from papers published in the last five years. I agree that considering a split by year of publication makes a lot of sense!
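
As a back-of-the-envelope check on "sooner rather than later", the figures above can be turned into a rough estimate. The sketch assumes the file keeps growing at the average rate of the last five years, which is an assumption and not something measured in this thread; if growth keeps accelerating, the limit would be reached earlier.

```python
size_mb = 43.0       # compacted file size reported above
limit_mb = 50.0      # Overleaf file size limit
recent_share = 0.40  # approximate share contributed by the last five years
recent_years = 5

# Assumption: future growth equals the average growth of the recent period.
growth_per_year = size_mb * recent_share / recent_years
headroom_mb = limit_mb - size_mb
years_left = headroom_mb / growth_per_year

print(f"{growth_per_year:.2f} MB/year, about {years_left:.1f} years of headroom")
```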

mjpost commented 10 months ago

All right, let's keep this open and pinned, and move to a binned approach in the near future.