BibTeX cleaner

mjpost commented 5 years ago

It would be great to have a pip tool (acl-cleaner?) that reads a bibtex file and replaces entries with canonical Anthology ones, perhaps with confirmation. It would keep the original cite key so that it wouldn't break the compile. The matching shouldn't be too hard, I wouldn't think—some kind of case-insensitive fuzzy match of the title, author list length, and author names.
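A minimal sketch of the kind of match described here, assuming entries are plain dicts with "title" and "author" fields. The function names and the 0.7/0.3 weighting are illustrative assumptions, not part of any existing tool; the title similarity uses the standard library's difflib.

```python
import difflib
import re


def normalize(text):
    """Lowercase, drop BibTeX braces and punctuation, collapse whitespace."""
    text = re.sub(r"[{}]", "", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def split_authors(author_field):
    """Split a BibTeX author field on its ' and ' separator."""
    return [a.strip() for a in re.split(r"\s+and\s+", author_field) if a.strip()]


def surname(name):
    """Return the family name for 'Last, First' or 'First Last' forms."""
    if "," in name:
        return normalize(name.split(",")[0])
    parts = normalize(name).split()
    return parts[-1] if parts else ""


def match_score(entry, candidate):
    """Return a similarity in [0, 1]; 0 if the author lists differ in length."""
    a_authors = split_authors(entry.get("author", ""))
    b_authors = split_authors(candidate.get("author", ""))
    if len(a_authors) != len(b_authors):
        return 0.0
    title_sim = difflib.SequenceMatcher(
        None,
        normalize(entry.get("title", "")),
        normalize(candidate.get("title", "")),
    ).ratio()
    if not a_authors:
        return title_sim
    # Fraction of positions where the family names agree (hypothetical weighting).
    same = sum(surname(a) == surname(b) for a, b in zip(a_authors, b_authors))
    return 0.7 * title_sim + 0.3 * same / len(a_authors)
```

Comparing family names only is a deliberate simplification: it tolerates "Last, First" versus "First Last" ordering and abbreviated given names, at the cost of occasionally conflating authors who share a surname.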

Olamyy commented 5 years ago

Hi @mjpost, I'd love to take this up if no one else is looking into it.

akoehn commented 5 years ago

@Olamyy as far as we know no one is, so go ahead!

mjpost commented 5 years ago

Yes, this would be great! This link from @vered1986 may be useful.

Here is a sketch of the API I had in mind. Please take this as a starting point for discussion.

- Usage: bibtex-cleaner [-f] FILE.BIB. By default, it prompts you to confirm matches that aren't 100% matches. With -f, it will overwrite everything above a to-be-determined match threshold.
- It should read and write to the same file.
- Maybe you have better ideas for the name!
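The confirm-or-force behaviour in the usage line could be sketched as below. This is only one reading of the proposal: the function name, the 0.9 default threshold, and the injectable confirm callback are assumptions made for illustration, since the thread leaves the threshold to be determined.

```python
def should_replace(score, force=False, threshold=0.9, confirm=input):
    """Decide whether a candidate match may overwrite the user's entry.

    score     -- match quality in [0, 1], where 1.0 is an exact match
    force     -- mirrors the proposed -f flag
    threshold -- placeholder value; the thread has not fixed one
    confirm   -- callable that asks the user and returns their answer
    """
    if score >= 1.0:
        return True  # 100% match: replace without asking
    if force:
        return score >= threshold  # -f: accept anything above the threshold
    answer = confirm(f"Replace with Anthology entry (score {score:.2f})? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing the prompt function in as a parameter keeps the decision logic testable without a terminal.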

Once this is working, we can create a pypi module for it and import the repo under acl-org, if you like.

davidweichiang commented 5 years ago

My suggestions:

- An (optional) way of outputting the short names of conferences instead of the full names.
- Programs that read and write to the same file scare me, probably just because they're not the norm. (fromdos/todos is the only one I can think of.)
- Name: acleanbib

Olamyy commented 5 years ago

Here's the current flow I'm working with:

I converted the zipped anthology file (from running python3 bin/create_bibtex.py) into a single CSV file. For every query, I:

- Read the user's bibtex
- Load the main anthology
- Do a sequential match of columns from the user's bibtex to the main anthology. The match order is:
  - Match the title. If that matches more than one title in the anthology:
    - Match the authors for the returned matches. If that still leaves more than one match:
      - Match the date and month
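The sequential flow above can be sketched as follows, assuming the anthology rows and the user's entry are plain dicts with "title", "author", "year" and "month" keys. The function names are hypothetical, and for simplicity the comparison is exact after lowercasing and whitespace normalization.

```python
def norm(value):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(value).lower().split())


def sequential_match(entry, anthology):
    """Narrow candidates by title, then authors, then year and month.

    As in the described flow, later fields are consulted only to break ties.
    Returns the single remaining row, or None if nothing matches or the
    match is still ambiguous.
    """
    title = norm(entry.get("title", ""))
    candidates = [row for row in anthology if norm(row.get("title", "")) == title]
    for fields in (("author",), ("year", "month")):
        if len(candidates) <= 1:
            break
        candidates = [
            row
            for row in candidates
            if all(norm(row.get(f, "")) == norm(entry.get(f, "")) for f in fields)
        ]
    return candidates[0] if len(candidates) == 1 else None
```

Note that with this ordering a unique title match is accepted without looking at the authors at all, which is the main trade-off of tie-breaking compared with requiring every field to agree.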

I imagine this approach would be much easier than having the user confirm matches. But of course, confirming matches would reduce the number of steps that go into matching the entries.

What do you think?

Olamyy commented 5 years ago

@davidweichiang I also agree with writing to a different file from the user's input. If nothing else, one can easily go back, change some things, and rerun the tool.

nschneid commented 5 years ago

> An (optional) way of outputting the short names of conferences instead of the full names.

If you go this route, here's how the Zotero plugin uses the venue information in the Anthology to determine the abbreviation: https://github.com/zotero/translators/blob/99429bd5b2f2f95611911e50e95b910653d097f1/ACLWeb.js#L153-L186

mjpost commented 5 years ago

@davidweichiang: I had the same thought about outputting short names, as a non-default option, perhaps --short|-s.

Writing to a separate output file is a good idea. It would be nice, however, if the reading of the input file were totally separated, so that one could specify the same input and output file without fear of corruption. I would also suggest that STDIN and STDOUT be supported as defaults for input and output (or by using the - convention).
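One way to get that separation, sketched under the assumption that the bib file fits in memory: read the whole input before opening the output, and write through a temporary file that is renamed into place. The helper names are hypothetical; "-" selects STDIN or STDOUT.

```python
import os
import sys
import tempfile


def read_text(path="-"):
    """Read the whole input up front; '-' means standard input."""
    if path == "-":
        return sys.stdin.read()
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def write_text(text, path="-"):
    """Write the output; '-' means standard output."""
    if path == "-":
        sys.stdout.write(text)
        return
    # Write to a temporary file in the target directory, then rename it over
    # the destination, so an interrupted run never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise
```

Because the input is fully read before anything is written, passing the same path for input and output is safe. One caveat: the file created by mkstemp has restrictive permissions, so a real tool might want to copy the original file's mode.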

acleanbib is perfect.

@Olamyy:

This sounds good. A first working version could work on just exact matches, but I would not replace unless you match (a) title, (b) author list, and (c) year. Matching venue would be harder since the names can vary so much by small amounts. Once this is working you could move to fuzzier matching.
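A sketch of that exact-match first version, assuming entries are dicts and that the cite key is stored under "ID" (as bibtexparser 1.x does, to my understanding). All function names are hypothetical. Entries are replaced only when title, author family names, and year all agree with exactly one Anthology record, and the user's cite key is kept.

```python
import re


def _norm(text):
    """Strip BibTeX braces, lowercase, and collapse whitespace."""
    return " ".join(re.sub(r"[{}]", "", str(text)).lower().split())


def _surnames(author_field):
    """Family names, in order, for 'Last, First' or 'First Last' forms."""
    result = []
    for name in re.split(r"\s+and\s+", author_field.strip()):
        if not name:
            continue
        if "," in name:
            result.append(_norm(name.split(",")[0]))
        else:
            parts = _norm(name).split()
            result.append(parts[-1] if parts else "")
    return tuple(result)


def exact_key(entry):
    """Key combining (a) title, (b) author list, and (c) year."""
    return (
        _norm(entry.get("title", "")),
        _surnames(entry.get("author", "")),
        _norm(entry.get("year", "")),
    )


def build_index(anthology):
    """Map each key to the Anthology records that share it."""
    index = {}
    for record in anthology:
        index.setdefault(exact_key(record), []).append(record)
    return index


def canonical_entry(entry, index):
    """Return the Anthology record with the user's cite key, or None."""
    hits = index.get(exact_key(entry), [])
    if len(hits) != 1:
        return None  # no match, or ambiguous: leave the entry alone
    replacement = dict(hits[0])  # copy so the index is not mutated
    replacement["ID"] = entry.get("ID", replacement.get("ID"))
    return replacement
```

Building the index once makes each lookup a dictionary access, which matters when the Anthology side has tens of thousands of records.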

You might find the anthology/Anthology class helpful. You can also download the full anthology bib file from here, caching it to, say, ~/.acleanbib/anthology.bib.gz. We could add an md5 sum so that you could download that and see programmatically if the bib file needed to be re-downloaded.
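The md5 check could look roughly like this. It is a sketch under assumptions: the expected checksum is taken as a parameter because no checksum file exists yet, and the download itself is left out because the URL is not given in this thread. Only the cache path comes from the suggestion above.

```python
import hashlib
import os

CACHE_PATH = os.path.expanduser("~/.acleanbib/anthology.bib.gz")


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file without loading it all at once."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(expected_md5, path=CACHE_PATH):
    """True if the cached file is missing or its checksum is out of date."""
    if not os.path.exists(path):
        return True
    return md5_of_file(path) != expected_md5.strip().lower()
```

The checksum here only detects a stale or truncated cache; it is not a security measure.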

mjpost commented 5 years ago

When we move to the expanded ID #291, we should have a database of

- short code / identifier (e.g., acl, wmt, coling)
- long conference name
- short conference name

that would facilitate this. We could easily expand data/yaml/venues.yaml with this information and then it could also be used here.
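For illustration, such a table might be used as below. The dict layout and the venue_name helper are assumptions; the actual structure of data/yaml/venues.yaml may differ, and the three records are just examples for the codes mentioned above.

```python
# Hypothetical in-memory form of the venue table: code -> long and short names.
VENUES = {
    "acl": {
        "long": "Annual Meeting of the Association for Computational Linguistics",
        "short": "ACL",
    },
    "coling": {
        "long": "International Conference on Computational Linguistics",
        "short": "COLING",
    },
    "wmt": {
        "long": "Conference on Machine Translation",
        "short": "WMT",
    },
}


def venue_name(code, short=False):
    """Look up a venue by its short code; raises KeyError for unknown codes."""
    record = VENUES[code.lower()]
    return record["short" if short else "long"]
```

A cleaner with a --short option could then call venue_name(code, short=True) when writing out the venue.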

Olamyy commented 5 years ago

Here's an initial implementation of the idea. Feedback would be appreciated.

https://github.com/Olamyy/acleanbib

mjpost commented 5 years ago

Awesome! I get this error, though:

Traceback (most recent call last):
  File "/usr/local/bin/acleanbib", line 11, in <module>
    load_entry_point('acl-cleaner', 'console_scripts', 'acleanbib')()
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/Users/post/code/acleanbib/script.py", line 13, in aclbibcleaner
    cleaner = ACLCleaner(bibtex, output)
  File "/Users/post/code/acleanbib/cleaner.py", line 31, in __init__
    self.bibdata = pandas.read_csv(self.anthology_path, compression='zip', low_memory=False)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1853, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 644, in pandas._libs.parsers.TextReader._setup_parser_source
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1222, in __init__
    self._RealGetContents()
  File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1289, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

Also, how can I run it from the local directory without installing (for development)?

Olamyy commented 5 years ago

The error is a bit weird. I can read the zipped file successfully on my end. I re-uploaded in my latest commit. Let me know if it works now.

Also, I added a commit to run it from the local directory by doing python script.py [BIBTEX] [OUTPUT]
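For a BadZipFile error like the one reported above, a quick diagnostic is to check what the file on disk actually is before handing it to pandas. This sketch uses only the standard library's zipfile module; the function name is hypothetical, and it does not claim to identify the cause in this particular case.

```python
import zipfile


def describe_archive(path):
    """Report whether a file is a zip archive and what it holds."""
    if not zipfile.is_zipfile(path):
        # Show the first bytes: they often reveal an HTML error page,
        # a text placeholder, or a truncated download.
        with open(path, "rb") as handle:
            head = handle.read(64)
        return f"not a zip file; first bytes: {head!r}"
    with zipfile.ZipFile(path) as archive:
        return "zip archive containing: " + ", ".join(archive.namelist())
```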

mjpost commented 5 years ago

Thanks! That's better, but now I get this:

$ python3 script.py citation-220873086.bib
Python Version      : 3.7.2
Version tuple: ('3', '7', '2')
Compiler     : Clang 10.0.0 (clang-1000.11.45.5)
Build        : ('default', 'Mar  7 2019 12:42:28')
NO MATCH FOUND FOR Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes.
NO MATCH FOUND FOR Long Short-Term Memory Based Recurrent Neural Network Architectures
for Large Vocabulary Speech Recognition
BibTeX item 'entries' does not exist and will not be written. Valid items are ['entries', 'comments', 'preambles', 'strings'].
BibTeX item 'entries' does not exist and will not be written. Valid items are ['entries', 'comments', 'preambles', 'strings'].

(I can try to debug later but maybe you know what's wrong)

Olamyy commented 5 years ago

Hi @mjpost , I've made updates to suppress the logger from bibtexparser. I also added a reporting output to show the users the outcome of the process. Output is similar to:

Paper	Match Found
Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes.	False
Long Short-Term Memory Based Recurrent Neural Network Architectures
for Large Vocabulary Speech Recognition	False
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)	True
A Multi-media Approach to Cross-lingual Entity Knowledge Transfer	True
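Suppressing a library's log messages, as described above, can be done with the standard logging module. This sketch assumes bibtexparser emits its messages through loggers named "bibtexparser.*" at WARNING level or below; the helper name is hypothetical, and it is not necessarily how the repository does it.

```python
import logging


def quiet_library(name="bibtexparser", level=logging.ERROR):
    """Raise the threshold of a library's logger so warnings are not shown."""
    logger = logging.getLogger(name)
    logger.setLevel(level)  # child loggers such as name + ".x" inherit this
    return logger
```

This hides the messages rather than fixing whatever triggers them, so it is worth confirming separately that the written file contains the expected entries.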

mjpost commented 5 years ago

Hi @Olamyy—awesome—will look at this when I catch a minute!

Olamyy commented 5 years ago

Hi @mjpost, is there any update on your review of this PR?

mjpost commented 5 years ago

Thanks for the ping! I will look today.

akoehn commented 5 years ago

@Olamyy would you mind changing the license to Apache 2.0? There is not really a big difference between MIT and Apache 2.0, but the anthology already uses Apache as a license and having only one license would be easier if the code should be made official.

I unfortunately don't have time to look at the code at the moment.

mjpost commented 5 years ago

I looked at it a bit and created some issues on your repo! This will be really cool once we have it all worked out.

acl-org / acl-anthology

BibTeX cleaner #488