Open mjpost opened 5 years ago
Hi @mjpost, I'd love to take this up if no one else is looking into it.
@Olamyy as far as we know no one is, so go ahead!
Yes, this would be great! This link from @vered1986 may be useful.
Here is a sketch of the API I had in mind. Please take this as a starting point for discussion.
bibtex-cleaner [-f] FILE.BIB

By default, it prompts you to confirm matches that aren't 100% matches. With -f, it overwrites everything above a to-be-determined match threshold. Once this is working, we can create a PyPI module for it and import the repo under acl-org, if you like.
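The CLI sketch above could be mocked up with `argparse`; the long-flag name and help strings below are assumptions for illustration, not the actual implementation:

```python
import argparse

def build_parser():
    # Hypothetical parser mirroring the sketch: one positional .bib file
    # and a -f flag to auto-accept matches above the threshold.
    parser = argparse.ArgumentParser(prog="bibtex-cleaner")
    parser.add_argument("bibfile", help="input BibTeX file to clean")
    parser.add_argument(
        "-f", "--force", action="store_true",
        help="overwrite entries above the match threshold without prompting",
    )
    return parser

# Parse an explicit argv so the example is self-contained.
args = build_parser().parse_args(["-f", "refs.bib"])
print(args.bibfile, args.force)  # → refs.bib True
```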
My suggestions:
Here's the current flow I'm working with:
I converted the zipped anthology file (from running python3 bin/create_bibtex.py) into a single csv file such that for every query, I:
I imagine this approach would be much easier than having the user confirm matches. But of course, confirming matches would reduce the number of steps that go into matching the entries.
What do you think?
@davidweichiang I also agree with writing to a different file rather than overwriting the user's input. If nothing else, one can easily go back, change a few things, and rerun the tool.
- An (optional) way of outputting the short names of conferences instead of the full names.
If you go this route, here's how the Zotero plugin uses the venue information in the Anthology to determine the abbreviation: https://github.com/zotero/translators/blob/99429bd5b2f2f95611911e50e95b910653d097f1/ACLWeb.js#L153-L186
@davidweichiang: I had the same thought about outputting short names, as a non-default option, perhaps --short|-s.
Writing to a separate output file is a good idea. It would be nice, however, if the reading of the input file were completely separated from the writing, so that one could specify the same input and output file without fear of corruption. I would also suggest that STDIN and STDOUT be supported as defaults for input and output (or by using the - convention).
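A minimal sketch of that I/O discipline, assuming the tool reads the whole input before opening the output (the function name is hypothetical):

```python
import sys

def clean(in_name, out_name):
    # Read the entire input before opening the output, so passing the same
    # path for both is safe; honor "-" as STDIN/STDOUT per convention.
    if in_name == "-":
        text = sys.stdin.read()
    else:
        with open(in_name, encoding="utf-8") as f:
            text = f.read()
    cleaned = text  # placeholder for the actual BibTeX rewriting step
    if out_name == "-":
        sys.stdout.write(cleaned)
    else:
        with open(out_name, "w", encoding="utf-8") as f:
            f.write(cleaned)
```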
acleanbib is perfect.
@Olamyy:
This sounds good. A first working version could work on just exact matches, but I would not replace unless you match (a) title (b) author list and (c) year. Matching venue would be harder since the names can vary so much by small amounts. Once this is working you could move to fuzzier matching.
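A first exact-match pass along these lines might key entries on a normalized (title, author, year) triple; this sketch assumes bibtexparser-style dict entries and is purely illustrative:

```python
import re

def match_key(entry):
    # Collapse whitespace and case so trivial formatting differences
    # don't block an otherwise exact match on title/author/year.
    def norm(s):
        return re.sub(r"\s+", " ", s).strip().lower()
    return (norm(entry.get("title", "")),
            norm(entry.get("author", "")),
            entry.get("year", "").strip())

a = {"title": "A  Study", "author": "Jane Doe and John Roe", "year": "2019"}
b = {"title": "a study",  "author": "Jane Doe and  John Roe", "year": "2019"}
print(match_key(a) == match_key(b))  # → True despite formatting noise
```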
You might find the anthology/Anthology class helpful. You can also download the full anthology bib file from here, caching it to, say, ~/.acleanbib/anthology.bib.gz. We could add an md5 sum so that you could check programmatically whether the bib file needs to be re-downloaded.
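The md5 freshness check could look something like this minimal sketch (the function name and cache path are hypothetical):

```python
import hashlib
from pathlib import Path

def needs_redownload(local_path, remote_md5):
    # Compare the md5 of the cached file against a published checksum;
    # re-download when the cache is missing or its digest differs.
    path = Path(local_path)
    if not path.exists():
        return True
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    return digest != remote_md5
```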
When we move to the expanded ID #291, we should have a database of venue acronyms (e.g., acl, wmt, coling) that would facilitate this. We could easily expand data/yaml/venues.yaml with this information, and then it could also be used here.
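A hypothetical sketch of how such a venue table could drive the optional short-name output; the mapping below is illustrative, not the real contents of venues.yaml:

```python
# Illustrative full-name → acronym table, in the spirit of venues.yaml.
VENUES = {
    "Annual Meeting of the Association for Computational Linguistics": "ACL",
    "Conference on Machine Translation": "WMT",
    "International Conference on Computational Linguistics": "COLING",
}

def short_name(booktitle):
    # Case-insensitive substring lookup; fall back to the full name
    # when no mapping exists.
    for full, short in VENUES.items():
        if full.lower() in booktitle.lower():
            return short
    return booktitle

print(short_name(
    "Proceedings of the 57th Annual Meeting of the "
    "Association for Computational Linguistics"))  # → ACL
```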
Here's an initial implementation of the idea. Feedback would be appreciated.
Awesome! I get this error, though:
Traceback (most recent call last):
File "/usr/local/bin/acleanbib", line 11, in <module>
load_entry_point('acl-cleaner', 'console_scripts', 'acleanbib')()
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/Users/post/code/acleanbib/script.py", line 13, in aclbibcleaner
cleaner = ACLCleaner(bibtex, output)
File "/Users/post/code/acleanbib/cleaner.py", line 31, in __init__
self.bibdata = pandas.read_csv(self.anthology_path, compression='zip', low_memory=False)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 702, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 429, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 895, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1122, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1853, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 387, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 644, in pandas._libs.parsers.TextReader._setup_parser_source
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1222, in __init__
self._RealGetContents()
File "/usr/local/Cellar/python/3.7.2_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/zipfile.py", line 1289, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
Also, how can I run it from the local directory without installing (for development)?
The error is a bit odd; I can read the zipped file successfully on my end. I re-uploaded it in my latest commit. Let me know if it works now.
Also, I added a commit so it can be run from the local directory with python script.py [BIBTEX] [OUTPUT]
Thanks! That's better, but now I get this:
$ python3 script.py citation-220873086.bib
Python Version : 3.7.2
Version tuple: ('3', '7', '2')
Compiler : Clang 10.0.0 (clang-1000.11.45.5)
Build : ('default', 'Mar 7 2019 12:42:28')
NO MATCH FOUND FOR Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes.
NO MATCH FOUND FOR Long Short-Term Memory Based Recurrent Neural Network Architectures
for Large Vocabulary Speech Recognition
BibTeX item 'entries' does not exist and will not be written. Valid items are ['entries', 'comments', 'preambles', 'strings'].
BibTeX item 'entries' does not exist and will not be written. Valid items are ['entries', 'comments', 'preambles', 'strings'].
(I can try to debug later but maybe you know what's wrong)
Hi @mjpost, I've made updates to suppress the logger output from bibtexparser. I also added a report showing users the outcome of the process. The output looks like this:
| Paper | Match Found |
|---|---|
| Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes. | False |
| Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition | False |
| Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) | True |
| A Multi-media Approach to Cross-lingual Entity Knowledge Transfer | True |
Hi @Olamyy, awesome! I'll look at this when I catch a minute!
Hi @mjpost , Is there any update on your review of this PR?
Thanks for the ping! I will look today.
@Olamyy would you mind changing the license to Apache 2.0? There is not really a big difference between MIT and Apache 2.0, but the anthology already uses Apache as a license and having only one license would be easier if the code should be made official.
I unfortunately don't have time to look at the code at the moment.
I looked at it a bit and created some issues on your repo! This will be really cool once we have it all worked out.
It would be great to have a pip tool (acl-cleaner?) that reads a bibtex file and replaces entries with canonical Anthology ones, perhaps with confirmation. It would keep the original cite key so that it wouldn't break the compile. The matching shouldn't be too hard, I wouldn't think: some kind of case-insensitive fuzzy match of the title, author list length, and author names.
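For the case-insensitive fuzzy title match, something as simple as difflib from the standard library could serve as a starting point (a sketch, not a prescription; dedicated fuzzy-matching libraries would also work):

```python
from difflib import SequenceMatcher

def title_similarity(a, b):
    # Case-insensitive similarity ratio in [0, 1]; punctuation and
    # small edits only dent the score slightly.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

score = title_similarity(
    "Improved Modeling of Out-of-Vocabulary Words",
    "Improved modeling of out-of-vocabulary words.",
)
print(score > 0.9)  # → True: near-identical titles score high
```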