dalejn / cleanBib

Probabilistically assign gender and race proportions of first/last author pairs in bibliography entries
MIT License

Error message in get_duplicates #53

Closed hpay closed 9 months ago

hpay commented 9 months ago

Hi! I'm trying to quickly run cleanBib.ipynb to analyze a manuscript in progress. I get an error in the second block of code. I don't have much experience with Python, so I'm not sure how to debug it. Any thoughts?

Error message is below.


---------------------------------------------------------------------------
TokenRequired                             Traceback (most recent call last)
Cell In[14], line 14
     11     get_names_published(homedir, bib_data, cr)
     12 else:
     13     # find and print duplicates
---> 14     bib_data = get_duplicates(bib_data, bib_files[0])
     15     # get names, remove CDS, find self cites
     16     get_names(homedir, bib_data, yourFirstAuthor, yourLastAuthor, optionalEqualContributors, cr)

File ~/utils/preprocessing.py:207, in get_duplicates(bib_data, filename)
    205     bib_data = get_bib_data(new_bib, "")
    206 else:
--> 207     bib_data = get_bib_data(filename, "")
    208 return bib_data

File ~/utils/preprocessing.py:165, in get_bib_data(filename, parser)
    162 else:
    163     # this one will error if you have duplicates
    164     parser = bibtex.Parser()
--> 165     bib_data = parser.parse_file(filename)
    167 return bib_data

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/__init__.py:54, in BaseParser.parse_file(self, filename, file_suffix)
     52 with open_file(filename, encoding=self.encoding) as f:
     53     try:
---> 54         self.parse_stream(f)
     55     except UnicodeDecodeError as e:
     56         raise PybtexError(six.text_type(e), filename=self.filename)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:412, in Parser.parse_stream(self, stream)
    410 def parse_stream(self, stream):
    411     text = stream.read()
--> 412     return self.parse_string(text)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:399, in Parser.parse_string(self, text)
    389 self.command_start = 0
    391 entry_iterator = LowLevelParser(
    392     text,
    393     keyless_entries=self.keyless_entries,
   (...)
    397     macros=self.macros,
    398 )
--> 399 for entry in entry_iterator:
    400     entry_type = entry[0]
    401     entry_type_lower = entry_type.lower()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:195, in LowLevelParser.parse_bibliography(self)
    193     yield tuple(self.parse_command())
    194 except PybtexSyntaxError as error:
--> 195     self.handle_error(error)
    196 except SkipEntry:
    197     pass

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:385, in Parser.handle_error(self, error)
    383 def handle_error(self, error):
    384     from pybtex.errors import report_error
--> 385     report_error(error)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/errors.py:78, in report_error(exception)
     75     return
     77 if strict:
---> 78     raise exception
     79 else:
     80     print_error(exception, 'WARNING: ')

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:193, in LowLevelParser.parse_bibliography(self)
    191 self.command_start = self.pos - 1
    192 try:
--> 193     yield tuple(self.parse_command())
    194 except PybtexSyntaxError as error:
    195     self.handle_error(error)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:226, in LowLevelParser.parse_command(self)
    224     self.required([body_end])
    225 except PybtexSyntaxError as error:
--> 226     self.handle_error(error)
    227 return make_result()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:385, in Parser.handle_error(self, error)
    383 def handle_error(self, error):
    384     from pybtex.errors import report_error
--> 385     report_error(error)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/errors.py:78, in report_error(exception)
     75     return
     77 if strict:
---> 78     raise exception
     79 else:
     80     print_error(exception, 'WARNING: ')

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:223, in LowLevelParser.parse_command(self)
    221     make_result = lambda: (command, (self.current_entry_key, self.current_fields))
    222 try:
--> 223     parse_body(body_end)
    224     self.required([body_end])
    225 except PybtexSyntaxError as error:

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:242, in LowLevelParser.parse_entry_body(self, body_end)
    240     key_pattern = self.KEY_PAREN if body_end == self.RPAREN else self.KEY_BRACE
    241     self.current_entry_key = self.required([key_pattern]).value
--> 242 self.parse_entry_fields()
    243 if not self.want_current_entry():
    244     raise SkipEntry

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:250, in LowLevelParser.parse_entry_fields(self)
    248 self.current_field_name = None
    249 self.current_value = []
--> 250 self.parse_field()
    251 if self.current_field_name and self.current_value:
    252     self.current_fields.append((self.current_field_name, self.current_value))

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/database/input/bibtex.py:262, in LowLevelParser.parse_field(self)
    260     return
    261 self.current_field_name = name.value
--> 262 self.required([self.EQUALS])
    263 self.parse_value()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pybtex/scanner.py:120, in Scanner.required(self, patterns, description, allow_eof)
    118     if not description:
    119         description = ' or '.join(pattern.description for pattern in patterns)
--> 120     raise TokenRequired(description, self)
    121 else:
    122     return token

TokenRequired: syntax error in line 3: '=' expected

dalejn commented 9 months ago

Thanks for trying out this tool and happy new year! Happy to help debug.

This looks like the pybtex function couldn't read an entry of the .bib file. Could you please attach here or send me the .bib file you're using? It could be as simple as fixing the formatting of that entry.

hpay commented 9 months ago

Thanks! I'd appreciate that. The .bib file is attached here (after renaming the extension to .txt).

I generated it by changing the citation style in Mendeley to BibTeX, generating a bibliography in my Word document, and copy-pasting it into a text file.

Bibtex_utf8.txt

dalejn commented 9 months ago

Got it, thanks!

Looks like some formatting issues were introduced when the citation styles were auto-converted. The line number at the bottom of that long error message (here, "syntax error in line 3") tells you which line of the .bib file the parser got stuck on.
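To jump straight to the reported line without opening an editor, something like the sketch below could help. It is not part of cleanBib; `show_line` is a hypothetical helper, and it assumes the parser counts lines the same way a text editor does.

```python
def show_line(bib_text, lineno, context=1):
    """Return line `lineno` (1-based) plus `context` neighbouring lines, numbered."""
    lines = bib_text.splitlines()
    first = max(lineno - 1 - context, 0)
    last = min(lineno + context, len(lines))
    rows = []
    for i in range(first, last):
        # Mark the line the error message pointed at.
        marker = ">>" if i == lineno - 1 else "  "
        rows.append(f"{marker} {i + 1:4d}: {lines[i]}")
    return "\n".join(rows)


# Hypothetical usage; substitute the path of your own .bib file:
# from pathlib import Path
# print(show_line(Path("Bibtex_utf8.txt").read_text(encoding="utf-8"), 3))
```

The line marked with `>>` is the one named in the error message; the neighbouring lines are shown because the real problem sometimes starts on the line before.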

Most of the errors come from spaces around nobiliary particles in surnames (e.g. de, du, van) inside the entry key. These need to be replaced with underscores, or else the parser won't be able to recognize the start of the next field. The other issues were minor: a space instead of an underscore inside "et al." in the entry key, and one line that contained two entries.

I went through and fixed these; see the attached file, Bibtex_utf8.txt.

I'm noting a few examples below in case they're useful for future reference.

Line 3: change Bagnall_McElvain_Faulstich_du Lac_2008 to Bagnall_McElvain_Faulstich_du_Lac_2008

Line 17: change lopath_Badura_De Zeeuw_Brunel_2014 to lopath_Badura_De_Zeeuw_Brunel_2014, etc.

Line 85: change et al.2011 to et_al.2011

Line 31: there are two entries on one line. I added a line break before the next @article
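For a longer bibliography, the manual fixes above could be approximated with a couple of regular expressions. This is only a rough sketch under stated assumptions, not what cleanBib or pybtex does: `underscore_keys` and `split_entries` are hypothetical helpers, they assume every entry looks like `@type{key, ...}`, and they may misfire on unusual blocks such as `@string` or `@comment`, so review the output before using it.

```python
import re

# An entry header: "@type{" followed by the citation key, up to the first comma.
ENTRY_HEADER = re.compile(r"(@\w+\s*\{)([^,{}]*),")


def underscore_keys(bib_text):
    """Replace each run of whitespace inside a citation key with one underscore."""
    def fix(match):
        key = re.sub(r"\s+", "_", match.group(2).strip())
        return f"{match.group(1)}{key},"
    return ENTRY_HEADER.sub(fix, bib_text)


def split_entries(bib_text):
    """Start a new line before any '@type{' that follows other text on its line."""
    return re.sub(r"(?<=\S)[ \t]*(?=@\w+\s*\{)", "\n", bib_text)


# Hypothetical usage on the text of a .bib file:
# cleaned = split_entries(underscore_keys(bib_text))
```

Only the key (the text between `{` and the first comma) is touched, so spaces inside titles and author names are left alone.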

P.S. I find it useful to open the .bib file in something like Sublime Text or a .tex editor because they color-code the fields for you, which makes it much easier to spot inconsistencies in the formatting.

Feel free to let me know if you run into any other issues, happy to help!

dalejn commented 9 months ago

I ran into some more issues after the formatting ones: the entries Kawano_Shidara_Watanabe_Yamane_1994 and Newsome_Wurtz_Komatsu_1988 have malformed DOI fields (apparently the page range was entered instead), which return an error like "DOI specified as 2305-2324 but must be of the form: 10.prefix\/suffix where prefix is 4 or more digits and suffix is a string"

To resolve this, delete the DOI field for those entries, or look up the correct DOIs manually and replace them, as below:

@article{Kawano_Shidara_Watanabe_Yamane_1994, title={Neural activity in cortical area MST of alert monkey during ocular following responses}, volume={71}, ISBN={0022-3077 (Print)r0022-3077 (Linking)}, ISSN={0022-3077}, DOI={10.1152/jn.1994.71.6.2305}, number={6}, journal={Journal of Neurophysiology}, author={Kawano, Kenji and Shidara, M. and Watanabe, Y and Yamane, S}, year={1994}, pages={2305–2324} }

@article{Newsome_Wurtz_Komatsu_1988, title={Relation of cortical areas MT and MST to pursuit eye movements. II. Differentiation of retinal from extraretinal inputs}, volume={60}, ISBN={0022-3077 (Print)r0022-3077 (Linking)}, ISSN={00223077}, DOI={10.1152/jn.1988.60.2.604}, number={2}, journal={Journal of Neurophysiology}, author={Newsome, W. T. and Wurtz, R. H. and Komatsu, H.}, year={1988}, pages={604–620} }
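To catch this kind of DOI problem before running the notebook, a quick shape check along these lines might be enough. It is a sketch rather than the validation that produces the error above: `find_bad_dois` is a hypothetical helper, it assumes DOI fields are written as `DOI={...}`, and it only tests the "10.prefix/suffix" shape from the error message, not whether the DOI actually resolves.

```python
import re

# Shape from the error message: "10." + 4 or more digits + "/" + a non-empty suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,}/\S+")
# A DOI field written as DOI={...}; the field name is matched case-insensitively.
DOI_FIELD = re.compile(r"\bdoi\s*=\s*\{([^{}]*)\}", re.IGNORECASE)


def find_bad_dois(bib_text):
    """Return every DOI field value that does not look like 10.prefix/suffix."""
    values = (value.strip() for value in DOI_FIELD.findall(bib_text))
    return [value for value in values if not DOI_PATTERN.fullmatch(value)]


# Hypothetical usage on the text of a .bib file:
# for doi in find_bad_dois(bib_text):
#     print("Suspicious DOI:", doi)
```

Anything it reports, such as `2305-2324`, is worth checking by hand against the publisher's page.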

hpay commented 9 months ago

Thank you! I ran through it without further errors. I did have one additional question that I'll post as a separate issue.