BibTeX parser - Githubissues

retorquere commented 7 years ago

I see this project has a bibtex parser already, but it doesn't seem to handle things like translation to unicode equivalents; I can recommend https://github.com/fiduswriter/biblatex-csl-converter, perhaps a cooperative effort could be helpful to both.

dsifford commented 7 years ago

@retorquere The parser does actually convert to unicode characters.

Example:

Input:

https://github.com/dsifford/astrocite/blob/ac826791978859fda3e52f73b7dd17daa733eb10/packages/astrocite-bibtex/src/__tests__/cases/dicraticals.bib#L1-L7

Output:

https://github.com/dsifford/astrocite/blob/ac826791978859fda3e52f73b7dd17daa733eb10/packages/astrocite-bibtex/src/__tests__/__snapshots__/parser-test.ts.snap#L191-L204

Was there something that I missed that I can maybe add in?

I'm familiar with biblatex-csl-converter, but opted to create this library instead because I personally believe parsing into Abstract Syntax Trees first results in a safer and more predictable experience for the user.

retorquere commented 7 years ago

My mistake.

I'll run astrocite over my test suite to see what bubbles up.

dsifford commented 7 years ago

Great! I appreciate the feedback.

There's likely still a few bugs to work out so keep me posted on what you dig up 😄

Also, I'm curious what the exact difference is between bibtex and biblatex? Does the latter just provide more reference types/fields? Is there a spec somewhere I can look at?

Thanks in advance!

retorquere commented 7 years ago

biblatex offers more, more fine-grained (in the case of authors and dates) and different fields. It's hard to decide based on the input whether you're being handed a bibliography intended to be processed by biblatex or bibtex, so I just try to do best-effort translation of fields without assuming a format preference.

For documentation, in my experience, most people will use BibTeXing and Tame the BeaST for BibTeX, and The Biblatex Package for BibLaTeX, but there isn't really a definitive manual for either format that is universally followed by Bib(La)TeX editors/processors, and given the nature of latex (it is a full-blown programming language), people can do some hair-raising stuff you won't find in any manual. I parse the documented fields as per the docs (WRT names, literal lists, verbatim fields, etc) and will parse everything outside the spec as a non-verbatim field.

retorquere commented 7 years ago

These samples from my test set each individually cause the parser to throw an error.

errors.zip

dsifford commented 7 years ago

@retorquere Thanks for the resources. Super helpful!

Gonna start sifting through these error-causing files here now and report back with questions as they come up...

First question:

in arXiv identifiers in BibLaTeX export #460.biblatex, i see that the IDs are formatted like this...

@article{Sen.2016.BV,
% ...
}

It's my understanding that . characters are not allowed? Am I mistaken?

EDIT: Checked this in ShareLatex and it works. So I'll fix this.

dsifford commented 7 years ago

Same for this file: export/Better BibLaTeX.012.biblatex

@article{10.1000/182+physical_volcanology_1600_eruption,
% ...
}

Here there are +, /, and . symbols. Are those allowed? I vaguely recall reading somewhere that only alphanumeric characters and _, :, or - were allowed.

EDIT: Checked this in ShareLatex and it works. So I'll fix this.

dsifford commented 7 years ago

In this file export/Empty bibtex clause in extra gobbles whatever follows #99.bibtex there also appears to be invalid symbols used in property keys...

@book{2014,
  note = {\url{http://ptolemy.org/books/Systems}},
  ptolemaeus:14:systemdesigneditor = {Claudius Ptolemaeus}, % <--------- This line
  publisher = {{Ptolemy.org}},
  timestamp = {2015-02-24 12:14:36 +0100},
  title = {System Design, Modeling, and Simulation Using {{Ptolemy II}}},
  year = {2014}
}

EDIT: This causes a critical error in ShareLatex. So I'm not going to add support for this since it doesn't appear to be a valid property.

dsifford commented 7 years ago

export/Pandoc Citation.latex should throw an error because it isn't a bibtex file

dsifford commented 7 years ago

import/Endnote should parse.bib should throw error because the entry doesn't have an id

dsifford commented 7 years ago

String expression in import/Import fails to perform @String substitutions #154.bib... Are : allowed in the macro name?

@String{pub-FRED:adr = "London, UK"}

EDIT: Tried this in ShareLatex and it works. So I'll add support for it.

Question: Is there a spec on what characters can exist in macro names? Feels strange allowing non-alpha characters.

dsifford commented 7 years ago

According to the biblatex spec that you shared above, no property key should have - in them. Several of the failures are caused by this. These should be failures.

Examples of properties that are used with this issue:

Date-Added
Date-Modified
Bdsk-Url-1
author-email
doc-delivery-number
funding-acknowledgement
funding-text
journal-iso
keywords-plus
number-of-cited-references
subject-category
times-cited
added-at

All are not valid to my knowledge. So these should throw errors.

EDIT: Tried this in ShareLatex and it appears to compile without any issues. So I guess I'll add support for properties with - in the name. These fields will still be skipped for now when parsing CSL because there isn't a specification for them anywhere. They'll be present in the AST though.

retorquere commented 7 years ago

Urgh, my bad on most of those -- anything that has export in the path should be ignored.

What remains:

Endnote should parse.bib is unfortunately how Endnote exports to BibTeX. biblatex-csl-converter (which is what BBT uses) imports these for that reason. My main target is to get references imported, or at least as many as possible; our needs may well differ.
Import fails to perform @String substitutions #154 if by "allowed" you mean "does it compile", then yes.
properties with dashes: these may or may not be valid, but biber accepts them and output renders as expected.

These kinds of things is why I mentioned "there isn't really a definitive manual for either format that is universally followed by Bib(La)TeX editors/processors". bib(la)tex is messy, and the documentation of each describe what ought to work as a minimum (if it doesn't work, you have a right to complain), it does not describe the limits of what will be accepted/what will work.

dsifford commented 7 years ago

Got it.

So to be clear, EndNote exports citations without any IDs? Has anybody reached out to them about this?

(See EDIT: notes above for other questions)

retorquere commented 7 years ago

Yah, the ptolemaeus can be ignored -- it properly belongs in the BBT test set but for reasons that are not relevant to you. I usually use ShareLatex too to see what actually works.

Yes, Endnote exports without IDs. Endnote doesn't have the concept of reference IDs (just as Zotero), and if I'd venture a guess they don't export it for that reason. Zotero generates IDs from the reference itself. I have not reached out to EndNote.

retorquere commented 7 years ago

In any case it is not a problem if our needs differ. biblatex-csl-converter does what I need, plus I can't use a parser that throws errors; Zotero does not allow for user interaction during import, all the user sees is "something went wrong", so biblatex-csl-converter is pretty forgiving, putting errors in the parser output for me to handle as I see fit.

dsifford commented 7 years ago

Got it.

So pretty much all these can be fixed just by addressing the couple things above. The only thing that still doesn't work after addressing the fixable issues is the EndNote-no-ID thing.

Entries without IDs are fine, they just at the very least need to have a comma there. Even ShareLatex errors in that scenario as well.

Just to be totally clear. Exporting from EndNote does not produce a comma where the ID should be, correct? If that's the case, the issue is in EndNote and I'll reach out to them.

Edit: Downloading a copy of endnote now from my university to check on this.

dsifford commented 7 years ago

I can't use a parser that throws errors

Why not just try/catch in the scenarios where errors are possible? Not using try/catch and using the situation that you described where you get a string with the type of error back works for expected errors, but it will still be fragile in situations where unexpected errors occur (and are subsequently thrown).

dsifford commented 7 years ago

@retorquere Just tried exporting from EndNote X8 and I can confirm that EndNote does produce IDs for bibtex.

@article{RN2,
   % ... [redacted]
}

@article{RN3,
   % ... [redacted]
}

retorquere commented 7 years ago

Must be an older version then. But I have to allow for older references to import. Almost all the references in my test suite are "live" references I got from actual users.

The throw-catch approach is an all-or-nothing approach to import; any error means nothing at all gets imported. I prefer to do partial imports instead, and adding a note about any errors I found; that way, the person doing the import can (if they want to at all) just fix the refs not imported all at once. With throw-catch the person doing the import would have to do a new import for each error in the bibliography. It's just not a workflow that I think people will like. I am also constrained here by the fact that the importers are a) always and only user-initiated, and b) can have no UI, so I can't go back-and-forth with the user to fix the import. If the import fails, I can choose to say to Zotero "there was nothing to do" (effectively ignoring the error) or re-throw the error, in which case Zotero will say only "something went wrong". Not very helpful.

It is important to note here that BBT is mainly about exporting references, not importing them; I want to help people get as many references into Zotero as they can, and as the existing Zotero BibTeX importer was relatively simple it was missing some hairier cases, and I stepped in to handle those where I can. This is why I prefer to use a lenient parser, and why I want to keep support for the older Endnote-generated references. But the main concern of BBT is to make sensible bib(la)tex references from Zotero references.

retorquere commented 7 years ago

biblatex-csl-converter does not throw errors as part of its normal operation -- of course it can contain errors and throw errors by mistake, but this is the case for any and all statements in a program; I'm not wrapping every statement with its own catch block.

Diff'rent Strokes for diff'rent folks. I think the astrocite approach is perfectly fine, it's just not a match for my needs -- which is no problem, as biblatex-csl-converter works well for my use-case.

dsifford commented 7 years ago

Let me see if I can get this to work in the ID-less case. That's the only test that currently fails. Might be able to pull off something.

Re: biblatex-csl-converter -- If you guys need to add extended functions beyond the capabilities of astrocite-bibtex, you're more than welcome to use the AST and create your own parser from there. I'd recommend that route as it's arguably the safest, fastest, and easiest way to do it.

retorquere commented 7 years ago

I encourage cooperation and re-use (it is after all why I opened this issue), but for my personal use-case, error handling by throwing errors would be a step back, and I can't readily envision how biblatex-csl-parser would compensate for that. Not trying to be antagonistic here, and there would be great value in cooperation from my POV, but there may just be different use cases. If the ptolemaeus were part of a larger bibliography, I'd want to import all the rest and report on that one reference.

retorquere commented 7 years ago

this reference also throws an error:

@inproceedings{Soriguera2012,
abstract = {1 The development and decreased cost of technology and communications have brought about a 2 huge increase in the availability of traffic data. With every passing day, traffic management 3 centers must deal with an increased amount of detailed data. Once the real time use of these data 4 is complete, they must be stored for long periods of time. In this long term context, the vast 5 amount of raw data is meaningless, which is a clear example of data asphyxiation. 6 Traffic management centers must aggregate and synthesize the data in order to extract the 7 maximum knowledge from them. Pattern classification is a way to deal with this issue. 8 Traditionally, traffic demand patterns have been easily constructed using ad hoc methods, where 9 " experience " is their main attribute. These procedures lack the required rigor to support current 10 needs in terms of planning and operational management. 11 The present paper proposes a method to systematically derive traffic demand patterns 12 from historical data. The method is based on the cluster analysis technique, and allows the 13 inclusion of preexistent knowledge, which eases the interpretation and practical use of the 14 results. The proposed pattern classification procedure is applied to five years of hourly traffic 15 volumes on a Spanish highway. The obtained results prove the validity and utility of the method 16 to accurately summarize the seasonal and daily characteristics of traffic demand. 17 18 19},
author = {Soriguera, F and Rosas, D},
booktitle = {Transportation Research Board, 91st Annual Meeting},
file = {:C$\backslash$:/Users/amitrani/Documents/Mendeley/Soriguera, Rosas - 2012 - Deriving Traffic Demand Patterns from Historical Data.pdf:pdf},
pages = {1--18},
title = {{Deriving Traffic Demand Patterns from Historical Data}},
year = {2012}
}

dsifford commented 7 years ago

I think I got it figured out.

Got all test cases passing. In cases that don't have an ID, the parser will create a unique one.

Gonna push the fixes in a few. It should work for all cases you have.

retorquere commented 7 years ago

Wouldn't generating an ID go against the idea of an AST?

dsifford commented 7 years ago

Solid point.

Scratch that. Empty string it is!

dsifford commented 7 years ago

@retorquere Should be good to go now. Let me know if you run into any other issues.

Thanks again for the feedback.

retorquere commented 7 years ago

Almost everything passes now, but this still throws an error:

@inproceedings{Zhang2012i,
abstract = {Mixed findings have been reported in previous research regarding the impact of built environment on travel behavior, i.e. statistically and practically significant effects found in a number of empirical studies and insignificant correlations shown in many other studies. It is not clear why the estimated impact is stronger or weaker in certain urban areas, and how effective a proposed land use change/policy will be in changing certain travel behavior. This knowledge gap has made it difficult for decision makers to evaluate land use plans and policies according to their impact on vehicle miles traveled (VMT), and consequently their impact on congestion mitigation, energy conservation, and pollution and green house gas emission reduction.    $\backslash$n$\backslash$nThis research has several objectives: (1) Re-examine the effects of built-environment factors on travel behavior, in particular VMT in five U.S. metropolitan areas grouped into four case study areas; (2) Develop consistent models in all case study areas with the same model specification and datasets to enable direct comparisons; (3) Identify factors such as existing land use characteristics and land use policy decision-making processes that may explain the different impacts of built environment on VMT in different urban areas; and (4) Provide a prototype tool for government agencies and decision-makers to estimate the impact of proposed land use changes on VMT.     $\backslash$n$\backslash$nThe four case study areas include Seattle, WA; Richmond-Petersburg and Norfolk-Virginia Beach, VA; Baltimore, MD; and Washington DC. Our empirical analysis employs Bayesian multilevel models with various person-level socio-economic and demographic variables and five built-environment factors including residential density, employment density, entropy (measuring level of mixed-use development), average block size (measuring transit/walking friendliness), and distance to city center (measuring decentralization and level of infill development).$\backslash$n$\backslash$nOur findings show that promoting compact, mixed-use, small-block and infill developments can be effective in reducing VMT per person in all four case study areas. However, the effectiveness of land use plans and policies encouraging these types of land developments is different both across case study areas and within the same case study area. We have identified several factors that potentially influence the connection between built environment shifts and VMT changes including urban area size, existing built environment characteristics, transit service coverage and quality, and land use decision-making processes.},
author = {Zhang, Lei and Nasri, Arefeh and Hong, Jin Hyun and Shen, Qing},
booktitle = {Transportation Research Board, 91st Annual Meeting},
doi = {10.5198/jtlu.v5i3.266},
file = {:C$\backslash$:/Users/amitrani/Documents/Mendeley/Zhang et al. - 2012 - How built environment affects travel behavior A comparative analysis of the connections between land use and vehic.pdf:pdf},
isbn = {{1938-7849|escape{\}}},
issn = {1938-7849},
keywords = {built environment,land use change,multilevel bayesian model,portation planning policy,travel behavior,us urban trans-,vehicle miles traveled,vmt},
number = {3},
pages = {40--52},
title = {{How built environment affects travel behavior: A comparative analysis of the connections between land use and vehicle miles traveled in US cities}},
url = {https://www.jtlu.org/index.php/jtlu/article/view/266},
volume = {5},
year = {2012}
}

dsifford commented 7 years ago

@retorquere I think your test is what's broken...

Take a look at the isbn property...

isbn = {{1938-7849|escape{\}}},

The brackets are not properly matched....

isbn = {
         {
           1938-7849|escape{\}}
                             ^ This bracket is escaped.
       },

dsifford commented 7 years ago

I could be wrong.. I'm not sure if the |escape part holds any significance. Not familiar with that.

retorquere commented 7 years ago

No, you're right, this one is malformed. All my non-malformed references pass through now, I have no other test cases I could offer you right now.

dsifford commented 7 years ago

Phew! Glad to hear.

Thanks again for the awesome feedback.

ptarroso commented 4 years ago

Hi

The following example is incorrectly parsed by bibtexparser (or maybe an incorrectly bib but it parses ok in tex):

@article{refTest,
  title={Example text},
  author={Second{\'\i}{\'\i}, First},
  journal={Journal}
 }

The author should be "Secondíí, First" but it returns " {'author': 'Secondı́\,́ First'}.

Tested in Pyhton 3.7.5 with code:

with open(bibfile) as bibtex:
    parser = BibTexParser()
    parser.customization = convert_to_unicode
    bib = bibtexparser.load(bibtex, parser=parser)

bib.entries

hubgit commented 4 years ago

@ptarroso I think you're looking for https://github.com/sciunto-org/python-bibtexparser

ptarroso commented 4 years ago

@hubgit :flushed: I thought I was commenting there...

dsifford / astrocite

BibTeX parser #2