dsifford / astrocite

Bibliography file format => AST => CSL JSON
MIT License

Error thrown when a newline is encountered within an RIS field #11

Open hubgit opened 4 years ago

hubgit commented 4 years ago

I'm not 100% sure whether newlines are valid within RIS fields, and it's not too hard to run a filter to remove them before passing the data to astrocite if needed, but I ran into this issue and thought it might be something that should be handled by the parser.

Code

import { parse } from 'astrocite-ris'

const item = parse(`TY  - JOUR
AB  - Brazilian Amazon forests contain a large stock of carbon that could be released into the atmosphere as a result of land use and cover change. To quantify the carbon stocks, Brazil has forest inventory plots from different sources, but they are unstandardized and not always available to the scientific community. Considering the Brazilian Amazon extension, the use of remote sensing, combined with forest inventory plots, is one of the best options to estimate forest aboveground biomass (AGB). Nevertheless, the combination of limited forest inventory data and different remote sensing products has resulted in significant differences in the spatial distribution of AGB estimates. This study evaluates the spatial coverage of AGB data (forest inventory plots, AGB maps and remote sensing products) in undisturbed forests in the Brazilian Amazon. Additionally, we analyze the interconnection between these data and AGB stakeholders producing the information. Specifically, we provide the first benchmark of the existing field plots in terms of their size, frequency, and spatial distribution.
We synthesized the coverage of forest inventory plots, AGB maps and airborne light detection and ranging (LiDAR) transects of the Brazilian Amazon. Although several extensive forest inventories have been implemented, these AGB data cover a small fraction of this region (e.g., central Amazon remains largely uncovered). Although the use of new technology such as airborne LiDAR cover a significant extension of AGB surveys, these data and forest plots represent only 1% of the entire forest area of the Brazilian Amazon.
Considering that several institutions involved in forest inventories of the Brazilian Amazon have different goals, protocols, and time frames for forest surveys, forest inventory data of the Brazilian Amazon remain unstandardized. Research funding agencies have a very important role in establishing a clear sharing policy to make data free and open as well as in harmonizing the collection procedure. Nevertheless, the use of old and new forest inventory plots combined with airborne LiDAR data and satellite images will likely reduce the uncertainty of the AGB distribution of the Brazilian Amazon.
AU  - Tejada, Graciela
AU  - Görgens, Eric Bastos
AU  - Espírito-Santo, Fernando Del Bon
AU  - Cantinho, Roberta Zecchini
AU  - Ometto, Jean Pierre
CY  - United Kingdom
DA  - 2019/09/03
DO  - 10.1186/s13021-019-0126-8
ID  - 035-899-051-865-265
IS  - 1
JF  - Carbon balance and management
KW  - Aboveground biomass
KW  - Amazon
KW  - Carbon
KW  - REDD+
KW  - Remote sensing
KW  - Tropical rain forest
PB  - BioMed Central
PY  - 2019
SN  - 17500680
SP  - 11
TI  - Evaluating spatial coverage of data on the aboveground biomass in undisturbed forests in the Brazilian Amazon.
UR  - https://lens.org/035-899-051-865-265
VL  - 14
ER  -
`)

Expected behavior:

A parsed item, with an abstract containing newlines.

Actual behavior:

Expected Mandatory Horizontal Whitespace or [A-Z0-9] but "e" found.

hubgit commented 4 years ago

Feel free to close this if newlines in RIS fields aren't valid - it feels like that would be reasonable, as otherwise the contents of the text could accidentally start a new field.

dsifford commented 4 years ago

Yeah I don't have the info on hand, but I believe all fields are supposed to proceed until the end of the line without any new lines.

If you're able to find documentation negating that, I'd be happy to look into adjusting the parser.

Did this file get generated from some other application? I haven't run into this before myself.

hubgit commented 4 years ago

> Did this file get generated from some other application?

I'm testing with RIS files generated by https://lens.org, but there are some other issues with abstracts in those files (e.g. XML markup) that suggest the export is still a work in progress.

hubgit commented 4 years ago

> If you're able to find documentation negating that, I'd be happy to look into adjusting the parser.

I found this in the RIS specification:

How to handle long fields

If the information following any one tag is more than 70 characters long, it is allowable (though not necessary) to insert a carriage return/line feed at the end of 70 characters, and continue on the next line.
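As a stopgap until the parser handles this, the pre-filter mentioned at the top of the issue could look roughly like the sketch below. It is not part of astrocite: `unfoldRis`, `RIS_TAG`, and the `joiner` parameter are hypothetical names, and the tag pattern (two characters, two spaces, a hyphen) is an assumption based on the sample above rather than on a careful reading of the specification.

```typescript
// Assumption: a field starts with a two-character tag, two spaces, and a hyphen,
// followed by a space or the end of the line (as in "ER  -").
const RIS_TAG = /^[A-Z][A-Z0-9] {2}-(?: |$)/;

/**
 * Joins continuation lines onto the tagged line that precedes them.
 * `joiner` (hypothetical parameter) is the text that replaces each line break.
 */
function unfoldRis(input: string, joiner: string = " "): string {
  const out: string[] = [];
  let lastField = -1; // index in `out` of the most recent tagged line

  for (const line of input.split(/\r?\n/)) {
    if (RIS_TAG.test(line)) {
      // A new field starts here.
      lastField = out.push(line) - 1;
    } else if (line.trim() !== "" && lastField >= 0) {
      // Untagged, non-blank line: treat it as a continuation of the last field.
      out[lastField] = `${out[lastField]}${joiner}${line.trim()}`;
    } else {
      // Blank lines, or text before any field, pass through unchanged.
      out.push(line);
    }
  }
  return out.join("\n");
}
```

The unfolded string could then be passed to `parse` from astrocite-ris. Note that this flattens the abstract onto one line, so the original paragraph breaks are lost, and it cannot help when a continuation line itself looks like a tag.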

retorquere commented 4 years ago

That's my experience too -- newlines are allowed, but an RIS parser will get confused if the text after a newline starts with something that looks like a valid field identifier. So encoding the abstract "pretty strange\n\nAS - oops!\n" as "AB - pretty strange\n\nAS - oops!\n" will confuse RIS parsers, but "AB - pretty strange\n\n fine!\n" will parse OK.
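To make that ambiguity concrete, here is a small sketch that flags the lines of a field value a line-based parser could mistake for the start of a new field. `findTagLikeLines` and `TAG_LIKE` are hypothetical names, not astrocite APIs, and the pattern assumes the two-space tag layout used in the sample at the top of the issue ("AB  - ").

```typescript
// Assumption: a tag is two characters, two spaces, and a hyphen,
// followed by a space or the end of the line.
const TAG_LIKE = /^[A-Z][A-Z0-9] {2}-(?: |$)/;

/**
 * Returns the lines of a field value that would look like a new field
 * once the value is written out with its newlines intact.
 */
function findTagLikeLines(value: string): string[] {
  return value
    .split(/\r?\n/)
    .slice(1) // the first line follows the real tag, so it cannot be misread
    .filter((line) => TAG_LIKE.test(line));
}

// An abstract whose third line happens to look like a field:
const risky = findTagLikeLines("pretty strange\n\nAS  - oops!\n");
// An abstract whose continuation line is indented, so it does not:
const safe = findTagLikeLines("pretty strange\n\n fine!\n");
```

A writer could use a check like this to decide whether a value is safe to emit with newlines, or whether it should be flattened first.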