adsabs / export_service

Export service to output ADS records in various formats, including BibTeX, AASTeX, and multiple tagged and XML options
MIT License

BibTeX ABS export: trailing <P /> #172

Open golnazads opened 4 years ago

golnazads commented 4 years ago

Alberto replied: @Carolyn @Golnaz sorry, I neglected to let you know about this possible markup. Please translate <P /> to blank lines and <BR /> to a newline when outputting in a non-XML format. I think this means that for BibTeX it would be: <P /> => \\
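For illustration, here is a minimal Python sketch of the BibTeX case, including the trailing <P /> from the issue title. It is not the export_service code: the function name `bibtex_paragraphs` is hypothetical, and dropping empty paragraphs (so that a trailing <P /> does not leave a dangling \\) is an assumption about the desired behavior.

```python
import re

# Matches <P />, <P/> and <P>, case-insensitively (assumed tag variants).
PARAGRAPH_TAG = re.compile(r"<P\s*/?>", re.IGNORECASE)


def bibtex_paragraphs(abstract):
    """Join paragraphs with a LaTeX line break (\\\\) instead of <P />.

    Hypothetical helper. Empty paragraphs are dropped, so a trailing
    <P /> produces no dangling separator (an assumption, see above).
    """
    parts = [part.strip() for part in PARAGRAPH_TAG.split(abstract)]
    parts = [part for part in parts if part]
    # " \\\\ " in Python source is the four characters: space, \, \, space.
    return " \\\\ ".join(parts)


if __name__ == "__main__":
    print(bibtex_paragraphs("First paragraph. <P />Second paragraph.<P />"))
```

Splitting on the tag and rejoining, rather than doing a plain string replace, is what removes the trailing separator.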

golnazads commented 4 years ago

@aaccomazzi this is implemented for BibTeX ABS. Do I need to remove these tags elsewhere too, for example in the custom format with unicode encoding? I am guessing it is a yes for latex encoding. If it is also a yes for unicode, then I guess I need to fix that for the XML and fielded formats as well, right? Thank you.

aaccomazzi commented 4 years ago

This is the situation with respect to encoding in our json fields (see e.g. 2016ApJ...818L..26F)

  1. abstract and title text have the basic HTML entities encoded (these are < > and &)
  2. they may also have some markup in the form of <SUB> etc.

When creating custom output, we recognize and support three basic encodings:

  1. HTML: In this case the entities and markup are kept as they are, so &lt; remains &lt;
  2. Latex: In this case the entities and markup are translated according to html -> latex syntax
  3. Unicode: In this case the entities are turned into their unicode equivalents; here that is just the three characters above, which become <, >, and &. The treatment of markup under unicode encoding has never been formally defined in our documentation, and I had to check the classic code to figure out what we are doing here. It turns out classic simply strips the markup: <SUB> -> (empty string)

I feel that the unicode handling of markup done by classic is wrong, because we provide a separate formatting option to control the treatment of markup (%ZMarkup:{keep|strip}), as documented here: http://adsabs.github.io/help/actions/export. So I'm in favor of passing markup through as it is and letting users customize the output via formatting options.

golnazads commented 4 years ago

Just for your information, export already has the markup keep|strip option: https://github.com/adsabs/export_service/blob/master/exportsrv/formatter/customFormat.py#L702. I can remove it if you want, @aaccomazzi.

aaccomazzi commented 4 years ago

We should keep the markup option; this way users can control what they do or do not get. So I think the adjustments to make for unicode encoding are:

  1. <P /> => \n\n (new paragraph)
  2. <BR /> => \n (newline)
  3. &amp;, &gt;, &lt; => &, >, <
  4. All other markup: controlled by %ZMarkup settings
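The four adjustments above could be sketched in Python roughly as follows. This is a hedged illustration, not the code in customFormat.py: the function name `to_unicode` and its `markup` parameter are hypothetical stand-ins for the %ZMarkup setting, and the tag patterns (case-insensitive, optional space before the slash) are assumptions.

```python
import re

# Hypothetical patterns; surrounding whitespace is absorbed into the break.
PARAGRAPH_TAG = re.compile(r"\s*<P\s*/?>\s*", re.IGNORECASE)
LINE_BREAK_TAG = re.compile(r"\s*<BR\s*/?>\s*", re.IGNORECASE)
OTHER_TAG = re.compile(r"</?[A-Za-z][^<>]*>")
ENTITY = re.compile(r"&(amp|lt|gt);")
ENTITY_TO_CHAR = {"amp": "&", "lt": "<", "gt": ">"}


def to_unicode(text, markup="keep"):
    """Apply the four unicode-encoding rules to an abstract or title.

    `markup` mimics %ZMarkup:{keep|strip} (hypothetical parameter name).
    """
    if markup not in ("keep", "strip"):
        raise ValueError("markup must be 'keep' or 'strip'")
    text = PARAGRAPH_TAG.sub("\n\n", text)   # rule 1: new paragraph
    text = LINE_BREAK_TAG.sub("\n", text)    # rule 2: newline
    if markup == "strip":
        text = OTHER_TAG.sub("", text)       # rule 4: remaining markup
    # Rule 3 runs last and in a single pass, so a decoded "<" is never
    # mistaken for markup and "&amp;lt;" decodes to "&lt;", not "<".
    return ENTITY.sub(lambda m: ENTITY_TO_CHAR[m.group(1)], text)


if __name__ == "__main__":
    sample = "H<SUB>2</SUB>O &lt; 1<BR />next line<P />next paragraph"
    print(to_unicode(sample, markup="keep"))
    print(to_unicode(sample, markup="strip"))
```

The ordering matters: because literal < and > arrive encoded as entities, handling the tags before decoding keeps real markup and literal angle brackets from being confused.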