Ingest / create esources field

aaccomazzi commented 6 years ago

We want to create a new field, called esources, in SOLR to capture all the different sources of full-text available for a paper. This will be an array of strings from the following set:

PUB_PDF - paper has a link to publisher PDF fulltext
PUB_HTML - paper has a link to publisher HTML fulltext
ADS_PDF - paper has a link to ADS fulltext
ADS_SCAN - paper has a link to ADS scan
EPRINT_PDF - paper has a link to an eprint PDF (currently only arXiv, but possibly others in the future)
EPRINT_HTML - paper has a link to an eprint HTML (currently only arXiv)
AUTHOR_PDF - paper has a link to an author copy in PDF format
AUTHOR_HTML - paper has a link to an author copy in HTML format

The links data for each value should be created from the corresponding non-bib tables.

The solr field should be multivalued strings, indexed and stored.
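
A minimal sketch of how the list for one paper might be assembled, assuming the non-bib tables have already been read into a per-paper mapping from link type to URLs. The function name `build_esources`, the `links_by_type` mapping, the example URLs and the placeholder bibcode are all hypothetical; only the eight value names come from the list above.

```python
from typing import Dict, Iterable, List

# The eight values listed in this issue, in the order given there.
ESOURCE_VALUES = (
    "PUB_PDF", "PUB_HTML", "ADS_PDF", "ADS_SCAN",
    "EPRINT_PDF", "EPRINT_HTML", "AUTHOR_PDF", "AUTHOR_HTML",
)


def build_esources(links_by_type: Dict[str, Iterable[str]]) -> List[str]:
    """Return the esources values for one paper.

    links_by_type is a hypothetical stand-in for the non-bib tables:
    it maps a link type to the URLs found for this paper.
    """
    unknown = set(links_by_type) - set(ESOURCE_VALUES)
    if unknown:
        raise ValueError(f"unknown link types: {sorted(unknown)}")
    # A value is included only if at least one link of that type exists.
    return [v for v in ESOURCE_VALUES if list(links_by_type.get(v, ()))]


# Hypothetical record: the multivalued field is sent as a list of strings.
doc = {
    "bibcode": "EXAMPLE_BIBCODE",  # placeholder, not a real bibcode
    "esources": build_esources({
        "EPRINT_PDF": ["https://example.org/eprint.pdf"],
        "PUB_HTML": ["https://example.org/article.html"],
        "ADS_SCAN": [],  # no links of this type, so the value is left out
    }),
}
```

Rejecting unknown link types up front keeps values outside the agreed set from reaching the index.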

csgrant00 commented 6 years ago

Will we call all author-submitted pdfs AUTHOR_PDF even if they point to a journal website? Or should we try to separate them?

-Carolyn

aaccomazzi commented 6 years ago

The intent is to have AUTHOR_PDF point to author-hosted content, which will usually be the author's copy of a paper.

I have created the proper directories under /proj/ads/abstracts/config/links to hold all of these tables, and in the process have tried to separate URLs for author-managed articles and PDFs from publisher-supplied ones. The README files should explain what we're trying to accomplish.

aaccomazzi commented 6 years ago

NED object searches

romanchyla commented 6 years ago

Added to Solr, and verified that the data pipeline is sending the values inside the 'esource' field, although I can't see any real values now (because of the bug in the pipeline delivery).
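
Once the delivery bug is fixed, one way to check that real values are arriving is a facet query on the field. The sketch below only builds the request URL using standard Solr query parameters. The helper name `esources_facet_url` and the base URL are assumptions, and the field name is a parameter because this thread uses both 'esource' and 'esources'.

```python
from urllib.parse import urlencode


def esources_facet_url(solr_select_url: str, field: str = "esources") -> str:
    """Build a Solr select URL that counts documents per value of `field`."""
    params = {
        "q": "*:*",           # match every document
        "rows": 0,            # only the facet counts are wanted, not the docs
        "facet": "true",
        "facet.field": field,
        "facet.mincount": 1,  # hide values that no document carries
        "wt": "json",
    }
    return solr_select_url + "?" + urlencode(params)


# Hypothetical local Solr endpoint; substitute the real collection URL.
url = esources_facet_url("http://localhost:8983/solr/collection1/select")
```

An empty facet list in the response would mean no document carries a value in that field yet.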

adsabs / ADSImportPipeline

Ingest / create esources field #163