adsabs / ADSReferencePipeline

Pipeline to resolve references (i.e., match each one with its record in ADS)
MIT License

Add command to allow power user to add/delete/modify mapping between reference file and parser #6

Open golnazads opened 1 month ago

golnazads commented 1 month ago

On Thu, Jul 11, 2024 at 8:33 AM Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu wrote:

From googledoc

Golnaz Shapurian • 7:56 AM, Jul 1 (EDT) I have to check this out; my guess is that the mapping is not recognized. Is this a new bibstem? It is not familiar!

Edwin Henneken • 3:35 PM, Jul 1 (EDT) New. The reference service should be updated on a continuous basis with new bibstems (see reference service issue #65)

Alberto, this is related to issue #5, which I showed yesterday when I said that I think the mapping is not correct.

Edwin, the mapping is done on the reference pipeline side. I will check out issue #65 in the reference service, and if it is related to this I will bring it over to the pipeline side.

I thought the pipeline could have a command that lets users add/modify what classic calls config files (i.e., the mapping of bibstem/directory/file extension to parser). Since curators are already familiar with classic's config files, the idea is to let them create new config files or revise old ones and submit them to the pipeline, which would read them and update the database.

Thoughts?

golnaz

========================================================================================

On Thu, Jul 11, 2024 at 10:40 AM Accomazzi, Alberto aaccomazzi@cfa.harvard.edu wrote:

I am confused... I always thought the bibstem mappings are part of the reference service (https://github.com/adsabs/reference_service/tree/master/referencesrv/resolver/sourcematcher_dat), not the reference pipeline side.

Either way, we need a clear way to update them. I think being able to send the mapping data is a good idea, so long as it's simple and secure (only authorized users should be allowed to update the mappings).

-- AA

========================================================================================

On Thu, Jul 11, 2024 at 10:57 AM Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu wrote:

Maybe this is another case of an overused word, lol.

So the mapping I am talking about here takes a reference file path and maps it to a specific parser. What classic keeps in config files to go from path to parser lives, rightly so, in the pipeline's database (ADSReferencePipeline/alembic/versions/55d2bf274509_created_db_structure.py at master in adsabs/ADSReferencePipeline). For example:

{'name': 'PASJhtml', 'extension_pattern': '.raw', 'reference_service_endpoint': '/text', 'matches': [ {"journal": "PASJ", "volume_begin": 51, "volume_end": 53}, ]},

The PASJhtml parser will process all reference files with the extension raw that have the bibstem PASJ and a volume from 51 to 53 inclusive. So any of the files PASJ/0051/*.raw, PASJ/0052/*.raw, and PASJ/0053/*.raw will be parsed by PASJhtml.
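To make the path-to-parser mapping concrete, here is a minimal sketch of how such an entry could be applied to a file path. Only the PASJhtml entry comes from the example above; the `PARSERS` list, the `find_parser` helper, and the assumed `<bibstem>/<volume>/<file>` path layout are illustrative and are not taken from the pipeline code.

```python
import re

# Entry copied from the example above; wrapping it in a list is an assumption.
PARSERS = [
    {
        "name": "PASJhtml",
        "extension_pattern": ".raw",
        "reference_service_endpoint": "/text",
        "matches": [{"journal": "PASJ", "volume_begin": 51, "volume_end": 53}],
    },
]

# Assumed path layout: <bibstem>/<zero-padded volume>/<file name>
PATH_RE = re.compile(r"(?P<journal>[^/]+)/(?P<volume>\d+)/(?P<filename>[^/]+)$")


def find_parser(path, parsers=PARSERS):
    """Return the name of the first parser whose entry covers `path`, or None."""
    m = PATH_RE.search(path)
    if m is None:
        return None
    journal, volume, filename = m["journal"], int(m["volume"]), m["filename"]
    for parser in parsers:
        # Plain suffix test; a simplification of whatever pattern matching
        # the pipeline really does.
        if not filename.endswith(parser["extension_pattern"]):
            continue
        for rule in parser["matches"]:
            if (rule["journal"] == journal
                    and rule["volume_begin"] <= volume <= rule["volume_end"]):
                return parser["name"]
    return None


print(find_parser("PASJ/0052/some_file.raw"))
print(find_parser("PASJ/0054/some_file.raw"))
```

With the single entry above, the first call finds PASJhtml and the second finds nothing, because volume 54 is outside the 51 to 53 range.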

I know you know all these. What do you call this process that I call mapping?

As for authorized/secure users: since this is a pipeline, I am guessing only authorized users have access to it, right?

golnaz

========================================================================================

On Thu, Jul 11, 2024 at 11:14 AM Accomazzi, Alberto aaccomazzi@cfa.harvard.edu wrote:

Ok, then we have been talking at cross-purposes.

It sounds like Edwin was discussing updating mappings between bibstems and journal names (reference_service issue #65), while you are talking about file mapping on the pipeline side (ADSReferencePipeline #5).

Now I understand, and yes, there should be a way to update this type of configuration, although I'm not sure what's best. Ideally we should use file extensions to indicate the format wherever appropriate; the fact that we have so many "*.raw" files that require their own configuration seems wasteful.

-- AA

========================================================================================

Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu Thu, Jul 11, 11:50 AM
to Alberto, Edwin

Yes, there are a couple of other instances where reference service issues have been mixed in with reference pipeline feedback.

OK, so I am getting neither a "good idea, go ahead and implement it" nor a "might not be needed, why waste time" answer from you, Alberto. So I am going to create the issue in GitHub and leave it there until there are at least 3 requests to update/delete/add these mappings in the database; at that point the functionality can be implemented.

golnaz

golnazads commented 1 month ago

Allow users to send a classic-format config file, for example

<config journal="A+AS">
  <volume exclude="1:120" />
  <volume number="121:126">
     <handler type='retrieve' name="AnAShtml" />
     <handler type='torefs'   name="AnAShtml" publisher="EDP Sciences" suffix='*.txt'/>
  </volume>
  <volume number="DEFAULT">
     <handler type='torefs'   name="AnAlatex" publisher="EDP Sciences" suffix='*.raw'/>
  </volume>
</config>

and update the database accordingly.
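As a rough illustration of what such a command might do, the sketch below reads the config shown above with Python's standard `xml.etree.ElementTree` and turns each `torefs` handler into a pipeline-style entry. The `classic_to_entries` helper is hypothetical, and several choices are assumptions rather than pipeline behavior: `retrieve` handlers and `exclude` ranges are skipped, `DEFAULT` is represented as an open range with `None` bounds, and `reference_service_endpoint` is left out because the classic file does not carry it.

```python
import xml.etree.ElementTree as ET

# The classic-format config from the comment above.
CLASSIC_CONFIG = """
<config journal="A+AS">
  <volume exclude="1:120" />
  <volume number="121:126">
     <handler type='retrieve' name="AnAShtml" />
     <handler type='torefs'   name="AnAShtml" publisher="EDP Sciences" suffix='*.txt'/>
  </volume>
  <volume number="DEFAULT">
     <handler type='torefs'   name="AnAlatex" publisher="EDP Sciences" suffix='*.raw'/>
  </volume>
</config>
"""


def classic_to_entries(xml_text):
    """Convert one classic <config> block into pipeline-style parser entries."""
    root = ET.fromstring(xml_text.strip())
    journal = root.get("journal")
    entries = []
    for volume in root.findall("volume"):
        number = volume.get("number")
        if number is None:
            # <volume exclude="..."/> has no handler, so there is nothing to map.
            continue
        if number == "DEFAULT":
            # Assumption: None bounds stand for "every volume not listed elsewhere".
            begin, end = None, None
        else:
            first, _, last = number.partition(":")
            begin, end = int(first), int(last or first)
        for handler in volume.findall("handler"):
            if handler.get("type") != "torefs":
                continue  # only the handler that names the parser is kept
            entries.append({
                "name": handler.get("name"),
                "extension_pattern": handler.get("suffix", "").lstrip("*"),
                "matches": [{"journal": journal,
                             "volume_begin": begin,
                             "volume_end": end}],
            })
    return entries


for entry in classic_to_entries(CLASSIC_CONFIG):
    print(entry)
```

A real command would still need to decide how to reconcile these entries with rows already in the database (add, replace, or delete), which this sketch does not attempt.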

golnazads commented 1 month ago

Alberto mentioned that it is wasteful to have the file extension in the database. Here is a list of parsers with the extensions they support. I am all for modifying reference file extensions to be unique per parser. raw is the one that is used for many types of parsers: xml (AAS, ICARUS, IOPE3), latex (ADStex), text (ADStxt, arXiv, PThPhTXT), and html (JLVEnHTML, PASJhtml, PASPhtml). The pipeline is bound to the classic file structure for now, but as soon as classic goes away, I think the extensions can be updated, and then the pipeline can follow.

Extension: .raw
Names: AAS, ADStex, ADStxt, arXiv, ICARUS, IOPE3, JLVEnHTML, PASJhtml, PASPhtml, PThPhTXT

Extension: .ocr.txt
Names: ADSocr, ObsOCR

Extension: .tex
Names: AnAtexE2

Extension: .refs
Names: ADStexE3, ADStxtE2, ThreeBibsTxtE3

Extension: .new
Names: ADStexE4

Extension: .urls.raw
Names: ADStxtE3

Extension: .adstagged.raw
Names: ADStxtE4

Extension: .conf.raw
Names: ADStxtE5

Extension: .html
Names: AEdRvHTML, AnRFMhtml, ARAnAhtml, AREPShtml

Extension: .agu.xml
Names: AGU

Extension: .aip.xml
Names: AIP

Extension: .xml
Names: AIPE2, AnA, Blackwell, ELSEVIERE2, IOPE2, NATUREE2, PASA

Extension: .ref.txt
Names: AnAhtml, ThreeBibsTxtE2

Extension: .txt
Names: AnAShtml

Extension: .aps.xml
Names: APS

Extension: .ref.xml
Names: APSE2

Extension: .tagged
Names: APSE3

Extension: .xref.xml
Names: CrossRef

Extension: .cup.xml
Names: CUP

Extension: .edp.xml
Names: EDP

Extension: .egu.xml
Names: EGU

Extension: .elsevier.xml
Names: ELSEVIER

Extension: .iop.xml
Names: IOP

Extension: .iopft.xml
Names: IOPFT

Extension: .ipap.xml
Names: IPAP

Extension: .jats.xml
Names: JATS

Extension: .jst.xml
Names: JSTAGE

Extension: .living.xml
Names: LivingReviews

Extension: .mdpi.xml
Names: MDPI

Extension: .wiley.xml
Names: MNRAS

Extension: .nature.xml
Names: NATURE

Extension: .nlm3.xml
Names: NLM

Extension: .meta.xml
Names: ONCP

Extension: .oup.xml
Names: OUP

Extension: .pairs
Names: PairsTXT

Extension: .isi.pairs
Names: PairsTXTE2

Extension: .atel.pairs
Names: PairsTXTE3

Extension: .editor.pairs
Names: PairsTXTE4

Extension: .pds.pairs
Names: PairsTXTE5

Extension: .rsc.xml
Names: RSC

Extension: .spie.xml
Names: SPIE

Extension: .springer.xml
Names: SPRINGER

Extension: .ref.raw
Names: ThreeBibsTxt

Extension: .ucp.xml
Names: UCP

Extension: .versita.xml
Names: VERSITA

Extension: .wiley2.xml
Names: WILEY
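The listing shows two things a lookup has to cope with: some extensions are shared by several parsers (.raw, .xml), and some extensions are suffixes of others (.raw versus .ref.raw, .xml versus .iop.xml). The sketch below, built on a hand-picked subset of the listing, groups parsers by extension and picks the longest matching extension for a file name. The helper names are hypothetical, and longest-suffix matching is one possible way to resolve the overlap, not necessarily what the pipeline does.

```python
from collections import defaultdict

# A subset of the listing above: parser name -> extension.
PARSER_EXTENSIONS = {
    "AAS": ".raw", "ADStex": ".raw", "PASJhtml": ".raw",
    "ThreeBibsTxt": ".ref.raw", "ADStxtE3": ".urls.raw",
    "AnA": ".xml", "IOPE2": ".xml", "IOP": ".iop.xml", "JATS": ".jats.xml",
    "AnAShtml": ".txt", "ADSocr": ".ocr.txt", "ObsOCR": ".ocr.txt",
}


def parsers_by_extension(mapping):
    """Group parser names by the extension they are registered for."""
    grouped = defaultdict(list)
    for name, ext in mapping.items():
        grouped[ext].append(name)
    return {ext: sorted(names) for ext, names in grouped.items()}


def candidate_parsers(filename, mapping=PARSER_EXTENSIONS):
    """Parsers registered for the longest extension that `filename` ends with."""
    grouped = parsers_by_extension(mapping)
    matching = [ext for ext in grouped if filename.endswith(ext)]
    if not matching:
        return []
    return grouped[max(matching, key=len)]


print(candidate_parsers("0001.iop.xml"))
print(candidate_parsers("0001.raw"))
```

Where more than one candidate comes back, as with .raw, the extension alone cannot choose the parser and the journal/volume rule is still required. That is the case unique per-parser extensions would remove.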