Open golnazads opened 1 month ago
Allow users to submit a classic-format config file, for example:
<config journal="A+AS">
<volume exclude="1:120" />
<volume number="121:126">
<handler type='retrieve' name="AnAShtml" />
<handler type='torefs' name="AnAShtml" publisher="EDP Sciences" suffix='*.txt'/>
</volume>
<volume number="DEFAULT">
<handler type='torefs' name="AnAlatex" publisher="EDP Sciences" suffix='*.raw'/>
</volume>
</config>
and update database accordingly.
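A minimal sketch of how such a config file could be read, using only the Python standard library. This is not the pipeline's actual code. The function name `parse_classic_config` and the output keys are hypothetical; the keys `extension_pattern`, `volume_begin` and `volume_end` are borrowed from the migration snippet quoted later in this thread. It assumes that only `torefs` handlers map to parsers, that `exclude` volumes produce no rows, and that `DEFAULT` can be stored as an open volume range.

```python
import xml.etree.ElementTree as ET

CLASSIC_CONFIG = """
<config journal="A+AS">
  <volume exclude="1:120" />
  <volume number="121:126">
    <handler type='retrieve' name="AnAShtml" />
    <handler type='torefs' name="AnAShtml" publisher="EDP Sciences" suffix='*.txt'/>
  </volume>
  <volume number="DEFAULT">
    <handler type='torefs' name="AnAlatex" publisher="EDP Sciences" suffix='*.raw'/>
  </volume>
</config>
"""


def parse_classic_config(xml_text):
    """Return one dict per 'torefs' handler found in a classic config file."""
    root = ET.fromstring(xml_text.strip())
    journal = root.get("journal")
    rows = []
    for volume in root.findall("volume"):
        number = volume.get("number")
        if number is None:
            # e.g. <volume exclude="1:120" />: nothing to record (assumption)
            continue
        if number == "DEFAULT":
            # Assumption: None/None stands for "every volume not listed elsewhere"
            begin = end = None
        else:
            first, _, last = number.partition(":")
            begin = int(first)
            end = int(last) if last else begin
        for handler in volume.findall("handler"):
            if handler.get("type") != "torefs":
                continue  # assumption: only 'torefs' handlers name a parser
            rows.append({
                "journal": journal,
                "parser": handler.get("name"),
                # '*.txt' -> '.txt'
                "extension_pattern": handler.get("suffix", "").lstrip("*"),
                "volume_begin": begin,
                "volume_end": end,
            })
    return rows


rows = parse_classic_config(CLASSIC_CONFIG)
for row in rows:
    print(row)
```

The resulting rows would still need to be validated and written to the database by whatever command the pipeline ends up exposing.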
Alberto mentioned that it is wasteful to store the file extension in the database. Here is a list of parsers with the extensions they support. I am all for modifying reference file extensions to be unique per parser. The .raw extension is the one used by many types of parsers: XML (AAS, ICARUS, IOPE3), LaTeX (ADStex), text (ADStxt, arXiv, PThPhTXT), and HTML (JLVEnHTML, PASJhtml, PASPhtml). The pipeline is bound to the classic file structure for now, but as soon as classic goes away, I think the extensions can be updated, and then the pipeline can follow.
Extension: .raw
Names: AAS, ADStex, ADStxt, arXiv, ICARUS, IOPE3, JLVEnHTML, PASJhtml, PASPhtml, PThPhTXT
Extension: .ocr.txt
Names: ADSocr, ObsOCR
Extension: .tex
Names: AnAtexE2
Extension: .refs
Names: ADStexE3, ADStxtE2, ThreeBibsTxtE3
Extension: .new
Names: ADStexE4
Extension: .urls.raw
Names: ADStxtE3
Extension: .adstagged.raw
Names: ADStxtE4
Extension: .conf.raw
Names: ADStxtE5
Extension: .html
Names: AEdRvHTML, AnRFMhtml, ARAnAhtml, AREPShtml
Extension: .agu.xml
Names: AGU
Extension: .aip.xml
Names: AIP
Extension: .xml
Names: AIPE2, AnA, Blackwell, ELSEVIERE2, IOPE2, NATUREE2, PASA
Extension: .ref.txt
Names: AnAhtml, ThreeBibsTxtE2
Extension: .txt
Names: AnAShtml
Extension: .aps.xml
Names: APS
Extension: .ref.xml
Names: APSE2
Extension: .tagged
Names: APSE3
Extension: .xref.xml
Names: CrossRef
Extension: .cup.xml
Names: CUP
Extension: .edp.xml
Names: EDP
Extension: .egu.xml
Names: EGU
Extension: .elsevier.xml
Names: ELSEVIER
Extension: .iop.xml
Names: IOP
Extension: .iopft.xml
Names: IOPFT
Extension: .ipap.xml
Names: IPAP
Extension: .jats.xml
Names: JATS
Extension: .jst.xml
Names: JSTAGE
Extension: .living.xml
Names: LivingReviews
Extension: .mdpi.xml
Names: MDPI
Extension: .wiley.xml
Names: MNRAS
Extension: .nature.xml
Names: NATURE
Extension: .nlm3.xml
Names: NLM
Extension: .meta.xml
Names: ONCP
Extension: .oup.xml
Names: OUP
Extension: .pairs
Names: PairsTXT
Extension: .isi.pairs
Names: PairsTXTE2
Extension: .atel.pairs
Names: PairsTXTE3
Extension: .editor.pairs
Names: PairsTXTE4
Extension: .pds.pairs
Names: PairsTXTE5
Extension: .rsc.xml
Names: RSC
Extension: .spie.xml
Names: SPIE
Extension: .springer.xml
Names: SPRINGER
Extension: .ref.raw
Names: ThreeBibsTxt
Extension: .ucp.xml
Names: UCP
Extension: .versita.xml
Names: VERSITA
Extension: .wiley2.xml
Names: WILEY
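To see which extensions would have to change before each one identifies a single parser, the list above can be grouped by extension. The sketch below uses only a subset of the pairs listed above; the names `PARSER_EXTENSIONS` and `shared_extensions` are hypothetical and not part of the pipeline.

```python
from collections import defaultdict

# Subset of the (parser name, extension) pairs listed above.
PARSER_EXTENSIONS = [
    ("AAS", ".raw"), ("ADStex", ".raw"), ("arXiv", ".raw"), ("PASJhtml", ".raw"),
    ("ADSocr", ".ocr.txt"), ("ObsOCR", ".ocr.txt"),
    ("AnAShtml", ".txt"),
    ("AGU", ".agu.xml"),
    ("APS", ".aps.xml"),
]


def shared_extensions(pairs):
    """Return {extension: [parser names]} for extensions used by more than one parser."""
    by_ext = defaultdict(list)
    for name, ext in pairs:
        by_ext[ext].append(name)
    return {ext: names for ext, names in by_ext.items() if len(names) > 1}


shared = shared_extensions(PARSER_EXTENSIONS)
for ext, names in shared.items():
    print(ext, "->", ", ".join(names))
```

Any extension reported here cannot select a parser on its own and still needs the journal and volume information.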
On Thu, Jul 11, 2024 at 8:33 AM Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu wrote:
From googledoc
Alberto, this is related to issue #5, which I showed yesterday when I said that I think the mapping is not correct.
Edwin, the mapping is done on the reference pipeline side. I will check out issue #65 in the reference service, and if it is related to this then I will bring it to the pipeline side.
I thought the pipeline could have a command that allows users to add or modify what are known in classic as config files (i.e., the mapping of bibstem/directory/file extension to parser). Since curators are familiar with classic's config files, they could create new config files or revise old ones and submit them to the pipeline, which would read them and update the database.
Thoughts?
golnaz
========================================================================================
On Thu, Jul 11, 2024 at 10:40 AM Accomazzi, Alberto aaccomazzi@cfa.harvard.edu wrote:
I am confused... I always thought the bibstem mappings are part of the reference service (https://github.com/adsabs/reference_service/tree/master/referencesrv/resolver/sourcematcher_dat), not the reference pipeline side.
Either way, we need a clear way to update them. I think being able to send the mapping data is a good idea, so long as it's simple and secure (only authorized users should be allowed to update the mappings).
-- AA
========================================================================================
On Thu, Jul 11, 2024 at 10:57 AM Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu wrote:
Maybe this is another case of an overused word, lol.
So the mapping I am talking about here takes a reference file path and maps it to a specific parser. The equivalent of the config files that classic uses to go from path to parser lives, rightly so, in the pipeline database: ADSReferencePipeline/alembic/versions/55d2bf274509_created_db_structure.py at master · adsabs/ADSReferencePipeline (github.com). For example: {'name': 'PASJhtml', 'extension_pattern': '.raw', 'reference_service_endpoint': '/text', 'matches': [ {"journal": "PASJ", "volume_begin": 51, "volume_end": 53}, ]},
The PASJhtml parser will process all reference files with the extension .raw that have the bibstem PASJ and volumes 51 to 53 inclusive. So any of the files PASJ/0051/*.raw, PASJ/0052/*.raw, PASJ/0053/*.raw will be parsed by PASJhtml.
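The lookup described above can be sketched as follows. This is an illustration only, not the pipeline's implementation. The function name `find_parser` is hypothetical, the file names in the usage lines are made up, and the code assumes paths always end in bibstem/volume/filename with a numeric volume directory.

```python
from pathlib import PurePosixPath

# Record copied from the migration example quoted above.
PARSERS = [
    {"name": "PASJhtml", "extension_pattern": ".raw",
     "reference_service_endpoint": "/text",
     "matches": [{"journal": "PASJ", "volume_begin": 51, "volume_end": 53}]},
]


def find_parser(path, parsers=PARSERS):
    """Return the parser name for a reference file path, or None if nothing matches."""
    parts = PurePosixPath(path).parts
    if len(parts) < 3:
        return None
    # Assumption: the last three path components are bibstem/volume/filename
    journal, volume_dir, filename = parts[-3], parts[-2], parts[-1]
    if not volume_dir.isdigit():
        return None
    volume = int(volume_dir)  # "0051" -> 51
    for parser in parsers:
        if not filename.endswith(parser["extension_pattern"]):
            continue
        for match in parser["matches"]:
            if (match["journal"] == journal
                    and match["volume_begin"] <= volume <= match["volume_end"]):
                return parser["name"]
    return None


print(find_parser("PASJ/0051/example.raw"))  # inside the 51-53 range
print(find_parser("PASJ/0054/example.raw"))  # outside the range
```

Note that a plain suffix test would also accept names such as example.ref.raw for a '.raw' pattern, so a real implementation would need a rule for overlapping extensions.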
I know you know all these. What do you call this process that I call mapping?
For authorized/secure users, since this is a pipeline, I am guessing only authorized/secure users have access to it, right?
golnaz
========================================================================================
On Thu, Jul 11, 2024 at 11:14 AM Accomazzi, Alberto aaccomazzi@cfa.harvard.edu wrote:
Ok, then we have been talking at cross-purposes.
It sounds like Edwin was discussing updating mappings between bibstems and journal names (reference_service issue #65), while you are talking about file mapping on the pipeline side (ADSReferencePipeline #5).
Now I understand, and yes, there should be a way to update this type of configuration, although I'm not sure what's best. Ideally we should use file extensions to indicate the format wherever appropriate; the fact that we have so many "*.raw" files that require their own configuration seems wasteful.
-- AA
========================================================================================
Shapurian, Golnaz golnaz.shapurian@cfa.harvard.edu Thu, Jul 11, 11:50 AM
to Alberto, Edwin
Yes, there are a couple of other instances where reference service issues have been mixed in with reference pipeline feedback.
OK, so I am getting neither a "good idea, go ahead and implement it" nor a "might not be needed, why waste time" answer from you, Alberto. So I am going to create the issue on GitHub and leave it there until there are at least 3 requests to update, delete, or add these mappings in the database; at that point the functionality can be implemented.
golnaz