adsabs / resolver_service

Linkout service to provide users with links to external resources such as publisher's full text, data links, etc.
MIT License

blind links #55

Open golnazads opened 4 years ago

golnazads commented 4 years ago

I am actually not in favor of having the same data in multiple places; however, we do have the same information in different formats, for example for esources and data, in both solr and resolver. So I want to propose that we add the following link_types to resolver to make it "intelligent": basically, let resolver create the links that it is now creating blindly only if the link_type actually exists in the resolver db. This covers what we call on-the-fly links, except for TOC https://github.com/adsabs/resolver_service/blob/master/resolversrv/views.py#L46 and identifications https://github.com/adsabs/resolver_service/blob/master/resolversrv/views.py#L75 (a rough sketch of this check follows after the list below).

We need to add link_types as follows:

Add link_types citations and references, and create records for a bibcode in resolver, using [citations] from solr, if num_citations > 0 and num_references > 0 respectively.

Add link_type co-read and create a record for a bibcode in resolver if, from solr, read_count > 0.

Add two link_types, arxiv and doi, and create records for a bibcode in resolver, filling the url field, if there is an arxiv_id in the solr identifier field and if there is a doi. Similar to solr, the url field in resolver is an array, so it can handle multiple doi strings.

Add link_type similar and create a record for a bibcode in resolver if there is an abstract in solr.

Add link_type metrics and create a record for a bibcode in resolver if solr has the metrics_mtime field.

The only one remaining is graphics.

Do we need a link_type abstract, or should we just create an abstract link whenever the bibcode exists in the resolver db?
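
to make that first point concrete, here is a rough sketch of the kind of check I mean; the trimmed-down model and helper below are just stand-ins, not the actual resolver_service code:

    # rough sketch only: emit an "on the fly" link only if resolver's datalinks
    # table actually has a row for this bibcode/link_type
    from sqlalchemy import Column, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class DataLinks(Base):                  # trimmed-down stand-in for the real model
        __tablename__ = 'datalinks'
        bibcode = Column(String, primary_key=True)
        link_type = Column(String, primary_key=True)

    def link_if_known(session, bibcode, link_type):
        """Return a gateway path only when resolver knows about this link_type."""
        exists = (session.query(DataLinks)
                         .filter(DataLinks.bibcode == bibcode,
                                 DataLinks.link_type == link_type.upper())
                         .first() is not None)
        if not exists:
            return None                     # caller would turn this into a 404
        return '/link_gateway/%s/%s' % (bibcode, link_type)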

aaccomazzi commented 4 years ago

Sounds good in general, but I wonder about implementation. Making the creation of these fields dependent on SOLR means needing to constantly worry about synchronization.

Reading the first part of the proposal my mind was going in the direction of having nonbib provide the additional needed info to the resolver service (citation, references, read_count, are all known to nonbib). But then the remaining fields (arXiv id, doi, similar, metrics) would require data that right now is only in the bib pipeline or solr. Metrics would apply to any valid bibcode, and the existence of graphics is unknown a priori.

An alternate solution comes to mind, which we should at least consider: for the links which require a SOLR query and inspection of the relevant fields, we could consider having the resolver service perform the search and cache the results for a day or so. If I'm correct, most of the accesses will follow Zipf's law (recent and well-cited papers will get a lot of clicks and all others very few) so a cache would be very effective.
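
Purely to illustrate the caching idea (not an implementation plan), here is a tiny in-process TTL cache around a solr lookup; fetch_from_solr is a stand-in for whatever query function we would actually use:

    # illustration only: cache solr lookups for about a day, per the idea above
    import time

    _cache = {}            # bibcode -> (timestamp, solr fields)
    TTL = 24 * 3600        # keep entries for roughly a day

    def solr_fields_for(bibcode, fetch_from_solr):
        """Return cached solr fields for bibcode, refreshing once per TTL."""
        now = time.time()
        hit = _cache.get(bibcode)
        if hit and now - hit[0] < TTL:
            return hit[1]
        fields = fetch_from_solr(bibcode)   # e.g. citation_count, read_count, doi, ...
        _cache[bibcode] = (now, fields)
        return fields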

golnazads commented 4 years ago

cool. thank you so much.

Populating citation, references, and read_count from nonbib makes resolver more intelligent. Thank you for pointing this out.

Tim mentioned that "so for metrics and graphics [BBB] just makes the calls for them and shows the button as clickable if it gets data back." So I was wondering, would it be OK for resolver service to make calls to metrics and graphics instead of saving information to its database?
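
if we go that route, something as small as the sketch below might be all resolver needs; the base_url and token here are stand-ins, since the actual metrics/graphics endpoints and response shapes would need to be confirmed:

    # sketch only: ask a downstream service (metrics or graphics) whether it has
    # data for a bibcode; base_url and token are stand-ins, not real endpoints
    import requests

    def service_has_data(base_url, bibcode, token, timeout=1):
        """True if the downstream service returns data for this bibcode."""
        try:
            r = requests.get('%s/%s' % (base_url, bibcode),
                             headers={'Authorization': 'Bearer %s' % token},
                             timeout=timeout)
        except requests.RequestException:
            return False
        return r.status_code == 200 and bool(r.content)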

So we are left with arXiv id, doi, and similar. I shall ask Steve whether these three fields can be sent to resolver when populating solr.

One thing I just verified is that Steve is taking care of deleting records from the resolver db when they get removed from solr, for example when arXiv records merge. Here is an example: 2019arxiv191104406D is the counterpart of 2020Sci...........D and no longer exists in resolver. So the mechanism to synchronize with solr from the master pipeline is in place.

resolver_service=> select * from datalinks where bibcode='2019arxiv191104406D';
 bibcode | link_type | link_sub_type | url | title | item_count 
---------+-----------+---------------+-----+-------+------------
(0 rows)

resolver_service=> select * from datalinks where bibcode='2020Sci...........D';
       bibcode       | link_type | link_sub_type | url                                            | title | item_count 
---------------------+-----------+---------------+------------------------------------------------+-------+------------
 2020Sci...........D | ESOURCE   | EPRINT_HTML   | {https://arxiv.org/abs/1911.04406}             | {}    |          0
 2020Sci...........D | ESOURCE   | EPRINT_PDF    | {https://arxiv.org/pdf/1911.04406}             | {}    |          0
 2020Sci...........D | ESOURCE   | PUB_HTML      | {https://doi.org/10.1126%2Fscience.aax8156,https://doi.org/10.1126%2Fscience.aaz3071,https://doi.org/10.1126%2Fscience.aba2089,https://doi.org/10.1126%2Fscience.aba3993} | {}    |          0
 2020Sci...........D | ESOURCE   | PUB_PDF       | {http://www.sciencemag.org/content////tab-pdf} | {}    |          0
(4 rows)

romanchyla commented 4 years ago

I hope I'm commenting on the right issue :)

Any desire to avoid data duplication is rather noble, but vain -- where it makes sense, the data must be duplicated. And does it make sense for resolver? It seems so. And should it be as fast as possible? Definitely. Based on what I'm seeing (but maybe I'm confusing that with gateway?)

and the problem of being out of sync? I think it is actually no problem. The data is passing through the master pipeline, so we should modify the master pipeline to send the same data to the resolver and to solr at the same time (that should be the default; the mode of running only solr updates was never meant to become the norm, it only became so with usage -- i guess we have some habits to break)

so i guess, my recommendation is: let's treat resolver service as a first class citizen and serve it its own data

golnazads commented 4 years ago

"Any desire to avoid data duplication is rather noble, but vain" -- vain? lol.

question: what about metrics and graphics? should the data for metrics and graphics go into resolver (it just needs a yes/no), or should resolver ask each service?

romanchyla commented 4 years ago

data that are directly useful to resolver should go to resolver - it will hopefully be only a fraction of what metrics stores

golnazads commented 4 years ago

good morning. thank you so much.

so here is what needs to happen (see the sketch below):

1- add a record with bibcode and link_type abstract for any canonical bibcode
2- remove all records for a bibcode if it becomes non-canonical
3- add a record with bibcode and link_type citations for any bibcode with num_citations > 0; maybe save num_citations in the count column -- it is not strictly necessary to keep the count updated with new values, but I think we should either populate the count and keep it up to date or not add it at all, thoughts?
4- add a record with bibcode and link_type references for any bibcode with num_references > 0
5- add a record with bibcode and link_type co_reads for any bibcode with co_reads > 0
6- add a record with bibcode and link_type graphics for any bibcode with graphics > 0
7- add a record with bibcode and link_type metrics for any bibcode with metrics > 0
8- add a record with bibcode and link_type doi, assigning the doi to url, for any bibcode with a doi
9- add a record with bibcode and link_type arxiv, assigning the arxiv id to url, for any arxiv bibcode

so basically we are adding 8 records to resolver for each unique bibcode we have now, and, for the bibcodes that do not have esources/data/associated/toc/etc and are therefore not in the resolver db now, 8 records for each of those as well.
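
purely as an illustration of the list above, here is how those conditions might translate into rows; the field names follow this thread, and the rows_for helper itself is hypothetical:

    # illustration only: doc is assumed to be a dict of solr/nonbib values for one
    # canonical bibcode; returns (bibcode, link_type, url) tuples per the list above
    def rows_for(bibcode, doc):
        rows = [(bibcode, 'ABSTRACT', None)]                    # 1- every canonical bibcode
        if doc.get('num_citations', 0) > 0:
            rows.append((bibcode, 'CITATIONS', None))           # 3-
        if doc.get('num_references', 0) > 0:
            rows.append((bibcode, 'REFERENCES', None))          # 4-
        if doc.get('read_count', 0) > 0:
            rows.append((bibcode, 'COREADS', None))             # 5-
        if doc.get('graphics'):
            rows.append((bibcode, 'GRAPHICS', None))            # 6-
        if doc.get('metrics'):
            rows.append((bibcode, 'METRICS', None))             # 7-
        if doc.get('doi'):
            rows.append((bibcode, 'DOI', doc['doi']))           # 8- url holds the doi(s)
        if doc.get('arxiv_id'):
            rows.append((bibcode, 'ARXIV', [doc['arxiv_id']]))  # 9- url holds the arxiv id
        return rows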

I thought about getting creative and combining these records into one, having one digit represent each of the above link_types; it would save space, but it makes fetching and storing the record more complicated. Thoughts?

golnazads commented 4 years ago

there are a little over 19 million records in the resolver db right now.

resolver_service=> select count(*) from datalinks;
  count   
----------
 19655830
(1 row)

these include only the 8 link_types that we are currently supporting

resolver_service=> SELECT distinct on (link_type) link_type from datalinks;
   link_type    
----------------

 ASSOCIATED
 DATA
 ESOURCE
 INSPIRE
 LIBRARYCATALOG
 PRESENTATION
 TOC
(8 rows)

now, to what I was saying above: for every canonical bibcode we have (from Alberto, 14,399,533 in /proj/ads/abstracts/config/bibcodes.list.can), in the worst case (when all the new link types listed above are available for a record) we are adding 14 million * 8 records on top of the 19 million that we have, which brings it to 112 + 19 = 131 million records, and growing.

aaccomazzi commented 4 years ago

Since it becomes expensive to populate the datalinks table for properties such as these (where no links exist), we should at least consider the possibility of having a separate table which captures "attributes" of the record. Since the table would be designed for individual bibcode lookups, we wouldn't actually need to create an index on anything other than the bibcodes, so we could efficiently store the values we need as a json blob. This leads to a single row for each bibcode in our system, e.g. a json structure such as:

        "bibcode":"2019ApJ...887..252M",
        "identifier":["2019arXiv190911682M",
          "2019ApJ...887..252M",
          "10.3847/1538-4357/ab58c3",
          "10.3847/1538-4357/ab58c3",
          "arXiv:1909.11682",
          "2019arXiv190911682M"],
        "citation_count":3,
        "doi":["10.3847/1538-4357/ab58c3"],
        "arXiv":"1909.11682",
        "metrics": true,
        "graphics": true

Here I included the full list of identifiers because this may also provide a good way to resolve non-canonical bibcodes or arXiv ids right off the bat (by indexing that json field).
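
A minimal sketch, just to make the shape of such a table concrete; the table and column names (record_attributes, properties) are placeholders rather than a settled design:

    # minimal sketch of a separate per-bibcode "attributes" table holding a json blob
    from sqlalchemy import Column, String
    from sqlalchemy.dialects.postgresql import JSONB
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class RecordAttributes(Base):
        __tablename__ = 'record_attributes'     # placeholder name
        # designed for individual bibcode lookups, so only the bibcode is indexed
        bibcode = Column(String, primary_key=True)
        # blob with identifier, citation_count, doi, arXiv, metrics, graphics, ...
        properties = Column(JSONB)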

golnazads commented 4 years ago

thank you so much. I had not thought about the identifier, which YES we need. I had also never thought about storing json in the database. I would however vote for treating doi and arxiv like the current records that we have. I wonder how slow it is going to be to figure out whether we need to update the record when, for example, citation_count changes.

golnazads commented 4 years ago

https://docs.google.com/spreadsheets/d/1snAzDY0ptaK7tn5uzPngSbua4HpuepSbtfUQCO_1SPg/edit?usp=sharing

resolver_service=> SELECT pg_size_pretty( pg_total_relation_size('datalinks'));
 pg_size_pretty 
----------------
 6206 MB
(1 row)

resolver_service=> SELECT count(*) from datalinks;
  count   
----------
 19666277
(1 row)

romanchyla commented 4 years ago

hi! before proposing certain schema, let me spend some time on the reasons why i think that schema would be beneficial:

• use resolver to resolve known and reject unknown: resolver is right now not verifying the existence of an identifier, so it is possible to make the following request

curl 'https://ui.adsabs.harvard.edu/v1/resolver/foo' -H 'Authorization: Bearer xxxxxx'

{"action": "display", "links": {"count": 17, "records": [{"url": "/link_gateway/foo/METRICS", "count": 1, "bibcode": "foo", "type": "metrics", "title": "METRICS (1)"}, {"url": "/link_gateway/foo/CITATIONS", "count": 1, "bibcode": "foo", "type": "citations", "title": "CITATIONS (1)"}, {"url": "/link_gateway/foo/REFERENCES", "count": 1, "bibcode": "foo", "type": "references", "title": "REFERENCES (1)"}, {"url": "/link_gateway/foo/SIMILAR", "count": 1, "bibcode": "foo", "type": "similar", "title": "SIMILAR (1)"}, {"url": "/link_gateway/foo/GRAPHICS", "count": 1, "bibcode": "foo", "type": "graphics", "title": "GRAPHICS (1)"}, {"url": "/link_gateway/foo/ABSTRACT", "count": 1, "bibcode": "foo", "type": "abstract", "title": "ABSTRACT (1)"}, {"url": "/link_gateway/foo/OPENURL", "count": 1, "bibcode": "foo", "type": "openurl", "title": "OPENURL (1)"}, {"url": "/link_gateway/foo/COREADS", "count": 1, "bibcode": "foo", "type": "coreads", "title": "COREADS (1)"}, {"url": "/link_gateway/foo/ARXIV", "count": 1, "bibcode": "foo", "type": "arxiv", "title": "ARXIV (1)"}, {"url": "/link_gateway/foo/DOI", "count": 1, "bibcode": "foo", "type": "doi", "title": "DOI (1)"}], "link_type": "all"}, "service": ""}

now imagine how this problem would be solved with two db schemas:

1) current, if expanded with additional information - with 138M entries
2) new schema, with one row per canonical bibcode - 15M entries

what is likely to be more expensive: to search/index 138M rows or 15M rows? What is going to be more storage intensive?

my bet is on 1) -- on both accounts; the overhead of managing 10x more rows is likely to take its toll - it is both slower to retrieve all recs (links, subtypes) and to populate the response, and it takes longer to index/update the db

• sparse matrix: currently, the db schema has rows for bibcode:link_type:subtype - so if there are 14M canonical bibcodes + 5M alternate bibcodes, then there are 19M rows; so while it might seem more economical, it flips around when the number of alternate bibcodes increases (notice the information stored for link_type and subtype will be identical for alternate and canonical entries). If we were to include other identifiers (dois), then it definitely kills any gains.

• final argument: link types and subtypes don't need to be searchable

right now, they are searched; they have the following structure:

    __tablename__ = 'datalinks'
    bibcode = Column(String, primary_key=True)
    link_type = Column(String, primary_key=True)
    link_sub_type = Column(String, primary_key=True)
    url = Column(ARRAY(String))
    title = Column(ARRAY(String))
    item_count = Column(Integer)

which is in itself curious: it means that any given bibcode can have at most one entry for a given link_type and link_sub_type combination (i.e. there can be no more than 1, and there might be 0) -- this is important

all this information could be stored in a blob -- all that needs to be searched is the bibcode; retrieve the blob and verify whether a given link_type is present

But the main argument I'd like to make for the db schema is perhaps simplicity - if only canonical bibcodes are stored, then even deletions become possible

imagine the following schema:

    __tablename__ = 'datalinks'
    bibcode = Column(String, primary_key=True)
    identifiers = Column(ARRAY(String), index=True)
    links = Column(JSON)
    url = Column(ARRAY(String))
    title = Column(ARRAY(String))
    item_count = Column(Integer)

the individual attributes (even the additional ones) can be stored in that links json; for better performance, if we need to query for those attributes, the JSON can be indexed -- for slightly better performance (during indexing) we would unfurl those attributes -- i.e. every attribute would have its own column

now to the problem of deleted and alternate bibcodes: they can exist (peacefully) in the database until such a time as they get removed by some cleanup process; because every record has a canonical bibcode and identifiers, it will be found one way or the other (provided that alternate bibcodes get inserted into the identifiers field - which they do). Only if we wanted to do resolution - from alternate to canonical - would we have to do a second query; but that is already going past the current functionality (and would be the second improvement - the first being: knowing what does not exist and actually returning 404 for those)

i'm confident master pipeline can support this operation - when new data comes, it can re-construct the appropriate data to be sent to resolver and also call updates/deletes when necessary
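
to make the lookup side concrete, a minimal sketch against a schema along these lines; the ARRAY/JSONB column types, the assumption that the links blob is keyed by link_type, and the helper name are all illustrative:

    # minimal sketch: one row per canonical bibcode, looked up by any identifier;
    # assumes the links blob is keyed by link_type (an illustrative layout)
    from sqlalchemy import Column, String
    from sqlalchemy.dialects.postgresql import ARRAY, JSONB
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class DataLinksRow(Base):
        __tablename__ = 'datalinks'
        bibcode = Column(String, primary_key=True)
        # alternate bibcodes, arXiv ids, dois; in practice a GIN index would back lookups
        identifiers = Column(ARRAY(String))
        links = Column(JSONB)

    def has_link_type(session, identifier, link_type):
        """True if the record matching any identifier carries the given link_type."""
        row = (session.query(DataLinksRow)
                      .filter(DataLinksRow.identifiers.any(identifier))
                      .first())
        return bool(row and link_type in (row.links or {}))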

golnazads commented 4 years ago

Thank you so much Roman for the suggested approach. Let's chat on Monday. I just thought to list the cases that resolver deals with right now:

1- request for the existence of a feature (i.e. is there a TOC, and eventually being able to answer: is there metrics, graphics, co-reads, etc.)
2- grab the url for esources, data, and their sub links (for esources, grab the url for PUB_PDF for example; for data, grab the url for simbad for example)
3- grab the associated records for a bibcode

Being able to get one specific record among many by querying on link_type and link_sub_type in addition to bibcode has made fetching that record very efficient. My understanding was that this service needs to be as time-efficient as possible because of the many requests it has to serve.

Going forward we want to have the identifier included in the database, and, the way I understand it right now, someone is going to use a non-canonical bibcode to ask for the records associated with the canonical bibcode. We need to be able to serve this kind of request very time-efficiently.

have a great weekend

golnazads commented 4 years ago

For Friday's discussion:

1- do we need the information about all of citations, references, co-reads, similar, metrics, and graphics to be available to resolver (i.e. can we not list them on the link_gateway page, and hence avoid implying that a link exists when resolver does not have the information)?
2- what if we introduce a new table to map bibcode and identifier (a rough sketch follows below), letting resolver have access to valid bibcode, alternate_bibcode, arxiv id, and doi -- basically addressing the issue of having a valid abstract, arxiv, and doi url to redirect to.
3- not listing the items in #1 does not mean that resolver would not serve those links; if they come in, resolver makes a blind url as usual and logs it, assuming the user had the information that the link exists, and if the link is not valid, the same 404 page that BBB displays can be displayed from the gateway for consistency.

talk to you Friday. thank you so much.
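
here is a minimal sketch of the mapping table in 2- above; the table and column names are placeholders, not an agreed design:

    # sketch of a bibcode/identifier mapping table (point 2- above); names are placeholders
    from sqlalchemy import Column, String
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class IdentifierMap(Base):
        __tablename__ = 'identifier_map'
        # anything a user might send: alternate bibcode, arXiv id, doi
        identifier = Column(String, primary_key=True)
        # the canonical bibcode it resolves to
        bibcode = Column(String, index=True)

    def canonical_for(session, identifier):
        """Resolve an incoming identifier to its canonical bibcode, or None."""
        row = (session.query(IdentifierMap)
                      .filter(IdentifierMap.identifier == identifier)
                      .first())
        return row.bibcode if row else None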