adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Support for version field in ingest #206

Open aaccomazzi opened 5 years ago

aaccomazzi commented 5 years ago

Asclepias needs support for a metadata field named version of type string. It has to be entered as part of the bib data as well as in protobuf and SOLR schema. This will typically contain a string of the kind "vXXX.YYY.ZZZ"

spacemansteve commented 5 years ago

We also need to add 'series'.

spacemansteve commented 5 years ago

Version seems like a pretty general name. Can we use 'asclepias_version' or something else that provides more context?

aaccomazzi commented 5 years ago

Actually this will be a generic field for all versions, including, e.g. what today is provided by arXiv and what will come from "live" papers hosted by publishers such as the AAS.
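
For illustration only, here is a minimal sketch of how an ingest step might recognize such version strings. The regular expression and the helper name `parse_version` are assumptions and not part of the pipeline. The pattern accepts both the "vXXX.YYY.ZZZ" form mentioned above and shorter forms such as an arXiv-style "v2".

```python
import re

# Hypothetical pattern: a leading "v" followed by one or more
# dot-separated integers, e.g. "v1.2.3" or "v2".
_VERSION_RE = re.compile(r"^v(\d+(?:\.\d+)*)$")


def parse_version(value):
    """Return the numeric components of a version string, or None if it does not match."""
    match = _VERSION_RE.match(value.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))
```

Since the field itself is a plain string, a check like this would only be useful as an optional sanity test at ingest time; the stored value would remain the original string.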

aaccomazzi commented 5 years ago

Also, and sorry for polluting this thread with SOLR requests, we should modify the schema so that "author_count" is a stored field, as with all other *_count fields

spacemansteve commented 5 years ago

No, things besides asclepias must support versions.

romanchyla commented 5 years ago

@aaccomazzi what is the use of the stored author_count field? You know that we have to be careful about the index size: it translates to higher disk space and more payments to AWS (that is just what is on top of my mind; it has all sorts of consequences). Rather than adding more fields, it would be nice to actually remove some, hence the question about the use case.

aaccomazzi commented 5 years ago

we need the field for:

  1. searching (single author and collaborations)
  2. sorting
  3. easy computation of metrics from API users (without requiring they retrieve long author lists)

I think we may also use it in BBB once we try to speed things up. Right now for any search we retrieve fields which we don't display by default, such as the full author list and the abstract. If we provide a solr option to send just the first N authors (where N is author configurable) then we would want to know how many more there are.
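
A minimal sketch of that idea, assuming a search response document is a dict with an `author` list and an `author_count` integer. The function name `truncate_authors` is hypothetical, and no such Solr option exists yet.

```python
def truncate_authors(doc, n):
    """Return the first n authors and how many more there are.

    Falls back to the length of the author list when the document
    carries no author_count value.
    """
    authors = doc.get("author", [])
    total = doc.get("author_count", len(authors))
    shown = authors[:n]
    return shown, max(total - len(shown), 0)
```

If the server had already cut the list down to N names, the stored count would be the only way for the client to report how many authors were left out.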

romanchyla commented 5 years ago

scenarios 1 and 2 are already covered with just an indexed field (no need to have it also stored)

i'm not aware of any functionality (besides custom functions) that allows you to output only a certain number of items from a field, are you? https://lucene.apache.org/solr/guide/6_6/transforming-result-documents.html

number 3 is thus highly speculative ("if we provide") and there is no value in having it until that feature is really needed in real life (too much ado for nothing), but the case would be much stronger if somebody can convincingly show that shaving off a large number of bytes (on average) results in faster responses and that this outweighs the changes needed in the backoffice processing and the increase in storage

(In reply to Alberto Accomazzi's comment of Wed, Oct 3, 2018 at 4:51 PM.)

aaccomazzi commented 5 years ago

So what kind of storage savings are we talking about here? I just remembered the reason why, as an API user, I wanted this: I was testing whether the author count in classic coincided with the one in solr, and ended up having to retrieve the full author list instead. I'm sure there are other cases where this might be equally useful. If we knew the cost in storage and performance we could then make more informed choices.
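
A rough sketch of that comparison, assuming the classic counts are available as a dict keyed by bibcode and the solr documents are dicts with `bibcode`, `author` and (when stored) `author_count` keys. Both function names are hypothetical.

```python
def solr_author_count(doc):
    """Use author_count when the response includes it, else count the author list."""
    if "author_count" in doc:
        return doc["author_count"]
    return len(doc.get("author", []))


def find_count_mismatches(classic_counts, solr_docs):
    """Return the bibcodes whose classic author count differs from the solr one."""
    mismatches = []
    for doc in solr_docs:
        bibcode = doc["bibcode"]
        if bibcode in classic_counts and classic_counts[bibcode] != solr_author_count(doc):
            mismatches.append(bibcode)
    return mismatches
```

With a stored author_count the client would only need to request that one field; without it, the fallback branch requires the full author list for every record.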

spacemansteve commented 5 years ago

Version field is deployed. I don't want to close this issue because the author_count field discussion has not been resolved. Do we want to

aaccomazzi commented 5 years ago

I don't see a reason to create a new field (author_count_stored) since it would simply duplicate the purpose of the existing author_count with no additional benefit, apart from being stored. Let's just make author_count stored; there are enough reasons to do so, and given that it's an integer field I suspect the storage costs are quite modest.
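
As a back-of-the-envelope check on "quite modest", here is a sketch that assumes 4 raw bytes per stored integer and ignores Lucene's stored-field compression and per-document overhead. The document count used below is a hypothetical round number, not a measured index size.

```python
def stored_int_overhead_bytes(num_docs, bytes_per_value=4):
    """Rough upper bound on the raw bytes added by storing one integer per document."""
    return num_docs * bytes_per_value


def to_mebibytes(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / (1024 * 1024)


# Hypothetical index of 15 million documents.
overhead = stored_int_overhead_bytes(15_000_000)
print(f"{overhead} bytes, about {to_mebibytes(overhead):.1f} MiB")
```

Under these assumptions the added storage is a few tens of MiB per index copy; an actual measurement on the real index would be needed to confirm it.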