adsabs / ADSImportPipeline

Data ingest pipeline for ADS classic->ADS+
GNU General Public License v3.0

Adjust pubdates when day resolution not available #253

Open aaccomazzi opened 3 years ago

aaccomazzi commented 3 years ago

Most records in the ADS database have "00" in the day field for a pubdate (e.g. "pubdate":"2020-01-00"). While we have a special case for handling these records when we create the date field (which in this case gets set to 2020-01-01T00:00:00Z), we don't do so with pubdate, which causes the record to be missed when one performs a range query such as pubdate:[2020-01-01 TO 2020-01-31].

I believe we originally resisted manipulating the pubdate because "00" carries the information that the actual day of publication is unknown (as opposed to being exactly the first day of the month), but unless this is handled as a special case at indexing time, the current situation leads to incorrect results. Opinions @romanchyla?
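
To illustrate why such a record is missed, here is a minimal sketch that assumes the range bounds compare like plain YYYY-MM-DD strings. The helper `in_pubdate_range` is hypothetical and does not reflect how Solr evaluates the query internally.

```python
def in_pubdate_range(pubdate: str, start: str, end: str) -> bool:
    """Inclusive range check on YYYY-MM-DD strings.

    Assumption: fixed-width date strings are compared lexicographically,
    so "2020-01-00" sorts before "2020-01-01".
    """
    return start <= pubdate <= end


record = "2020-01-00"  # day of publication unknown

# The record falls just below the lower bound of the January range.
missed = in_pubdate_range(record, "2020-01-01", "2020-01-31")

# Lowering the bound to -00 brings it back in.
found = in_pubdate_range(record, "2020-01-00", "2020-01-31")

print(missed, found)
```

Under that assumption, any query whose lower bound is a real calendar day silently drops every record whose day is "00".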

romanchyla commented 3 years ago

Yes, I'd prefer to treat the dates as real ones; the 00 case is there and I actually made it possible to search for it. See the following query: bibgroup:gemini property:refereed database:astronomy pubdate:[2020-01-00 TO 2020-12-31]

vs

bibgroup:gemini property:refereed database:astronomy pubdate:[2020-01-01 TO 2020-12-31]

196 vs 175

The search for -00 is useful for curators, but it is confusing to users (unless we document it, and we know users often won't see the documentation). Converting the dates in the pipeline seems like a good option (though without that data we'll lose the curators' -00 search use case).

As I see it, of the two evils (adding more logic to Solr to treat pubdates specially, or modifying the pipeline to change the exported values), the latter seems the lesser, especially if curators get to rely on a backend curation system (once that one exists).
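
A rough sketch of what the pipeline-side conversion could look like, assuming pubdates arrive as YYYY-MM-DD strings. The function `normalize_pubdate` is a hypothetical name, not an existing ADSImportPipeline function, and a missing month ("00") is deliberately left untouched here.

```python
import re

# Assumed input format: YYYY-MM-DD, where MM and DD may be "00".
_PUBDATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def normalize_pubdate(pubdate: str) -> str:
    """Replace an unknown day ("00") with "01" so range queries match.

    Values that do not look like YYYY-MM-DD are returned unchanged,
    as are dates whose month is also "00".
    """
    m = _PUBDATE.fullmatch(pubdate)
    if m is None:
        return pubdate
    year, month, day = m.groups()
    if day == "00" and month != "00":
        day = "01"
    return f"{year}-{month}-{day}"


print(normalize_pubdate("2020-01-00"))
print(normalize_pubdate("2020-00-00"))
```

The trade-off is the one described above: after this conversion, an unknown day can no longer be told apart from a genuine first-of-the-month publication date in the exported value.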

On Fri, Oct 30, 2020 at 10:53 AM Alberto Accomazzi wrote: [opening comment, quoted in full]

aaccomazzi commented 3 years ago

Also to be considered: what do we do for publication dates that lack a month? Clustering them in January may not be what we want because: