Open aaccomazzi opened 3 years ago
Yes, I'd prefer to treat the dates as real ones; the 00 case is there and I actually made it possible to search for it. See the following query: bibgroup:gemini property:refereed database:astronomy pubdate:[2020-01-00 TO 2020-12-31]
vs
bibgroup:gemini property:refereed database:astronomy pubdate:[2020-01-01 TO 2020-12-31]
196 vs 175
The search for -00 is useful for curators, but it gets confusing for users (unless we document it, and we know users often won't see the documentation). Converting the dates in the pipeline seems like a good option (we'll lose the curators' -00 search use case without data).
As I see it, of the two evils (adding more logic to solr to treat pubdates specially, or changing the pipeline to modify the exported values), the latter seems the lesser, especially if curators get to rely on a backend curation system (once that one is there).
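To make the pipeline option concrete, here is a minimal sketch of what such a conversion might look like. The function name `normalize_pubdate` is hypothetical and not taken from ADSImportPipeline. The sketch assumes pubdates arrive as `YYYY-MM-DD` strings, and it deliberately leaves a `00` month untouched, since how to handle that case is still an open question.

```python
import re

# Assumed input shape: "YYYY-MM-DD", where MM and DD may be "00" (unknown).
_PUBDATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def normalize_pubdate(pubdate: str) -> str:
    """Hypothetical helper: map an unknown day ("00") to "01".

    Values that do not look like YYYY-MM-DD, or whose month is "00",
    are returned unchanged rather than guessed at.
    """
    match = _PUBDATE_RE.fullmatch(pubdate)
    if match is None:
        return pubdate  # unexpected format: leave as-is
    year, month, day = match.groups()
    if month == "00":
        return pubdate  # missing month: not decided here
    if day == "00":
        day = "01"  # unknown day becomes the first of the month
    return f"{year}-{month}-{day}"
```

With this, `normalize_pubdate("2020-01-00")` yields `2020-01-01`, which falls inside a range such as `pubdate:[2020-01-01 TO 2020-01-31]`. The trade-off is the one already noted: the "day unknown" information is lost from the exported value.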
On Fri, Oct 30, 2020 at 10:53 AM Alberto Accomazzi notifications@github.com wrote:
Most records in the ADS database have "00" in the day field for a pubdate (e.g. "pubdate":"2020-01-00"). While we have a special case for handling these records when we create the date field (which in this case gets set to 2020-01-01T00:00:00Z), we don't do so with pubdate, which causes the record to be missed when one performs a range query such as pubdate:[2020-01-01 TO 2020-01-31].
I believe we originally resisted manipulating the pubdate because "00" carries the information that the actual day of publication is unknown (as opposed to being exactly the first day of the month), but unless this is handled as a special case at indexing time, the current situation leads to incorrect results. Opinions @romanchyla?
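The missed-record behaviour can be illustrated with a small sketch. It assumes, for illustration only, that the range check behaves like an ordered comparison on the `YYYY-MM-DD` string; the actual Solr field handling may differ. The helper `in_range` and the sample values are made up for this example.

```python
def in_range(pubdate: str, start: str, end: str) -> bool:
    """Inclusive range check using plain string ordering (an assumption)."""
    return start <= pubdate <= end


# Made-up sample pubdates, two of them with an unknown ("00") day.
records = ["2020-01-00", "2020-01-15", "2020-02-00"]

# A range starting at the 1st skips the record whose day is "00".
strict = [p for p in records if in_range(p, "2020-01-01", "2020-01-31")]

# Starting the range at "-00" picks it up.
loose = [p for p in records if in_range(p, "2020-01-00", "2020-01-31")]
```

Under that assumption, `2020-01-00` sorts before `2020-01-01`, so it falls just outside the lower bound of the first range and only appears in the second.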
Also to be considered: what do we do for publication dates that lack a month? Clustering them in January may not be what we want because: