Closed by michamos 12 months ago
Kirsten reports in https://inspirehep.zulipchat.com/#narrow/stream/195298-experts/topic/url.20in.20references/near/390267643 that URLs sometimes get incorrectly duplicated by refextract.
I've narrowed it down to the annotation handling:
    In [1]: from refextract.references.pdf import extract_texkeys_and_urls_from_pdf
    In [2]: res = extract_texkeys_and_urls_from_pdf("/tmp/2303.03819.pdf")
    In [3]: res
    Out[3]:
    [{'texkey': 'Hees-Rapp'},
     {'texkey': 'He:2022ywp', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Das-Alam-Mohanty'},
     {'texkey': 'Svetitsky:1987gq'},
     {'texkey': 'Tsallis'},
     {'texkey': 'Marques-Cleymans-Deppman-2015'},
     {'texkey': 'Marques-Andrade-Deppman-2013', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'WilkWlodarkzyk-multiparticle', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'TsallisBook', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'PLASTINO1995347', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Muskat', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Schwammle', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Schwammle2009', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'WaltonRafelski', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Wong:2015mba', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Deppman:2019yno', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'PasechnikSumbera', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Adolfsson:2020dhm', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Qin:2015srf', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Apolinario:2015bfm', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Casalderrey-Solana:2018wrw', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'CORADDU2003473', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Curilef', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Annala:2019puf', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Annala:2020rgx', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Cardoso2017', 'urls': {'http://arxiv.org/abs/2204.09299'}},
     {'texkey': 'Sen:2021tdu', 'urls': {'http://arxiv.org/abs/2204.09299'}}]
Here only the first occurrence of the URL (on `He:2022ywp`) is correct; the others are leftovers that should not be present.
This bug should be fixed and all affected INSPIRE records should be identified.
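As a starting point for spotting affected output, here is a minimal sketch that flags any URL attached to more than one reference in a list shaped like the one returned above. The helper name `find_repeated_urls` is hypothetical and not part of refextract, and the check is only a heuristic: a URL that is legitimately cited by two references would be flagged as well, so hits still need manual review.

```python
from collections import defaultdict


def find_repeated_urls(references):
    """Return {url: [texkeys]} for every URL attached to more than one reference.

    `references` is assumed to be a list of dicts with a 'texkey' entry and an
    optional 'urls' set, as in the output shown in this issue.
    """
    texkeys_by_url = defaultdict(list)
    for ref in references:
        # References without annotations have no 'urls' key at all.
        for url in ref.get("urls", ()):
            texkeys_by_url[url].append(ref.get("texkey"))
    return {url: keys for url, keys in texkeys_by_url.items() if len(keys) > 1}


# Shortened sample taken from the output above.
sample = [
    {"texkey": "Hees-Rapp"},
    {"texkey": "He:2022ywp", "urls": {"http://arxiv.org/abs/2204.09299"}},
    {"texkey": "Das-Alam-Mohanty"},
    {"texkey": "Marques-Andrade-Deppman-2013", "urls": {"http://arxiv.org/abs/2204.09299"}},
]
repeated = find_repeated_urls(sample)
```

Running this over the full `res` from the session above would list every texkey sharing the arXiv URL, which could serve as a first filter when looking for suspicious references.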