cern-sis / issues-inspire


extracting URLs from PDFs in refextract incorrectly duplicates URLs #377

Closed: michamos closed this issue 12 months ago

michamos commented 1 year ago

Kirsten reports in https://inspirehep.zulipchat.com/#narrow/stream/195298-experts/topic/url.20in.20references/near/390267643 that URLs sometimes get incorrectly duplicated by refextract.

I've narrowed it down to the annotation handling:

In [1]: from refextract.references.pdf import extract_texkeys_and_urls_from_pdf

In [2]: res = extract_texkeys_and_urls_from_pdf("/tmp/2303.03819.pdf")

In [3]: res
Out[3]:
[{'texkey': 'Hees-Rapp'},
 {'texkey': 'He:2022ywp', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Das-Alam-Mohanty'},
 {'texkey': 'Svetitsky:1987gq'},
 {'texkey': 'Tsallis'},
 {'texkey': 'Marques-Cleymans-Deppman-2015'},
 {'texkey': 'Marques-Andrade-Deppman-2013',
  'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'WilkWlodarkzyk-multiparticle',
  'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'TsallisBook', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'PLASTINO1995347', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Muskat', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Schwammle', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Schwammle2009', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'WaltonRafelski', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Wong:2015mba', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Deppman:2019yno', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'PasechnikSumbera', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Adolfsson:2020dhm', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Qin:2015srf', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Apolinario:2015bfm', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Casalderrey-Solana:2018wrw',
  'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'CORADDU2003473', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Curilef', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Annala:2019puf', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Annala:2020rgx', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Cardoso2017', 'urls': {'http://arxiv.org/abs/2204.09299'}},
 {'texkey': 'Sen:2021tdu', 'urls': {'http://arxiv.org/abs/2204.09299'}}]

Here only the first occurrence (for He:2022ywp) is correct; the others are leftovers that should not be present.

This bug should be fixed and all affected INSPIRE records should be identified.
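
As a possible starting point for identifying affected records, here is a minimal sketch of a heuristic that flags any URL attached to more than one reference in output shaped like the one above. The helper name `find_repeated_urls` and the `min_repeats` threshold are hypothetical and not part of refextract. The sample data is a shortened copy of the output in this issue.

```python
from collections import Counter


def find_repeated_urls(references, min_repeats=2):
    """Return {url: count} for URLs attached to at least `min_repeats` references.

    Hypothetical helper, not part of refextract. `references` is assumed to
    have the shape returned by extract_texkeys_and_urls_from_pdf: a list of
    dicts with a 'texkey' and an optional 'urls' set.
    """
    counts = Counter()
    for ref in references:
        # 'urls' is absent for references that carry no link
        for url in ref.get("urls", ()):
            counts[url] += 1
    return {url: n for url, n in counts.items() if n >= min_repeats}


# Shortened copy of the output shown in this issue
sample = [
    {"texkey": "Hees-Rapp"},
    {"texkey": "He:2022ywp", "urls": {"http://arxiv.org/abs/2204.09299"}},
    {"texkey": "Das-Alam-Mohanty"},
    {"texkey": "Marques-Andrade-Deppman-2013",
     "urls": {"http://arxiv.org/abs/2204.09299"}},
    {"texkey": "TsallisBook", "urls": {"http://arxiv.org/abs/2204.09299"}},
]

suspicious = find_repeated_urls(sample)
print(suspicious)  # {'http://arxiv.org/abs/2204.09299': 3}
```

A URL can legitimately appear in several references, so a hit only marks a record as a candidate for manual review, not as a confirmed case of this bug.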