galaxyproject / ephemeris

Library for managing Galaxy plugins - tools, index data, and workflows.
https://ephemeris.readthedocs.org/

usegalaxy.* shared data updater creates duplicates #162

Open hexylena opened 3 years ago

hexylena commented 3 years ago

So this is quite unfortunate. Roughly every time the shared data updater runs, it creates some duplicates.

$ bash no-dupes.sh
# Checking eu
    Error! Duplicates!
      4 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
      3 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking org
    Error! Duplicates!
     52 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
     51 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking au
    Error! Duplicates!
     49 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
     53 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"
# Checking fr
    Error! Duplicates!
      2 "/Metabolomics/Msi analyte distribution/DOI: 10.5281/zenodo.484496/Uploaded Composite Dataset (imzml)"
      2 "/Proteomics/Mass spectrometry imaging 1: Loading and exploring MSI data/DOI: 10.5281/zenodo.1560645/Uploaded Composite Dataset (imzml)"

EU has seen this quite prominently, the other servers less so. I'd never seen it on them until I wrote the script to check them just now, and clearly it has been going on for quite some time judging by the counts.

The script to check is https://github.com/usegalaxy-eu/shared-data/blob/master/no-dupes.sh. It just dumps the contents of the GTN folder.
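For anyone who wants to reproduce the check without the shell script, here is a minimal Python sketch of the counting step. It assumes you already have the folder listing as a flat list of full path strings; `find_duplicates` and the sample `listing` are hypothetical names for illustration, not part of no-dupes.sh or ephemeris.

```python
from collections import Counter


def find_duplicates(paths):
    """Return {path: count} for every path that occurs more than once."""
    counts = Counter(paths)
    return {path: n for path, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical listing; in practice this would come from dumping
    # the library folder contents on each server.
    listing = [
        "/Metabolomics/Example/Uploaded Composite Dataset (imzml)",
        "/Metabolomics/Example/Uploaded Composite Dataset (imzml)",
        "/Proteomics/Example/Some other dataset",
    ]
    dupes = find_duplicates(listing)
    if dupes:
        print("Error! Duplicates!")
        for path, n in dupes.items():
            print(f"{n:7d} {path!r}")
```

Paths that appear only once are dropped from the result, so an empty dict means the folder is clean.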

Also, that API probably doesn't need enforced authentication, since I can browse those while anonymous on the web.

I can add another script to try to remove duplicates, but shared data already has one script hacking around upload permissions, and another feels like too much duct tape.

mvdbeek commented 3 years ago

What is the initial script that populates the folders? Which API routes does it use?

gmauro commented 3 years ago

@mvdbeek this one https://github.com/usegalaxy-eu/shared-data/blob/master/run.sh

mvdbeek commented 3 years ago

I've moved this to ephemeris then. I don't think there is a logical path towards de-duplication on Galaxy's end (at least not without either a checksum or something else that can identify a piece of data). This should be handled in the setup-data-libraries script IMO.
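A rough sketch of what handling it on the client side could look like: before uploading, compare the wanted items against what the folder already contains and skip anything already present. This is only an illustration under assumptions, not how setup-data-libraries currently works. `plan_uploads` and the `"path"` key are hypothetical, and it uses the full folder path as the identity of a dataset, which is weaker than a checksum.

```python
def plan_uploads(existing_paths, wanted):
    """Return the items from `wanted` whose path is not already present.

    existing_paths: iterable of full paths already in the library (assumed
                    to have been fetched from the server beforehand).
    wanted:         list of dicts, each with a "path" key (hypothetical shape).
    """
    seen = set(existing_paths)
    to_upload = []
    for item in wanted:
        path = item["path"]
        if path in seen:
            # Already in the library, or already queued in this run.
            continue
        seen.add(path)
        to_upload.append(item)
    return to_upload
```

Running the same plan twice against an up-to-date listing yields an empty upload list, which is the property the updater is currently missing.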

hexylena commented 3 years ago

Sounds good. Thanks for the move, I should've made it here in the first place.

@Slugger70 @natefoo @lecorguille this issue affects all of you.