Open kostrykin opened 3 months ago
I'm looking into this task.
Tool source: tools/image_processing/bia-ftplinks
Is it helpful to add this tool to the IUC tool repository, so we also make use of their tests and best practices?
Started tracking the progress in this draft PR: https://github.com/bgruening/galaxytools/pull/1541
Added EDAM ontology:
This point is not clear to me yet:
- [x] The "storage mode" can only be `nfs` or `fire`, this should thus be a dropdown field.
- [ ] The help text of the field should say when to use which of the two options.
but I have found info about it in the following places:
And some context: FIRE stands for FIle REplication, EMBL-EBI’s very large-scale object data storage system. This provides long-term sustainable storage, operational redundancy, and backup to tape. Dataset level metadata are stored in a MongoDB database. The system backend is coded in Kotlin.
- [x] Why are both fields optional? Is that correct?
Thanks @B0r1sD!
- This point is not clear to me yet:
- [x] The "storage mode" can only be `nfs` or `fire`, this should thus be a dropdown field.
Right now "storage mode" is a text field, but according to the help text of the field, only two values are accepted (either `nfs` or `fire`). In that case, this field should be a dropdown field where one of the two options can be selected, since any other input value would be invalid and invalid input should be prevented by a good UI.
- [ ] The help text of the field should say when to use which of the two options.
To me and @beatrizserrano it wasn't immediately clear how to determine the correct value for this input. I see from your explanations what either of the two is, but still, how is the user supposed to determine the correct value for input here? Can we add a help text here to provide some guidance?
#!/bin/bash
# Run this file in bash with this command: ./filename
# Downloads two files of study S-BIAD1458 from the FIRE storage via anonymous FTP.
HOST=ftp.ebi.ac.uk
USER=anonymous
# -p: passive mode, -i: no prompting during mget, -n: no auto-login, -v: verbose
ftp -pinv "$HOST" <<EOF
user $USER
cd biostudies/fire/S-BIAD/458/S-BIAD1458/Files
binary
mget "Red blood cell differential image data/data/0-0.3/0(11).jpg"
mget "Red blood cell differential image data/data/0-0.3/0(2).jpg"
disconnect
bye
EOF
- Mail sent to BIA to ask about nfs/fire, excerpt from reply:
> To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear. You can find out what is the storage mode for a dataset using this command: `curl https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info -s | jq -r .ftpLink`
Cool can we use the curl command in the tool wrapper to determine the correct mode automatically?
Yeah that would be ideal.
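A rough sketch of how the wrapper could turn the `ftpLink` into a storage mode. It assumes the link always contains `/biostudies/fire/` or `/biostudies/nfs/`, as in the FTP paths quoted in this thread; that is an observation, not something BIA has guaranteed. The function name `storage_mode_from_link` is made up for illustration.

```shell
# Sketch: derive the storage mode ("fire" or "nfs") from a BioStudies ftpLink.
# Assumption: the link contains ".../biostudies/<mode>/...".
storage_mode_from_link() {
    case "$1" in
        */biostudies/fire/*) echo "fire" ;;
        */biostudies/nfs/*)  echo "nfs" ;;
        *)
            # Anything else is reported on stderr and signalled via exit status.
            echo "unknown storage mode in link: $1" >&2
            return 1
            ;;
    esac
}

# Usage (needs network access, curl and jq):
#   link=$(curl -s "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" | jq -r .ftpLink)
#   storage_mode_from_link "$link"
```

If the link format ever changes, the function fails loudly instead of silently picking a mode, so the dropdown could stay as a manual fallback.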
Let me know if you need any help!
@B0r1sD What's your current state? We need to report our state tomorrow. It would be ideal if you could tick the boxes! 🥳
Current state: we're in talks with folks from BIA to add a button on their website to seamlessly integrate a data retrieval method similar to some 'Get Data' tools (UCSC, EBI SRA,...).
In the meantime, I got an answer about FIRE/NFS:
right now all our studies have been migrated to FIRE storage. However, we are introducing a new feature that will use NFS as a storage option again. This will mean that soon we’ll have data on NFS and FIRE soon, so you probably want to keep that in mind.
So we decided to keep the dropdown, make FIRE the default option, and thoroughly explain why there are two options (and when to choose which). In the future, when they reuse NFS, we can look into integrating the curl + jq command that checks whether it's FIRE or NFS. This command worked:
curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq '.. | .ftpLink? // empty'
However, it makes use of their API, which is in alpha.
Via the FTP link, the nfs/fire information is not directly included (it could be retrieved later if we change the wrapper).
curl "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/570/S-BIAD570/" -s
Information communicated to the BIA:
General flow
Depending on how the data is fetched at your end, the depositing of data should be either implemented synchronously (docs) or asynchronously (docs). The synchronous implementation is less complex but depending on your backend, this could simply not be an option. @wm75 is an expert in this implementation so can provide technical support where needed.
The following GitHub repo contains example scripts for the implementation in three different Python web frameworks (CherryPy, Django, Flask): https://github.com/hexylena/galaxy-data_source-examples The lines of code Björn was referring to would look like this in CherryPy: https://github.com/hexylena/galaxy-data_source-examples/blob/main/cherrypy/server.py, which also comes with documentation: https://github.com/hexylena/galaxy-data_source-examples/tree/main/cherrypy#overview.
Below is an example of how this feature was implemented by UCSC for their Tablebrowser, from both perspectives.
Below, two examples of active implementations are shown, which is the relevant perspective for your team.
-UCSC Tableviewer-
The tool redirects to the following link, where you can see the GALAXY_URL parameter: https://genome.ucsc.edu/cgi-bin/hgTables?GALAXY_URL=https%3A//usegalaxy.eu/tool_runner&tool_id=ucsc_table_direct1&sendToGalaxy=1&hgta_compressType=none&hgta_outputType=bed
Here you can see how UCSC implemented the option buttons on their webpage.
-EBI SRA-
A video (from 2015) showing the workflow and how EBI implemented this on their side for the European Short Read Archive:
https://usegalaxy.eu/tool_runner/data_source_redirect?tool_id=ebi_sra_main
This is what the Galaxy tool (or 'wrapper') would look like on Galaxy's side: XML file example for the UCSC Tablebrowser: https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/ucsc_tablebrowser.xml. More technical information on this tool of the 'data source' type can be found here: https://docs.galaxyproject.org/en/latest/dev/data_source.html. This is something we would develop and provide.
The retrieval tool also only works for studies that are part of the 'BioImages - Core' collection (with an accession that looks like `S-BIAD0000`). This is not the only study collection on there, so I will document this in the wrapper for now and see how the seamless integration button progresses, as this would make this tool obsolete (so I don't see the point now in implementing an error catch or a feature that works with all types of studies, e.g. S-JCBD-201709074).
Having some issues serving the tool locally, the last change is a more verbose help section which I will add later:
Storage mode:
FIle REplication (FIRE) is EMBL-EBI’s very large-scale object data storage system. At the time of writing, all their studies have been migrated to FIRE storage, hence it is the default option. However, they are introducing a new feature that will use NFS as a storage option, so the study you are referring to might live on NFS in the near future. This is the reason both options are available.
Accession number:
This tool only supports studies that are part of the 'BioImages - Core' collection (with an accession number that follows the `S-BIAD0000` pattern).
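Should an input check be wanted later after all, a minimal sketch of the accession pattern could look like this. It assumes the pattern is `S-BIAD` followed by one or more digits, which matches the accessions seen in this thread (S-BIAD570, S-BIAD1458) but is not an official specification; `is_bia_core_accession` is a hypothetical helper name.

```shell
# Sketch: succeed only for accessions of the form S-BIAD<digits>.
# Assumption: 'BioImages - Core' accessions are "S-BIAD" plus digits only.
is_bia_core_accession() {
    printf '%s\n' "$1" | grep -Eq '^S-BIAD[0-9]+$'
}

# Usage:
#   if is_bia_core_accession "S-JCBD-201709074"; then
#       echo "supported"
#   else
#       echo "not a BioImages - Core accession" >&2
#   fi
```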
There is a tool for downloading images from the Bioimage Archive: https://imaging.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/bia_download/bia_download/0.1.0+galaxy0
The UI of this tool needs some love:
- The "storage mode" can only be `nfs` or `fire`, this should thus be a dropdown field.
Optional: