beatrizserrano / galaxy-image-community

Repository to manage project 17 at the BioHackathon Europe 2024

Refactor the BIA retrieval tool #11

Open kostrykin opened 3 months ago

kostrykin commented 3 months ago

There is a tool for downloading images from the Bioimage Archive: https://imaging.usegalaxy.eu/root?tool_id=toolshed.g2.bx.psu.edu/repos/bgruening/bia_download/bia_download/0.1.0+galaxy0

The UI of this tool needs some love:

Optional:

B0r1sD commented 1 week ago

I'm looking into this task.

Tool source: tools/image_processing/bia-ftplinks

Would it be helpful to add this tool to the IUC tool repository, so we can also make use of their tests and best practices?

B0r1sD commented 1 week ago

Started tracking the progress in this draft PR: https://github.com/bgruening/galaxytools/pull/1541

  • [ ] The help text of the field should say when to use which of the two options.

but I have found info about it in the following places:

And some context: FIRE stands for FIle REplication, EMBL-EBI’s very large-scale object data storage system. This provides long-term sustainable storage, operational redundancy, and backup to tape. Dataset level metadata are stored in a MongoDB database. The system backend is coded in Kotlin.

kostrykin commented 1 week ago

Thanks @B0r1sD!

  • This point is not clear to me yet:
  • [x] The "storage mode" can only be nfs or fire, this should thus be a dropdown field.

Right now "storage mode" is a text field, but according to the help text of the field, only two values are accepted (either nfs or fire). In that case, shouldn't this field be a dropdown where one of the two options can be selected? Any other input value would be invalid, and a good UI should prevent invalid input.

  • [ ] The help text of the field should say when to use which of the two options.

To me and @beatrizserrano it wasn't immediately clear how to determine the correct value for this input. I see from your explanations what either of the two is, but still, how is the user supposed to determine the correct value for input here? Can we add a help text here to provide some guidance?

B0r1sD commented 1 week ago

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear. You can find out what is the storage mode for a dataset using this command:
 curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq -r .ftpLink

kostrykin commented 1 week ago
  • Mail sent to BIA to ask about nfs/fire, excerpt from reply:

To answer your question, NFS is our older storage system and FIRE is the new one. So, we have some files still in NFS while the newer ones are on FIRE. There is not a recommended mode for FTP download, but you will need to use the correct one for each dataset. Sorry if this was unclear. You can find out what is the storage mode for a dataset using this command:
> curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq -r .ftpLink

Cool, can we use the curl command in the tool wrapper to determine the correct mode automatically?
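
A rough sketch of how such an automatic check could look in Python, not a tested part of the wrapper. It assumes the storage mode appears as a path segment (`fire` or `nfs`) of the `ftpLink` value returned by the info endpoint in the curl command; that is an assumption, and the function names are hypothetical.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Endpoint copied from the curl command; the API may still change.
INFO_URL = "https://www.ebi.ac.uk/biostudies/api/v1/studies/{accession}/info"


def storage_mode_from_ftp_link(ftp_link):
    """Return 'fire' or 'nfs' if that segment occurs in the link's path, else None.

    Assumption: the storage mode is visible as a path segment of ftpLink.
    """
    segments = urlparse(ftp_link).path.lower().split("/")
    for mode in ("fire", "nfs"):
        if mode in segments:
            return mode
    return None


def fetch_storage_mode(accession):
    """Hypothetical helper: query the info endpoint and inspect its ftpLink field."""
    url = INFO_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=30) as response:
        info = json.load(response)
    return storage_mode_from_ftp_link(info.get("ftpLink", ""))
```

Keeping the parsing separate from the network call means the wrapper could fall back to the user-selected mode whenever the lookup returns `None` or the request fails.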

B0r1sD commented 1 week ago

Yeah that would be ideal.

kostrykin commented 1 week ago

Yeah that would be ideal.

Let me know if you need any help!

kostrykin commented 1 week ago

@B0r1sD What's your current state? We need to report our state tomorrow. It would be ideal if you could tick the boxes! 🥳

B0r1sD commented 5 days ago

Current state: we're in talks with folks from BIA to add a button on their website to seamlessly integrate a data retrieval method similar to some 'Get Data' tools (UCSC, EBI SRA,...).

In the meantime, I got an answer about FIRE/NFS:

right now all our studies have been migrated to FIRE storage. However, we are introducing a new feature that will use NFS as a storage option again. This will mean that soon we’ll have data on NFS and FIRE soon, so you probably want to keep that in mind.

So we decided to keep the dropdown, make FIRE the default option, and thoroughly explain why there are two options (and when to choose which). In the future, when they start using NFS again, we can look into integrating the curl + jq command that checks whether a study is on FIRE or NFS. This command worked:

curl "https://www.ebi.ac.uk/biostudies/api/v1/studies/S-BIAD570/info" -s | jq '.. | .ftpLink? // empty'

But it relies on their API, which is still in alpha.

The FTP link does not directly include the NFS/FIRE information (this could be determined later if we change the wrapper).

curl "https://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/570/S-BIAD570/" -s

B0r1sD commented 5 days ago

Information communicated to the BIA:

Technical details

General flow

Depending on how the data is fetched at your end, the depositing of data should be implemented either synchronously (docs) or asynchronously (docs). The synchronous implementation is less complex, but depending on your backend it may simply not be an option. @wm75 is an expert in this implementation and can provide technical support where needed.

Code implementation example(s)

The following GitHub repo contains example scripts for the implementation in three different Python web frameworks (CherryPy, Django, Flask): https://github.com/hexylena/galaxy-data_source-examples. The lines of code Björn was referring to would look like this in CherryPy: https://github.com/hexylena/galaxy-data_source-examples/blob/main/cherrypy/server.py. That example also comes with documentation: https://github.com/hexylena/galaxy-data_source-examples/tree/main/cherrypy#overview.

Examples

Below is an example of how this feature was implemented by UCSC for their Tablebrowser, from both perspectives.

Data(base) side

Below, two examples of active implementations are shown, which is the relevant perspective for your team.

-UCSC Tablebrowser-

-EBI SRA-

A video (from 2015) showing the workflow and how EBI implemented this on their side for the European Short Read Archive:

https://vimeo.com/121187220

https://usegalaxy.eu/tool_runner/data_source_redirect?tool_id=ebi_sra_main

Galaxy side

This is what the Galaxy tool (or 'wrapper') would look like on Galaxy's side. An XML file example for the UCSC Tablebrowser: https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/ucsc_tablebrowser.xml. More technical information on this 'data source' type of tool can be found here: https://docs.galaxyproject.org/en/latest/dev/data_source.html. This is something we would develop and provide.

B0r1sD commented 5 days ago

The retrieval tool also only works for studies that are part of the BioImages - Core collection (with an accession that looks like S-BIAD0000). This is not the only study collection there, so I will document this limitation in the wrapper for now and see how the seamless integration button progresses, since that would make this tool obsolete. For that reason I don't see the point right now in implementing error handling or a feature that works with all types of studies (e.g. S-JCBD-201709074).
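
If the limitation is ever checked rather than only documented, a minimal Python sketch could look like this. The pattern (`S-BIAD` followed only by digits) is inferred from the example accessions in this thread and is an assumption, and the function name is hypothetical.

```python
import re

# Assumption: Core collection accessions are "S-BIAD" followed by digits only.
BIAD_ACCESSION = re.compile(r"S-BIAD\d+")


def is_core_collection_accession(accession):
    """Return True if the accession matches the assumed S-BIAD pattern."""
    return BIAD_ACCESSION.fullmatch(accession.strip()) is not None
```

With this pattern, `S-BIAD570` is accepted, while `S-JCBD-201709074` is rejected.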

B0r1sD commented 4 days ago

I'm having some issues serving the tool locally. The last change is a more verbose help section, which I will add later: