esg-epfl-apc / tools-astro


Visualization plugin for data discovery and acquisition #49

Open francoismg opened 1 year ago

francoismg commented 1 year ago

In addition to the archive tool, whose interface is not well suited to exploring astronomical archives, we started looking into other ways to enable users to discover data from the various archives available online.

As suggested by @bgruening, the visualization plugin approach seems to be the most convenient and best suited to that use case.

However, after some investigation, some questions are still pending:

Visualization plugins currently require a dataset to be opened. Would it be possible to lift that requirement so plugins can be opened without any dataset? Or could a dataset from tool-data or another folder be automatically linked to a plugin, so the user could use the tool without having to upload anything?

In order to rely as little as possible on pyvo and network calls, the plugin would use a JSON file containing information about the astronomical archives. Could that file be used in place of a regular dataset, and if so, how?

To address that issue, we thought about the following possibilities (a rough sketch of the underlying pyvo/ADQL call follows the list):

Adding a custom endpoint to the Galaxy API that would handle pyvo requests to the archives and send the results back to the plugin

Making a specific dataprovider that would receive ADQL queries and create pyvo requests based on them

Creating a small ADQL tool that would be called through the tool API and receive a specific ADQL query to execute as input
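
For illustration, here is a rough sketch of the kind of pyvo/ADQL call all three options would wrap (the TAP endpoint and the query are placeholders, not the plugin's actual values):

```python
# Minimal sketch of a synchronous pyvo TAP/ADQL request.
# The endpoint URL and the ADQL query are placeholders; real values would come
# from the archive description file and from the user's input in the plugin.
import pyvo

service = pyvo.dal.TAPService("https://example-archive.org/tap")  # hypothetical endpoint
query = "SELECT TOP 10 obs_id, access_url FROM ivoa.obscore"      # example ADQL query

results = service.search(query)   # blocking TAP query
print(results.to_table())         # astropy Table with the returned metadata
```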

bgruening commented 1 year ago

I will loop in some Vis experts like @guerler.

My feeling is that we should not add domain-specific APIs to Galaxy. If we can go with a dataprovider, that would be better imho. But @guerler can judge that better. Do you, in the end, get a URL that you then put into the upload API?

> Creating a small ADQL tool that would be called through the tool API and receive a specific ADQL query to execute as input

This is also possible, and the "chart" visualisations do that imho if the number of datapoints gets too large to plot interactively. In that case they submit a plotting tool to the backend.

How long do you think such a query can take?

francoismg commented 1 year ago

> My feeling is that we should not add domain-specific APIs to Galaxy. If we can go with a dataprovider, that would be better imho. But @guerler can judge that better. Do you, in the end, get a URL that you then put into the upload API?

Thanks for the answers.

We will start looking into the dataprovider then, and wait for @guerler's feedback on that.

Yes, the query returns an array of file metadata which contains a download URL that we would then use with the upload API (which would also open the possibility of creating deferred datasets).
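
To give an idea, the upload step we have in mind would look roughly like this (a sketch using BioBlend; the Galaxy URL, API key, history id and download URL are placeholders, and whether put_url is the right entry point for deferred datasets is still an open question on our side):

```python
# Sketch only: hand a download URL from the query results to Galaxy's upload API.
# Galaxy URL, API key and history id below are placeholders.
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="<api-key>")

# One entry of the file-metadata array returned by the ADQL query (hypothetical)
download_url = "https://example-archive.org/data/obs_12345.fits"

# Let Galaxy fetch the file itself from the archive
gi.tools.put_url(download_url, history_id="<history-id>")
```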

> This is also possible, and the "chart" visualisations do that imho if the number of datapoints gets too large to plot interactively. In that case they submit a plotting tool to the backend.

We might end up adding both the API upload and the tool upload; since only the tool-based approach could be added to a workflow (if I'm not mistaken), it might be better to let the user choose which way they prefer.

Also, it would require the user to install another tool, so it's better if the plugin can work out of the box without depending on one. If I remember correctly, there is an API that lets you detect whether a specific tool is installed, so we might just gray out that option for users who do not have the tool in their instance.
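
Roughly what we have in mind for that check (a sketch with BioBlend; the tool id is hypothetical and treating a failed lookup as "not installed" is an assumption):

```python
# Sketch: detect whether a given tool is installed before offering the
# tool-based upload option. The tool id below is hypothetical.
from bioblend import ConnectionError as GalaxyError
from bioblend.galaxy import GalaxyInstance

gi = GalaxyInstance(url="https://galaxy.example.org", key="<api-key>")

def tool_available(tool_id: str) -> bool:
    try:
        gi.tools.show_tool(tool_id)
        return True
    except GalaxyError:
        # Assumption: a failing lookup means the tool is not installed
        return False

if not tool_available("adql_query"):
    print("ADQL tool missing: gray out the tool-based option in the plugin")
```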

> How long do you think such a query can take?

It's a bit tricky to give you an average duration because it depends on the query complexity, the number of archives it will be run on, and the responsiveness of each archive.

It can take from less than a second for a simple query on one archive to a couple of minutes (sometimes more) for complex queries running on multiple archives.
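
For the multi-minute cases, one option would be pyvo's asynchronous TAP mode rather than a blocking call; a rough sketch (endpoint and query are placeholders):

```python
# Sketch: submit a long-running ADQL query asynchronously so nothing blocks
# for minutes while the archives respond. Endpoint and query are placeholders.
import pyvo

service = pyvo.dal.TAPService("https://example-archive.org/tap")
job = service.submit_job(
    "SELECT obs_id, access_url FROM ivoa.obscore WHERE dataproduct_type = 'image'"
)
job.run()                      # start the query on the archive side
job.wait()                     # poll until the job reaches a final phase
results = job.fetch_result()   # retrieve the result table
print(results.to_table())
job.delete()                   # clean up the remote job
```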

guerler commented 1 year ago

Thank you for the detailed description of your project.

The requirement of providing datasets for visualizations should be relaxed imho; this could also be useful for other visualizations. Implementing this should be straightforward.

I agree with @bgruening that using a data provider is better than adding a custom API. Alternatively, in case a download URL is already available on the client, a direct request through the upload API can be made to retrieve the data.

Unfortunately, however, the upload tool is not available in workflows; only selected data managers are.

As for the JSON file containing the list of astronomical archives, this could potentially be added as a yml-file and then accessed by the data provider and/or provided to the visualization.

francoismg commented 1 year ago

Thank you for your quick response.

> The requirement of providing datasets for visualizations should be relaxed imho; this could also be useful for other visualizations. Implementing this should be straightforward.

OK, that sounds great. Should we look into it ourselves and try to make a PR, or is it something that one of you will implement?

> I agree with @bgruening that using a data provider is better than adding a custom API. Alternatively, in case a download URL is already available on the client, a direct request through the upload API can be made to retrieve the data.

> Unfortunately, however, the upload tool is not available in workflows; only selected data managers are.

OK, so we are set on the data provider; we will look into that then.

Regarding the workflow issue, that's something we could clarify later once the primary features are in place; I'm not totally sure what you mean by selected data managers.

> As for the JSON file containing the list of astronomical archives, this could potentially be added as a yml-file and then accessed by the data provider and/or provided to the visualization.

We had already made a script to export the archive structure in JSON format, but we can modify it to output a YAML file if that's more the Galaxy way, no problem.
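
The conversion itself is a one-liner with PyYAML; a small sketch (file names and the structure of the entries are only examples):

```python
# Sketch: re-export the archives description from JSON to YAML.
# File names and the structure of the entries are only examples.
import json
import yaml  # PyYAML

with open("astro_archives.json") as src:
    archives = json.load(src)  # e.g. [{"name": "...", "tap_url": "..."}, ...]

with open("astro_archives.yml", "w") as dst:
    yaml.safe_dump(archives, dst, sort_keys=False)
```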

At first I thought the file could be located in the static directory of the plugin, but if I understand you correctly, it would be located elsewhere so it can be accessed both by the plugin and by the dataprovider if needed, right? Do you know where that should be? Is there some other plugin, dataprovider, or something else working that way that we could follow as an example?

Thanks again for your answers.

guerler commented 1 year ago

@francoismg yes, you are welcome to open a PR to remove the requirement of providing a dataset to visualizations. I am happy to help if needed. Regarding the JSON containing the archives, it might actually be sufficient to add it to the visualization plugin for now, as you have suggested. We could consider placing it elsewhere as a configurable yml-file in the future, if the archive resources change often and/or the data provider also requires access.

francoismg commented 1 year ago

@guerler thanks, I was able to lift the frontend requirement, so now I can open the plugin page without a dataset. I guess now I have to modify the visualization controller so it can create the plugin without any dataset attached, but I can't find any mention of hda except in the trackster, sweepster, ... methods, and the plugin class and mixins don't seem to deal with hda either. Am I headed in the right direction? I would be happy to get some pointers, thanks.

guerler commented 1 year ago

Nice. Have you looked at the visualization's xml file? There is a data_sources tag which specifies the required dataset. Here is an example: https://github.com/galaxyproject/galaxy/blob/dev/config/plugins/visualizations/editor/config/editor.xml#L5.

The dataset is also evaluated at: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/visualization/plugins/resource_parser.py#L221

francoismg commented 1 year ago

> Nice. Have you looked at the visualization's xml file? There is a data_sources tag which specifies the required dataset. Here is an example: https://github.com/galaxyproject/galaxy/blob/dev/config/plugins/visualizations/editor/config/editor.xml#L5.

> The dataset is also evaluated at: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/visualization/plugins/resource_parser.py#L221

Thanks for the pointers. I looked at it and tried a few things over the last few days (setting the dataset param required attribute to false, removing the data_sources from the xml, removing the hda and dataset evaluation in the resource parser, ...), but plugin rendering keeps failing, and when I remove everything dataset-related in the visualization xml file the plugin is not even loaded and doesn't appear in the plugin list.

I thought making the dataset parameter not required would help, since it seems that when a parameter is not present and not required it is replaced by some default value (https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/visualization/plugins/resource_parser.py#L110), but it didn't work either.

I guess the goal should be to have a plugin configuration file with no mention of data_sources or a dataset parameter linked to it, right?

Should https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/visualization/plugins/config_parser.py#L86 be modified to do that? It seems like the case where no parameters are specified is already handled in the config parser.

From what I understand, every check and parsing step that comes after (in plugin.py, controllers/visualization.py, resource_parser.py, ...) looks for the parameters specified in the xml config file, so having a config file without those should work, right? Does that make sense, or am I completely in the wrong direction?

guerler commented 1 year ago

Yes, you would definitely need to remove the dataset requirement from your visualization xml first. Can you open a draft PR with what you have so far, or share the branch? I can take a look.

francoismg commented 1 year ago

> Yes, you would definitely need to remove the dataset requirement from your visualization xml first. Can you open a draft PR with what you have so far, or share the branch? I can take a look.

Yes, no problem. So far I was just trying to make it work in a hacky way to understand how things fit together, but I will make a new branch and send you the link.