compomics / searchgui

Highly adaptable common interface for proteomics search and de novo engines
http://compomics.github.io/projects/searchgui.html

Could SearchGUI write somewhere else than in SearchGUI-4.2.8/resources/(conf | temp) ? #349

Open max-l opened 1 year ago

max-l commented 1 year ago

SearchGUI writes temp files and configurations in subdirectories of the path returned by:

public String getJarFilePath() {
    return CompomicsWrapper.getJarFilePath(this.getClass().getResource("SearchGUI.class").getPath(), "SearchGUI");
}

https://github.com/compomics/searchgui/blob/master/src/main/java/eu/isas/searchgui/cmd/SearchCLI.java#L359

It poses a few problems:

  1. To run from a container, we need to copy the files to an external directory (since files within containers are read-only).
  2. Multiple runs of SearchGUI can't share the same jars, because they will overwrite each other's configs.

It would be nice to be able to override both the ./resources/conf and ./resources/temp directories, for example with environment variables.

hbarsnes commented 1 year ago

You can change the temp directories via PathSettingsCLI: https://github.com/compomics/searchgui/wiki/SearchCLI#pathsettingscli

If I remember correctly, you can also use the PathSettingsCLI parameters directly as part of a standard SearchCLI command line.

You can also change the path settings using the graphical user interface via the Edit > Resource Settings menu option.

max-l commented 1 year ago

If I set paths with PathSettingsCLI or SearchCLI, where will they get saved?

I suppose they get saved in:

./SearchGUI-4.2.8/resources/conf/paths.txt

Can I specify the directory where paths.txt is saved?

If I can't specify it, I don't see how I could run multiple instances of SearchGUI from the same jars, because they will all save in the same ./SearchGUI-4.2.8/resources/conf/paths.txt

hbarsnes commented 1 year ago

If I set paths with PathSettingsCLI or SearchCLI, where will they get saved?

I suppose they get saved in:

./SearchGUI-4.2.8/resources/conf/paths.txt

Correct.

Can I also specify the directory where it is saved ?

No, I'm afraid this is not supported at the moment (as it is assumed that you have write access to the directory where SearchGUI is installed).

If I can't specify it, I don't see how I could run multiple instances of SearchGUI from the same jars, because they will all save in the same ./SearchGUI-4.2.8/resources/conf/paths.txt

Indeed you are right. Note the following from the SearchCLI wiki: "Please note that the search engines use indexes and temporary files stored locally in their folders. It is thus important to use a single instance of SearchCLI at a time. In distributed setups, we recommend keeping a clean copy of SearchGUI, and distribute it to the different workers prior to execution."

Not sure how easy it would be to change this. But I'm also not sure how well it would work, given that the included search engines generally assume that they are in control of the distribution of the workload (i.e. with regards to parallel processing etc.). Hence, I'm unsure what could be gained by running more than one instance of the jar file at the same time instead of one at a time in a row?

May I ask why you want to run multiple instances of SearchGUI from the same jar (at the same time)?

max-l commented 1 year ago

I have a pipeline that runs thousands of instances of SearchGUI, and also PeptideShaker.

In my code I have a directory COMPOMICS_HOME where I store installations of both programs.

Every task needs to copy COMPOMICS_HOME into its bucket (__task_output_dir) before it can run:

copy_compomics_home_cmd = "cp -R $COMPOMICS_HOME $task_output_dir"
searchgui_jar_path = '$task_output_dir/compomics/SearchGUI-4.2.8/SearchGUI-4.2.8.jar'
peptideshaker_jar_path = '$__task_output_dir/compomics/PeptideShaker-2.2.23/PeptideShaker-2.2.23.jar'

Then I can run them: java $searchgui_jar_path ...
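
The per-task copy described above could also be done from Java rather than by shelling out to cp. The following is only a sketch under stated assumptions: the class and method names (PerTaskInstall, copyTree, deleteTree) are hypothetical and are not part of SearchGUI or of the pipeline in this thread. Note that copyTree copies the contents of the source directory into the target directory, which differs slightly from cp -R when the target already exists.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: gives each task its own private copy of an installation folder.
final class PerTaskInstall {

    /** Copies the contents of sourceDir into targetDir, creating targetDir if needed. */
    static void copyTree(Path sourceDir, Path targetDir) throws IOException {
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(sourceDir)) {
            // Files.walk visits a directory before its children,
            // so every parent exists by the time a file is copied.
            entries = walk.collect(Collectors.toList());
        }
        for (Path source : entries) {
            Path target = targetDir.resolve(sourceDir.relativize(source).toString());
            if (Files.isDirectory(source)) {
                Files.createDirectories(target);
            } else {
                Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }

    /** Deletes dir and everything below it, e.g. once the task has finished. */
    static void deleteTree(Path dir) throws IOException {
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            entries = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PerTaskInstall <compomics_home> <task_output_dir>");
            return;
        }
        copyTree(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

The cost discussed in this thread remains: every task still pays for a full copy of the installation, which is what a separate config directory would avoid.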

It's not the end of the world, but there is this other use case where I have a containerized (https://apptainer.org/) program (custom code that uses a couple of libraries, among them SearchGUI). The code needs to first copy COMPOMICS_HOME before it can start.

Of course the container will need a temp directory anyways, so what's the big problem with throwing the whole COMPOMICS_HOME in there ?

Again, it's not the end of the world, just a bit of space wasted and a few extra steps required in the code.

I just find that it would be nicer if there was a separate "config directory", whose location you could decide, separate from both the working directory and (especially) the code directory, where nothing changes unless you change the version of the software.

max-l commented 1 year ago

Implementation wise, it would be quite easy:

The path of resources/conf/paths.txt could be taken from an environment variable if it is set; otherwise it would be derived from the jar path, just as it is currently.

Even better would be to do the same for all the paths inside resources/conf/paths.txt, but that's probably a bit more work.
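
A minimal sketch of the fallback proposed above, to make the idea concrete. The variable name SEARCHGUI_PATHS_FILE and the class PathsFileLocator are hypothetical: SearchGUI defines neither, and this is not how its code is actually organized. The environment is passed in as a map so that the lookup can be exercised without touching the real process environment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical sketch: choose the location of paths.txt.
final class PathsFileLocator {

    // Assumed variable name, used for illustration only.
    static final String ENV_VAR = "SEARCHGUI_PATHS_FILE";

    /**
     * Returns the location given by the environment variable when it is set
     * and non-blank; otherwise falls back to <jarDir>/resources/conf/paths.txt.
     */
    static Path locate(Map<String, String> env, Path jarDir) {
        String override = env.get(ENV_VAR);
        if (override != null && !override.trim().isEmpty()) {
            return Paths.get(override.trim());
        }
        return jarDir.resolve("resources").resolve("conf").resolve("paths.txt");
    }

    public static void main(String[] args) {
        Path jarDir = Paths.get("SearchGUI-4.2.8");
        System.out.println(locate(System.getenv(), jarDir));
    }
}
```

With the variable unset, behaviour is unchanged, so existing desktop installations would not be affected.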

max-l commented 1 year ago

To further sell the case that a configurable "config dir" would be nice, after I run my 10k instances of SearchGUI+PeptideShaker, I end up deleting the copy of $COMPOMICS_HOME that I have made in order to run them concurrently, because the space adds up.

By deleting the per-task COMPOMICS_HOME copy, I am perhaps losing useful debugging info: if I discover junk in the data in a month or two, I might want to investigate whether something was wrong in the configs. Perhaps there is nothing worth investigating in there, but given that config data is usually very small, the cost of keeping it "just in case" is tiny.

BTW, the -use_log_folder 0 in recent versions is really useful, for the same reason: I get to decide where the logs go, because I get to decide where I send stdout, and as a bonus, whenever I need to investigate something, there is only one place to look into. My programs that run before and after SearchGUI also log in that same place, it makes debugging much simpler.

I hope this doesn't come across as pedantic. We really like SearchGUI and PeptideShaker and the way they are evolving; I think the ability to decide where the config files go would greatly improve the user experience (or developer experience!).

hbarsnes commented 1 year ago

I hope this doesn't come across as pedantic. We really like SearchGUI and PeptideShaker and the way they are evolving; I think the ability to decide where the config files go would greatly improve the user experience (or developer experience!).

No worries. We're always happy to get input on how we can further improve our software.

Even better would be to do the same for all the paths inside resources/conf/paths.txt, but that's probably a bit more work.

If you use the temp_folder option as part of your SearchCLI command lines this should set all of the paths to the same folder, hence there should be no need to set the specific paths in addition?

BTW, what happens if you simply provide different temp folder paths via the temp_folder option for each instance? As far as I can tell SearchGUI should then use the provided folder and not the one in the resources folder. I haven't actually tested this though.

max-l commented 1 year ago

The -temp_folder option helps and I use it, but there are still files written to subdirectories of:

CompomicsWrapper.getJarFilePath(this.getClass().getResource("SearchGUI.class").getPath(), "SearchGUI")

This is why I need to copy the whole $COMPOMICS_HOME (a directory where I have a SearchGUI and PeptideShaker installation) into the working folder of each of the 15k jobs I run.

Other config directories I am able to override are those that are defined via System.getProperty("user.home"), e.g.:

GENE_MAPPING_FOLDER = System.getProperty("user.home") + "/.compomics/gene_mappings/";

java -Duser.home=

That's very helpful

    JAVA_USER_HOME=$__scratch_dir/compomics

    mkdir -p $JAVA_USER_HOME/.peptideshaker
    cp $__pipeline_code_dir/python-lib/openprot_pipeline/mass_spec/exportFactory.json $JAVA_USER_HOME/.peptideshaker

    java -Duser.home=$JAVA_USER_HOME -cp {peptideshaker_jar_path} eu.isas.peptideshaker.cmd.ReportCLI \\
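
The reason -Duser.home works is that any folder built from the user.home system property follows whatever value the JVM was started with. The sketch below reproduces only the string concatenation quoted above for GENE_MAPPING_FOLDER; the class and method names are hypothetical, and the properties are passed in explicitly so the effect can be shown without restarting the JVM.

```java
import java.util.Properties;

// Hypothetical illustration of a folder derived from the user.home property.
final class UserHomeDerivedFolder {

    /** Mirrors the concatenation quoted in this thread for the gene mapping folder. */
    static String geneMappingFolder(Properties props) {
        return props.getProperty("user.home") + "/.compomics/gene_mappings/";
    }

    public static void main(String[] args) {
        // Started as `java -Duser.home=<dir> ...`, this prints a path under <dir>.
        System.out.println(geneMappingFolder(System.getProperties()));
    }
}
```

Folders that are instead derived from the jar location are not affected by this property, which is why the installation copy is still needed.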

What would really be nice is an environment variable, e.g. COMPOMICS_CONFIG_HOME, where all configs go.

Any config-related code would check whether this env variable exists before looking elsewhere, e.g. in /resources/conf/paths.txt.

hbarsnes commented 1 year ago

Would it be a viable alternative to add a new specific path option called, for example, -config_folder that allows you to set the config folder via the SearchCLI or PathSettingsCLI command lines?

I'm still not sure how the search engines will react though, as some of them still write to their local temp folders, which is something we cannot override for all of them. Hence we may fix the SearchGUI-specific issues (which in any case is a plus), but it may just lead to other issues down the line.

max-l commented 1 year ago

I actually just ran into a problem with the "save compomics configs because they might help debugging later" idea.

I ran jobs in a datacenter where I have a limit on the number of files (1 million), and I busted the limit.

So the only place I can copy COMPOMICS_HOME is on $SLURM_TMPDIR, a directory that disappears once the job ends.

So keeping configs around is not even an option, if they are mixed in the whole compomics installation.

max-l commented 1 year ago

-config_folder would be great !

I just thought that an env variable was easier on your part, because you don't need to pass it from the CLI programs all the way down to every bit of code that uses it; you can just sprinkle a few "if var exists" checks in strategic places. The other advantage is that all compomics programs can use the same variable, and when you call 7 of them in a row (my case right now) you just need to set it once in the env.

A pattern that I like for CLI toolkits (with multiple CLI tools) is that when an argument is common to all tools, you have the choice to set it either as an env var or as a command arg, and the arg overrides the env var if both are set.

That being said, -config_folder will definitely help.
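
The "arg overrides env var" pattern can be sketched as a three-level lookup: command line first, then environment, then a built-in default. This is an illustration only. -config_folder and COMPOMICS_CONFIG_HOME are the names floated in this thread, not confirmed SearchGUI features at this point, and the ConfigFolderOption class is hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: command-line argument > environment variable > default.
final class ConfigFolderOption {

    static final String FLAG = "-config_folder";           // option name suggested in this thread
    static final String ENV_VAR = "COMPOMICS_CONFIG_HOME"; // variable name suggested in this thread

    /** Returns the value following FLAG on the command line, or null if absent. */
    static String fromArgs(String[] args) {
        for (int i = 0; i < args.length - 1; i++) {
            if (FLAG.equals(args[i])) {
                return args[i + 1];
            }
        }
        return null;
    }

    /** Applies the precedence: explicit argument, then environment, then default. */
    static String resolve(String[] args, Map<String, String> env, String defaultFolder) {
        String fromArgs = fromArgs(args);
        if (fromArgs != null) {
            return fromArgs;
        }
        String fromEnv = env.get(ENV_VAR);
        if (fromEnv != null && !fromEnv.isEmpty()) {
            return fromEnv;
        }
        return defaultFolder;
    }

    public static void main(String[] args) {
        System.out.println(resolve(args, System.getenv(), "resources/conf"));
    }
}
```

Because the lookup lives in one place, several command line tools could share it: setting the variable once covers a chain of calls, while a single call can still be redirected with the flag.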

hbarsnes commented 1 year ago

I just thought that an env variable was easier on your part, because you don't need to pass it from the CLI programs all the way down to every bit of code that uses it; you can just sprinkle a few "if var exists" checks in strategic places.

I was thinking the other way around. That it would be easier to add one more variable into the same setup that we already have. As then I can do something similar to what we already do for the log folder:

if (pathSettingsCLIInputBean.getLogFolder() != null) {
     PeptideShakerCLI.redirectErrorStream(pathSettingsCLIInputBean.getLogFolder());
} else {
     PeptideShakerCLI.redirectErrorStream(new File(PeptideShaker.getJarFilePath() + File.separator + "resources"));
}

The other advantage, is that all compomics programs can use the same variable, and when you call 7 of them in a row (my case right now) you just need to set it once in the env.

But won't that only work if you have the same settings for all of them? For example, I think you may end up in trouble if you have different species for the different runs (as the gene mappings will be different). And there may also be issues with more than one instance accessing the same files? So perhaps safer to have one folder per run in order to avoid such potential issues?

Anyway, I'll see what I can do. Probably won't be until after Easter though.

max-l commented 1 year ago

But won't that only work if you have the same settings for all of them? For example, I think you may end up in trouble if you have different species for the different runs (as the gene mappings will be different). And there may also be issues with more than one instance accessing the same files? So perhaps safer to have one folder per run in order to avoid such potential issues?

My thoughts on this are perhaps "philosophical", but I'll share them anyways ;-)

I like functional programming very much, in particular the idea that a function is an order of magnitude simpler (to understand, to use, to debug, etc.) when its output is determined only by its inputs.

That can't be the case if the function has a "memory", because every time you call it, it can "remember" things from the previous call.

In my pipelines, things are much simpler when the programs behave like functions. Containers (e.g. Apptainer) are one way to achieve this: they have a read-only file system, and you can only write in externally mapped folders. You get a strong guarantee that the "universe is reset" on every call.

I containerized SearchGUI and PeptideShaker for exactly this purpose. The first problem I had was that the code (jars, etc.) could not be "inside" the container (on the read-only file system), because it has to write in the same place as its code.

So in order to "reset the universe on every call", I copy a fresh install of compomics to a folder outside the container, and when it's done, I "delete the universe" so I don't bust my file count limit.

In a desktop environment, it's actually a feature (not a bug) when a software installation "remembers" its configuration; for pipelines, it's another matter.

In my usage of SearchGUI+PeptideShaker, if I could have everything that isn't config-related on a read-only drive (like the code), and the configs either as command line args, env vars or on an external drive, then it would behave like a memoryless function!
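
One way to picture the "memoryless function" idea is a command line in which every writable location is derived from the task directory, so that nothing depends on state left by a previous call. The sketch below only assembles the argument list. It uses the options mentioned in this thread (-Duser.home, -temp_folder, -use_log_folder 0) and the SearchCLI class name from the linked source file; the home and temp subfolder names and the IsolatedSearchCommand class are assumptions, and the search-specific options are deliberately left out. As noted earlier in the thread, some files are still written next to the jar, so this alone does not make the installation read-only.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: build a SearchCLI command whose writable paths all live under taskDir.
final class IsolatedSearchCommand {

    static List<String> build(Path searchGuiJar, Path taskDir) {
        List<String> command = new ArrayList<>();
        command.add("java");
        // Folders derived from user.home end up under the task directory.
        command.add("-Duser.home=" + taskDir.resolve("home"));
        command.add("-cp");
        command.add(searchGuiJar.toString());
        command.add("eu.isas.searchgui.cmd.SearchCLI");
        // Per-task temp folder, as discussed in this thread.
        command.add("-temp_folder");
        command.add(taskDir.resolve("temp").toString());
        // Send log output to stdout/stderr instead of a log folder.
        command.add("-use_log_folder");
        command.add("0");
        // The search-specific options (input files, parameters, output) would follow here.
        return command;
    }

    public static void main(String[] args) {
        Path jar = Paths.get("SearchGUI-4.2.8", "SearchGUI-4.2.8.jar");
        System.out.println(String.join(" ", build(jar, Paths.get("task_0001"))));
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder once the remaining options have been appended.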

hbarsnes commented 1 year ago

I've deployed a beta version of SearchGUI that supports the config_folder option here: https://genesis.ugent.be/archiva/repository/maven2/eu/isas/searchgui/SearchGUI/4.2.10-beta/.

However, as far as I can tell the conf folder is not used when using the -temp_folder option (at least not in this new version). But perhaps you can try this beta version and see which files, if any, are still written to the config folder?