Open natefoo opened 1 year ago
Might it be that this is not for the latest version? The stack fits https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#L60, i.e. 22.05?
Regardless, if I interpret the code correctly, the `tool_data` dir is walked to find all the `.loc` and `.loc.sample` files. Unfortunately the `tool_data` dir also contains the data... If I remember correctly, loc files are actually only in the root of the `tool_data` dir and in one specific subdir (for the shed tools), or? So we could limit the walk. Alternatively we could separate the dir containing the loc files and the dir containing the data...
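A pruned walk along these lines might look like the sketch below. This is an assumption-laden illustration, not Galaxy's actual code: the shed subdirectory name is taken from the example paths shown later in this thread, and `find_loc_files` is a hypothetical helper.

```python
import os

def find_loc_files(tool_data_path, shed_dirs=("toolshed.g2.bx.psu.edu",)):
    """Collect .loc/.loc.sample files from the root of tool_data_path and
    from the tool-shed subtrees only, skipping reference-data directories."""
    found = []
    # Root of tool_data_path: no recursion needed.
    for name in os.listdir(tool_data_path):
        if name.endswith((".loc", ".loc.sample")):
            found.append(os.path.join(tool_data_path, name))
    # Only descend into the known shed subtrees.
    for shed in shed_dirs:
        shed_root = os.path.join(tool_data_path, shed)
        if not os.path.isdir(shed_root):
            continue
        for dirpath, _dirnames, filenames in os.walk(shed_root):
            for name in filenames:
                if name.endswith((".loc", ".loc.sample")):
                    found.append(os.path.join(dirpath, name))
    return found
```

The point is that everything else under `tool_data_path` (genome indexes, reference data) is never entered, which is where the time goes on slow filesystems.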
Yes, apologies, I put 23.0 in the report since it's still present, but you're right, this is on a 22.05 instance.
Do we need to locate .loc files at all? They should be directly referenced by tool data table config files. Do we support .loc files that aren't referenced by a tool data table?
Ok, taking a minute to actually look at the stack, it looks like the walk is performed when:

1. the location file path in the tool data table conf is not absolute,
2. the part of the path preceding the location file is not equal to `tool_data_path`, and
3. the location file is not at the root of `tool_data_path`.

Kind of funny we need to do this when Galaxy created the location file and should know where it is...
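The fallback decision being described could be sketched roughly as follows. Hypothetical names throughout; the real logic lives in `lib/galaxy/tools/data/__init__.py` and is more involved.

```python
import os

def needs_walk(loc_path, tool_data_path):
    """Return True when the configured location file can't be resolved
    directly and Galaxy would fall back to walking tool_data_path for it."""
    if os.path.isabs(loc_path):
        return False  # absolute paths are used as-is
    head = os.path.dirname(loc_path)
    if not head:
        return False  # bare filename: resolved at the root of tool_data_path
    if os.path.normpath(head) == os.path.normpath(tool_data_path):
        return False  # path already rooted at tool_data_path
    return True  # otherwise: walk to locate the file
```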
We do also allow for bare loc files in tools, but we could break that and see who screams (probably @blankenberg) ?
If nothing else we should make it a switch that we eventually default to off, but I guess that is unrelated to this issue.
I guess I would do a small scream :)
I manage my data tables for which no data manager exists by creating an `xyz.loc` file in the `tool-data/` dir. For instance I create `./gene_sets.loc` and completely ignore all the other data tables:
```
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/1ebd14235b92/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/eba36e001f45/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/dd4df992d93d/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/76d290331481/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/333a6e13b622/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/46cccc52be5f/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/a37612abf7f9/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/de2c5e5a206c/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/af814359d244/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/9301937c9037/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/ea04b737afa0/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/1759d845181e/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/38b6d12edc68/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/gffcompare/2bb86e2c417f/gene_sets.loc
```
I never understood what these are good for anyway. Their content is merged on startup anyway, or? Would be really cool to understand this.

The reason to do it this way is that I wanted to manage the data tables automatically with Ansible... and referring to the "versioned" data tables seemed to have no additional value.
Yeah, they are merged. Honestly, this feature is really pointless, since the loc files installed with toolshed tools are basically never populated but they all have to be read on startup. If you do end up populating one by hand, you will probably manually add it to `tool_data_table_conf.xml` and manage all entries in a single loc file somewhere else. And if there's a DM, it updates the loc file in the DM's repo dir on disk, not any of the tool loc files.
IMO, the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case. In a dev use case, sure, parsing all tool XMLs, rebuilding the navigator from scratch, etc. may be useful. But in admin use cases, and possibly even in dev use cases, I think it's more likely to be a waste of time. A restart is more likely due to some minor config change, or to deal with a failure or similar, and less frequently because the navigator menu changed or the loc files changed. Yet the startup penalty must be paid by admins each time regardless. This is especially noticeable on larger systems, where it can take forever to start up with a fully loaded navigator.
It would be much nicer to have some management commands like:

```
galaxy rebuild navigator
galaxy rebuild tool_cache
galaxy rebuild tool_data
galaxy rebuild all
```

or similar, and maybe a dev mode that would do all these things automatically for those who prefer it.
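Such a command set could be sketched with stdlib `argparse`. This is entirely hypothetical: none of these subcommands exist in Galaxy today; it only illustrates the interface being proposed.

```python
import argparse

def build_parser():
    """Build a parser for a hypothetical `galaxy rebuild <target>` command."""
    parser = argparse.ArgumentParser(prog="galaxy")
    sub = parser.add_subparsers(dest="command", required=True)
    rebuild = sub.add_parser("rebuild", help="rebuild startup caches offline")
    rebuild.add_argument(
        "target",
        choices=["navigator", "tool_cache", "tool_data", "all"],
        help="which cache to rebuild",
    )
    return parser
```

With this, `galaxy rebuild tool_data` parses to `Namespace(command='rebuild', target='tool_data')`, and each target would dispatch to the corresponding cache-building routine.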
The tool cache, for example, dropped startup time drastically (from 3 mins in a CVMFS setup to < 30 seconds), but as I recall it was unfortunately dropped due to the extra burden of maintaining a central cache. Rather, I think it would be better to explicitly hand the task of building the required caches over to the admin as needed. It would be even nicer if we could go a step further and barely touch the filesystem on restarts.
> with toolshed tools are basically never populated
That may be true now, maybe! But back when I was using data managers through the UI they would write to the loc file in their tool directory. Proceed with a giant warning here.
> the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case
I don't think that's the right way to look at it. We optimize for correctness, and dealing with a cache is, with the current architecture and availability of solutions (sqlite on NFS, looking at you), really complicated. That is my takeaway from having added different flavors of broken caches over the years to address this.
> But back when I was using data managers through the UI they would write to the loc file in their tool directory.
This is still the case on my system. Still, the question is what versioned data tables are good for. One thing that comes to mind is that it is immediately clear which data manager version produced the data. I guess this is already sufficient reason. Having all the loc files etc. in the corresponding dirs of the tools consuming the data tables is probably just a technicality.
Do you think it would be enough (i.e. correct) to traverse only the root of `tool-data` and the toolshed subtrees (e.g. `./toolshed.g2.bx.psu.edu/`)? I guess this would exclude all the data and yield some speedup, or?
> I guess this would exclude all the data and yield some speedup, or?
Yes, if carefully done that could work.
> But back when I was using data managers through the UI they would write to the loc file in their tool directory.
The DM loc file in the tool dir is used, yes, but all of the empty loc files for all the tools themselves that reference a data table are also read and will almost surely never contain anything unless an admin decides to populate one by hand for some reason.
Just hit this issue with a veeeery long startup.
> Ok, taking a minute to actually look at the stack it looks like the walk is performed when: 1. the location file path in the tool data table conf is not absolute, 2. the part of the path preceding the location file is not equal to `tool_data_path`, and 3. the location file is not at the root of `tool_data_path`.
>
> Kind of funny we need to do this when Galaxy created the location file and should know where it is...
Looking at https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#LL505C37-L505C37, the walk is performed every time, right?
There's a cache, but with a 1-second lifetime — not sure if that's intended?
Looked at this a bit and here are my comments.
I also think that we need to attack the `ToolDataPathFiles.update_files` function, which locates all the loc files and is executed if there are calls to `ToolDataPathFiles.exists` that are more than 1 sec apart. `exists` is called multiple times (I think independent of absolute/relative path names, see here and here) during the initialization of Galaxy. I guess even for fast file systems this will lead to multiple calls to the update function during startup.
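The mechanism being described can be boiled down to a time-gated cache. The sketch below is a simplification with hypothetical names, not the real `ToolDataPathFiles`; it only shows why a 1-second interval forces repeated re-walks during a long startup.

```python
import time

class CachedPathFiles:
    """Rebuild the cached file set at most once per `update_interval` seconds."""

    def __init__(self, list_files, update_interval=1.0):
        self._list_files = list_files      # callable performing the expensive walk
        self._update_interval = update_interval
        self._files = set()
        self._update_time = 0.0

    def update_files(self):
        self._files = set(self._list_files())
        self._update_time = time.time()

    def exists(self, path):
        # Any two calls more than `update_interval` apart trigger a re-walk,
        # so a tiny interval plus a slow walk means startup pays repeatedly.
        if time.time() - self._update_time > self._update_interval:
            self.update_files()
        return path in self._files
```

With a larger (or configurable) interval, all the `exists` calls made during one startup would hit a single walk.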
My ideas at the moment are the following:

- `.loc` files are either located in the root or in trees named like the used toolsheds, so it should be sufficient to walk those.
- Separate the `tool_data` directory (i.e. the dir containing all the loc files) from the dir containing the actual data. I don't think that is currently possible (it could be done manually, but I mean via config variables).

I was wrong about the last part. It can be controlled with `galaxy_data_manager_data_path`.
@natefoo does your `tool_data_path` contain anything else than tool data tables (`.loc` and `.loc.sample` files) and tool data table config files?
I have a branch ready that would implement a configurable update interval.
But each tool data table should be referred to by exactly one tool data table config. For each of those there seem to be at most 3 file existence checks, depending on whether the file name is given as absolute or relative (here and the following lines). So one question might be whether those 3 file existence checks can really be less efficient than an iteration over a directory containing all the files whose existence is checked (actually it contains more, e.g. unused tool data table config files for each of the shed-installed tools and, in the worst case, a lot of tool data).
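The trade-off can be measured directly. A rough benchmark sketch, with made-up helper names; on a directory full of reference data, a handful of `os.path.exists` calls per table will typically beat walking everything, and on NFS the gap only grows.

```python
import os
import time

def time_stat_checks(paths):
    """Time N direct existence checks; returns (elapsed_seconds, results)."""
    start = time.perf_counter()
    results = [os.path.exists(p) for p in paths]
    return time.perf_counter() - start, results

def time_full_walk(root):
    """Time a full walk collecting every file under root."""
    start = time.perf_counter()
    found = []
    for dirpath, _dirs, files in os.walk(root):
        found.extend(os.path.join(dirpath, f) for f in files)
    return time.perf_counter() - start, found
```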
It might also be an option to just update cache entries when the `exists` function is called (and just drop the loop).
For me, storing the tool data elsewhere (by using `galaxy_data_manager_data_path`) solved the problem. But I'm fine if someone thinks I should open a PR based on the branch or a modification of it. Anyway, I'm happy to discuss.
**Describe the bug**
On Galaxy load, paths referenced in location files referenced by data tables are fully walked. This can delay startup by minutes to hours if the reference data are on a slow filesystem and contain a huge number of files or directories. Walking just a subset of this data takes 94 minutes on the instance in question.

Does anyone know why we do this? Dynamic options? If it's necessary, can we put some kind of limitation on it, or allow tool data tables to define that they should be excluded from this discovery process?

**Galaxy Version and/or server at which you observed the bug**
Galaxy Version: 23.0.dev
Commit: latest

**To Reproduce**
Steps to reproduce the behavior:
`data_manager_gtdbtk_database_installer`

**Expected behavior**
Data table load only inspects files referenced in the location file or limits its traversal in some way.

**Screenshots**
Here's the stack during walking:

**Additional context**
It looks like this goes way back. The process was improved in #3909 but a full walk is still performed.