galaxyproject / galaxy


Data table directories are walked at startup #15307

Open natefoo opened 1 year ago

natefoo commented 1 year ago

Describe the bug
On Galaxy load, the paths referenced in the location files of data tables are fully walked. This can delay startup by minutes to hours if the reference data are on a slow filesystem and contain a huge number of files or directories. Walking just a subset of this data takes 94 minutes on the instance in question.
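For context, a minimal sketch of how to measure such a traversal on a given filesystem (the path below is a placeholder):

import os
import time

start = time.time()
n_files = 0
# traverse the reference data tree the same way a naive discovery pass would
for _dirpath, _dirnames, filenames in os.walk("/path/to/tool-data"):
    n_files += len(filenames)
print(f"walked {n_files} files in {time.time() - start:.1f}s")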

Does anyone know why we do this? Dynamic options? If it's necessary, can we put some kind of limitation on it or allow tool data tables to define that they should be excluded from this discovery process?

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 23.0.dev
Commit: latest

To Reproduce
Steps to reproduce the behavior:

  1. Install data_manager_gtdbtk_database_installer
  2. Run the DM and select release 207.
  3. Upon completion, restart Galaxy. There will be a long wait on startup, depending on IO performance.

Expected behavior
Data table loading only inspects the files referenced in the location file, or limits its traversal in some way.

Screenshots
Here's the stack during the walk:

root@galaxy-psu-prod:~# ~glxyhpc/py-spy/bin/py-spy dump --pid 2065390 --full-filenames
Process 2065390: /corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/python3 /corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/gunicorn galaxy.webapps.galaxy.fast_factory:factory() --timeout 600 --pythonpath lib -k galaxy.webapps.galaxy.workers.Worker -b localhost:8080 --workers=4 --config python:galaxy.web_stack.gunicorn_config --preload
Python v3.9.15 (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/bin/python3.9)

Thread 2065390 (idle): "MainThread"
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:357)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/__python@3.9/lib/python3.9/os.py:418)
    <listcomp> (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:60)
    update_files (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:60)
    tool_data_path_files (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:51)
    exists (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:75)
    configure_and_load (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:518)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:438)
    from_elem (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:295)
    load_from_config_file (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:150)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:97)
    _configure_tool_data_tables (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/app.py:305)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/app.py:620)
    app_pair (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/webapps/galaxy/buildapp.py:59)
    factory (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/webapps/galaxy/fast_factory.py:63)
    import_app (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/util.py:412)
    load_wsgiapp (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:48)
    load (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:58)
    wsgi (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:67)
    setup (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/arbiter.py:118)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/arbiter.py:58)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:72)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:231)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:67)
    <module> (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/gunicorn:8)

Additional context
It looks like this goes way back. The process was improved in #3909, but a full walk is still performed.

bernt-matthias commented 1 year ago

Might it be that this is not from the latest version? The stack fits https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#L60, i.e. 22.05?

Regardless, if I interpret the code correctly, the tool_data dir is walked to find all the .loc and .loc.sample files. Unfortunately, the tool_data dir also contains the data...

If I remember correctly, loc files only live in the root of the tool_data dir and in one specific subdir (for the shed tools), right? So we could limit the walk, as sketched below. Alternatively, we could separate the dir containing the loc files from the dir containing the data...
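A rough sketch of what such a limited discovery could look like, assuming loc files only live near the top of the tree; the depth cutoff and suffixes are illustrative assumptions, not the current code:

import os

def find_loc_files(tool_data_path, max_depth=5, suffixes=(".loc", ".loc.sample")):
    """Collect loc files without descending into deep reference-data trees."""
    root_depth = tool_data_path.rstrip(os.sep).count(os.sep)
    found = []
    for dirpath, dirnames, filenames in os.walk(tool_data_path):
        # depth 5 covers the shed layout: toolshed/repos/owner/repo/revision
        if dirpath.count(os.sep) - root_depth >= max_depth:
            dirnames[:] = []  # prune: do not descend any further
        found.extend(
            os.path.join(dirpath, name)
            for name in filenames
            if name.endswith(suffixes)
        )
    return found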

natefoo commented 1 year ago

Yes, apologies, I put 23.0 in the report since it's still present, but you're right, this is on a 22.05 instance.

Do we need to locate .loc files at all? They should be directly referenced by tool data table config files. Do we support .loc files that aren't referenced by a tool data table?

natefoo commented 1 year ago

Ok, taking a minute to actually look at the stack, it looks like the walk is performed when:

  1. the location file path in the tool data table conf is not absolute,
  2. the part of the path preceding the location file is not equal to tool_data_path, and
  3. the location file is not at the root of tool_data_path.

Kind of funny we need to do this when Galaxy created the location file and should know where it is...
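In other words, the resolution logic seems to amount to something like this (an illustrative sketch with made-up names, not the actual Galaxy code):

import os

def loc_file_requires_walk(loc_path, tool_data_path):
    # 1. absolute location file paths can be checked directly
    if os.path.isabs(loc_path):
        return False
    # 2. paths already rooted at tool_data_path resolve without searching
    if loc_path.startswith(tool_data_path.rstrip(os.sep) + os.sep):
        return False
    # 3. files sitting at the root of tool_data_path are found immediately
    if os.path.exists(os.path.join(tool_data_path, os.path.basename(loc_path))):
        return False
    # otherwise the whole tree is walked to locate the file
    return True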

mvdbeek commented 1 year ago

We also allow bare loc files in tools, but we could break that and see who screams (probably @blankenberg)?

natefoo commented 1 year ago

If nothing else we should make it a switch that we eventually default to off, but I guess that is unrelated to this issue.

bernt-matthias commented 1 year ago

I guess I would do a small scream :)

I manage my data tables for which no data manager exists by creating an xyz.loc file in the tool-data/ dir. For instance, I create ./gene_sets.loc and completely ignore all the other data tables:

./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/1ebd14235b92/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/eba36e001f45/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/dd4df992d93d/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/76d290331481/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/333a6e13b622/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/46cccc52be5f/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/a37612abf7f9/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/de2c5e5a206c/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/af814359d244/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/9301937c9037/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/ea04b737afa0/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/1759d845181e/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/38b6d12edc68/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/gffcompare/2bb86e2c417f/gene_sets.loc

I never understood what those per-repo loc files are good for anyway. Their content is merged on startup anyway, right? It would be really cool to understand this.

The reason for doing so is that I wanted to manage the data tables automatically with Ansible, and referring to the "versioned" data tables seemed to add no value.

natefoo commented 1 year ago

Yeah, they are merged. Honestly, this feature is really pointless, since the loc files installed with toolshed tools are basically never populated, yet they all have to be read on startup. If you do end up populating one by hand, you will probably add it to tool_data_table_conf.xml manually and manage all entries in a single loc file somewhere else. And if there's a DM, it updates the loc file in the DM's repo dir on disk, not any of the tool loc files.
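Conceptually, the merge amounts to something like the following (a simplified sketch, not Galaxy's actual implementation):

def merge_loc_files(paths, comment_char="#"):
    """Union of the rows of several tab-separated loc files, in order of appearance."""
    seen = set()
    rows = []
    for path in paths:
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                # skip blank lines and comments
                if not line or line.startswith(comment_char):
                    continue
                fields = tuple(line.split("\t"))
                if fields not in seen:
                    seen.add(fields)
                    rows.append(fields)
    return rows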

nuwang commented 1 year ago

IMO, the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much for the admin use case. In a dev use case, sure, parsing all tool XMLs, rebuilding the navigator from scratch, etc. may be useful. But in admin use cases, and possibly even in dev use cases, it's more likely to be a waste of time. A restart is more likely to be due to some minor config change, or to deal with a failure or similar, and less frequently because the navigator menu or the loc files changed. Yet the startup penalty must be paid by admins each time regardless. This is especially noticeable on larger systems, where it can take forever to start up with a fully loaded navigator.

It would be much nicer to have some management commands like:

galaxy rebuild navigator
galaxy rebuild tool_cache
galaxy rebuild tool_data
galaxy rebuild all

or similar, and maybe a dev mode that would do all these things automatically for those who prefer it.

The tool cache, for example, cut startup time drastically (from 3 minutes in a CVMFS setup to under 30 seconds), but it was unfortunately dropped due to the extra burden, as I recall, of maintaining a central cache. Rather, I think it would be better to explicitly hand the task of building the required caches over to the admin as needed. It would be even nicer if we could go a step further and barely touch the filesystem on restarts.

mvdbeek commented 1 year ago

> with toolshed tools are basically never populated

That may be true now, maybe! But back when I was using data managers through the UI they would write to the loc file in their tool directory. Proceed with a giant warning here.

> the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case

I don't think that's the right way to look at it. We optimize for correctness, and dealing with a cache is, with the current architecture and availability of solutions (sqlite on NFS, looking at you), really complicated. That is my takeaway from having added different flavors of broken caches over the years to address this.

bernt-matthias commented 1 year ago

> But back when I was using data managers through the UI they would write to the loc file in their tool directory.

This is still the case on my system. Still, the question is what versioned data tables are good for. One thing that comes to mind is that it's immediately clear which data manager version produced the data. I guess that's already a sufficient reason.

Having all the loc files etc. in the corresponding dirs of the tools consuming the data tables is probably just a technicality.

Do you think it would be enough (i.e. correct) to traverse only …

I guess this would exclude all the data and yield some speedup, right?

mvdbeek commented 1 year ago

> I guess this would exclude all the data and yield some speedup, right?

Yes, if carefully done, that could work.

natefoo commented 1 year ago

> But back when I was using data managers through the UI they would write to the loc file in their tool directory.

The DM loc file in the tool dir is used, yes, but all of the empty loc files for the tools themselves that reference a data table are also read, and they will almost surely never contain anything unless an admin decides to populate one by hand for some reason.

abretaud commented 1 year ago

Just hit this issue, with a veeeery long startup.

> Ok, taking a minute to actually look at the stack, it looks like the walk is performed when:
>
> 1. the location file path in the tool data table conf is not absolute,
> 2. the part of the path preceding the location file is not equal to `tool_data_path`, and
> 3. the location file is not at the root of `tool_data_path`.
>
> Kind of funny we need to do this when Galaxy created the location file and should know where it is...

Looking at https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#LL505C37-L505C37, the walk is performed every time, right?

There's a cache, but with a 1-second lifetime; not sure if that's intended?

bernt-matthias commented 6 months ago

Looked at this a bit and here are my comments.

I also think that we need to attack the ToolDataPathFiles.update_files function, which locates all the loc files and is executed whenever calls to ToolDataPathFiles.exists are more than 1 second apart. exists is called multiple times (I think independent of absolute/relative path names, see here and here) during the initialization of Galaxy. I guess even on fast file systems this will lead to multiple calls to the update function during startup.
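The pattern being described is essentially a time-based cache along these lines (a simplified sketch of the mechanism; the real attribute names and threshold may differ):

import os
import time

class ToolDataPathFilesSketch:
    """Illustrative file-listing cache that is refreshed at most once per interval."""

    def __init__(self, tool_data_path, update_interval=1.0):
        self.tool_data_path = tool_data_path
        self.update_interval = update_interval  # seconds before the listing is stale
        self.update_time = 0.0
        self.paths = set()

    def update_files(self):
        # the expensive part: a full walk of tool_data_path
        self.paths = {
            os.path.join(dirpath, name)
            for dirpath, _dirnames, filenames in os.walk(self.tool_data_path)
            for name in filenames
        }
        self.update_time = time.time()

    def exists(self, path):
        # two calls more than update_interval seconds apart trigger another full walk
        if time.time() - self.update_time > self.update_interval:
            self.update_files()
        return path in self.paths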

My ideas are the following at the moment:

bernt-matthias commented 6 months ago

I was wrong about the last part. It can be controlled with galaxy_data_manager_data_path.

bernt-matthias commented 6 months ago

@natefoo does your tool_data_path contain anything other than tool data tables (.loc and .loc.sample files) and tool data table config files?

bernt-matthias commented 6 months ago

I have a branch ready that would implement a configurable update interval.

But each tool data table should be referred to by exactly one tool data table config. For each of those there seem to be at most 3 file existence checks, depending on whether the file name is given as absolute or relative (here and the following lines). So one question might be whether those 3 file existence checks can really be less efficient than an iteration over a directory containing all the files whose existence is checked (it actually contains more, e.g. unused tool data table config files for each of the shed-installed tools and, in the worst case, a lot of tool data). A quick sanity check is sketched below.
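A hypothetical micro-benchmark for that trade-off (all paths here are placeholders):

import os
import time

candidates = ["/tool-data/a.loc", "/tool-data/b.loc", "/tool-data/c.loc"]  # hypothetical

start = time.time()
results = [os.path.exists(p) for p in candidates]
print(f"3 stat calls took {time.time() - start:.6f}s -> {results}")

start = time.time()
# the alternative: list every file under the tool data dir, then test membership
listing = {
    os.path.join(dirpath, name)
    for dirpath, _dirnames, filenames in os.walk("/tool-data")
    for name in filenames
}
print(f"full walk over {len(listing)} files took {time.time() - start:.6f}s")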

It might also be an option to just update cache entries when the exists function is called (and just drop the loop).
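That option could look roughly like this (a sketch of the idea, not a patch against the branch):

import os

class LazyPathCache:
    """Memoize per-path existence checks instead of re-walking the whole tree."""

    def __init__(self):
        self._cache = {}

    def exists(self, path):
        if path not in self._cache:
            self._cache[path] = os.path.exists(path)
        return self._cache[path]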

For me, storing the tool data elsewhere (by using galaxy_data_manager_data_path) solved the problem. But I'm happy to open a PR based on the branch, or a modification of it, if someone thinks I should. Anyway, I'm happy to discuss.