galaxyproject / planemo

Command-line utilities to assist in developing Galaxy and Common Workflow Language artifacts - including tools, workflows, and training materials.
https://planemo.readthedocs.io/
MIT License
90 stars 85 forks source link

planemo 0.70 traverses complete subtree even if tools are listed #996

Open bernt-matthias opened 4 years ago

bernt-matthias commented 4 years ago

I like to run planemo test A.xml B.xml ... because sometimes the tool directory is not clean, i.e. there is a lot of other things besides tools, test-data, etc.

Used to work until 0.62.1.

Now planemo needs a large amount of time until anything happens (strace indicates that it tries to copy to tmp) and finally stumbles over a broken symlink in my case.

Traceback (most recent call last):
  File "/home/berntm/.venv3/bin/planemo", line 8, in <module>
    sys.exit(planemo())
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/cli.py", line 190, in handle_blended_options
    return f(*args, **kwds)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/commands/cmd_test.py", line 80, in cli
    runnables = for_paths(paths, temp_path=temp_path)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 149, in for_paths
    return [for_path(path, temp_path=temp_path) for path in paths]
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 149, in <listcomp>
    return [for_path(path, temp_path=temp_path) for path in paths]
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 142, in for_path
    path = _copy_runnable_tree(path, runnable_type, temp_path)
  File "/home/berntm/.venv3/lib/python3.6/site-packages/planemo/runnable.py", line 115, in _copy_runnable_tree
    copy_tree(dir_to_copy, temp_path)
  File "/usr/lib/python3.6/distutils/dir_util.py", line 172, in copy_tree
    verbose=verbose, dry_run=dry_run))
  File "/usr/lib/python3.6/distutils/dir_util.py", line 172, in copy_tree
    verbose=verbose, dry_run=dry_run))
  File "/usr/lib/python3.6/distutils/dir_util.py", line 176, in copy_tree
    dry_run=dry_run)
  File "/usr/lib/python3.6/distutils/file_util.py", line 105, in copy_file
    "can't copy '%s': doesn't exist or not a regular file" % src)
distutils.errors.DistutilsFileError: can't copy '/home/berntm/projects/tools-galaxyp/tools/openms/OpenMS2.4-env/lib/libstdc++.so.6.0.21': doesn't exist or not a regular file

In this case I just need additional software for the automatic creation of some tools. Certainly I could put them somewhere else .. but I don't like to :)

mvdbeek commented 4 years ago

It's a tradeoff. In return this means you can symlink scripts and test data files from / to anywhere ... I'm not sure we can have both, so do you think we should revert https://github.com/galaxyproject/planemo/pull/988 ?

bernt-matthias commented 4 years ago

OK I see. Maybe I need to restructure my project then.

mvdbeek commented 4 years ago

We could also put the new behavior under a switch, so that if you're using a tools-iuc like layout you can use symlinks, while the default would be what it used to be ?

nsoranzo commented 4 years ago

I wasn't particularly happy with #988 because I feared all the copying would be a performance hit, so if anyone has ideas how to solve the symlinks issue without copying everything...

mvdbeek commented 4 years ago

Well, sure, copying the tool tree isn't free, but if you attempt uploading a structure such as that used by @bernt-matthias to the tool shed it would fail too because it is too large. For a normal structure where perhaps you got a couple of MB in test-data you can surely measure a performance hit, but that's nothing compared to starting up Galaxy.

One could replace the symlinks in place of course (prefix the syminks with something and then move them back after the run), but that seems even more likely to cause issues (what if python dies abruptly, or the prefixed file already exists?).

Staging the tool directory on the Galaxy side is another option, but ultimately you'd have the same issue in another place, so that doesn't seem like a good option either.

We could limit the tool tree copying to the first level of the runnable (so tool.xml, macro.xml, some scripts shipped with the tool if they are at the base on the runnable directory, plus the test-data folder recursively). I think I like that option the most, but we can't know if somebody perhaps keeps their scripts in $__tool_directory__/scripts/some_script.py.

bernt-matthias commented 4 years ago

How about just copying files and symlinks located in the root and dirs that are needed (test-data and datamanager rated stuff).

Limiting copying depth is also not a good idea for people that like to add structure to test-data.

Anyway, don't worry to much. I can restructure my project.

bernt-matthias commented 4 years ago

Now I restructured my project. Problem is now that starting up planemo t needs more than 30min to startup on my project (admittedly there is a lot of test-data .. 750MB).

mvdbeek commented 4 years ago

Yeah, we copy the directory for every runnable. That's an easy fix.

jmchilton commented 4 years ago

We should at very least make the behavior in #988 contingent on 1) the engine type being Galaxy, 2) the test data having a symlink outside the tool directory and 3) containers being used for tests - right?

Ugh - this is because we're calling run_tests.sh inside the container right and the tool-util client is coming from Galaxy then even though we've got it available to Planemo as a package. We could call it from outside the container like we do with ephemeris or something right? There is already a ton of progress in that direction...

I can work on some of thing - just want to make sure I understand the problem.

mvdbeek commented 4 years ago

Ugh - this is because we're calling run_tests.sh inside the container right and the tool-util client is coming from Galaxy then even though we've got it available to Planemo as a package.

I don't think that is the issue, run_tests.sh runs on the host, i've written this for the --biocontainers flag.

planemo test A.xml B.xml ...

will trigger the copying once for each tool, let me fix that and then we can see how much of a performance hit that is ? That should be fixed in any case and I'm sorry I didn't do this yet.

2) the test data having a symlink outside the tool directory and 3) containers being used for tests - right?

I guess, but is it worth breaking the isolation we get with https://github.com/galaxyproject/planemo/pull/988 ? When I looked at this I thought the only viable alternative to https://github.com/galaxyproject/planemo/pull/988 is staging with pulsar, but that seems just as expensive ?

jmchilton commented 4 years ago

I don't think that is the issue, run_tests.sh runs on the host, i've written this for the --biocontainers flag.

I see - I was wrong clearly. But Planemo can see the data - it could just upload it as files instead of test data paths. This is how it used to work - isn't there just some flag we can use to not assume the data is next to Galaxy 😆. I feel like we parameterized it that way - I'll try to figure that out.

I guess, but is it worth breaking the isolation we get with #988 ?

I don't get what you mean by isolation - why is isolation good in this context? I guess I'm worrying about like correctness issues but I can't come up with a counter example.

We will just keep iterating on the concept as it is - thanks for #1037

mvdbeek commented 4 years ago

it could just upload it as files instead of test data paths. This is how it used to work - isn't there just some flag we can use to not assume the data is next to Galaxy 😆. I feel like we parameterized it that way - I'll try to figure that out.

That would work! But now that I try to remember the context, I think this was data referenced from a tool_data_table_conf.xml.test file ... ?

I don't get what you mean by isolation - why is isolation good in this context? I guess I'm worrying about like correctness issues but I can't come up with a counter example.

I was thinking of --update_test_data and data managers writing directly to the test data dir ... which is arguably a bug.