Adopt & adapt ODoSE pipeline

timtebeek commented 7 years ago

Hi all,

As discussed on the mailing list some time ago, we kindly request your help in making the ODoSE pipeline easily available to more users. We've discussed this outside the mailing list with @bgruening, and he proposed creating an issue with a follow-up pull request to track the adoption of ODoSE into tools-iuc and the changes that requires.

For a bit of background: ODoSE stands for Ortholog Direction of Selection Engine, and is available on these URLs:

http://www.odose.nl/ (frontend)
https://github.com/ODoSE/odose.nl (code)
https://github.com/ODoSE/galaxy (galaxy wrappers)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062447 (paper)
http://journals.plos.org/plosone/article/figure/image?size=large&download=&id=10.1371/journal.pone.0062447.g001 (diagram)

Roughly the functionality comes down to this:

select/upload ~3 to 20 bacterial genomes
blast all-vs-all
determine single copy orthologs with OrthoMCL
concatenate all orthologs to determine clades (A vs B)
filter orthologs on a number of criteria
run a few tools for every ortholog
output tool values and statistics to CSV
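
To make the "single copy orthologs" step above concrete, here is a minimal sketch of the selection logic. It is not the ODoSE code: it assumes the ortholog groups arrive as OrthoMCL-style lines of the form `group_id: taxon|gene taxon|gene ...`, and the function names `parse_groups` and `single_copy_orthologs` are made up for illustration.

```python
from collections import Counter


def parse_groups(lines):
    """Parse OrthoMCL-style lines: 'group_id: taxon|gene taxon|gene ...'.

    Assumes every member has the 'taxon|gene' form.
    """
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, _, members = line.partition(":")
        groups[group_id.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups


def single_copy_orthologs(groups, genomes):
    """Keep only groups with exactly one gene from each of the given genomes."""
    wanted = set(genomes)
    selected = {}
    for group_id, members in groups.items():
        counts = Counter(taxon for taxon, _gene in members)
        # Every genome must be present, none may be extra, none duplicated.
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            selected[group_id] = dict(members)
    return selected
```

Groups with a paralog in any genome, or with a genome missing, are dropped, so every remaining group contributes exactly one sequence per genome to the later concatenation and per-ortholog steps.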

This project was started in 2011, and as such precedes quite a few of the Galaxy enhancements since. It:

still uses zips-of-files rather than multi file inputs / collections,
does not yet use repository dependency management; instead relying on system-wide installation of (fairly) common packages
does not have tests in the xml tool wrappers yet
does not use exit codes to signal failures, as that wasn't yet an option
uses external tools from various sources (codeml, orthomcl, blast, phipack and more)

In addition to the above challenges, there have also been a few updates to the way NCBI offers its genomic contents for download, so that part of the pipeline will also need revision. I've also seen that there are many more tools and interesting combinations already on hand in Galaxy nowadays, for instance for the blast-all-vs-all step in our pipeline, that could hopefully replace parts of what we now have.

One issue with the adoption is that while I am the original developer, and would love to help out where possible, I can only do so in my limited spare time. So any help in reducing the workload would be very welcome, be it pointing to the right resources or examples, suggesting shortcuts, or even taking on coding/wrapping duties.

Any help is much appreciated! Best, Tim te Beek + Michiel Vos

timtebeek commented 7 years ago

Some suggestions that we've received already:

bgruening commented 7 years ago

Hi @timtebeek! Thanks for writing this down. Would you like to submit what you have as a PR or should we just take it? Any interest to learn conda packaging? This would be the first step.

Merry Christmas!

timtebeek commented 7 years ago

Hi Björn, Best of wishes for the new year!

I'd not yet put any files in a pull request, as I assumed the code would need quite a few changes as is, and I didn't want to pollute your git history too much. You're welcome to take on what you want already, or we can work on it in the separate repository before we pull it into this one. Whatever you think is best!

timtebeek commented 7 years ago

@bgruening how can we best restart/continue this effort?

bgruening commented 7 years ago

@timtebeek I looked at it in more detail a few weeks ago and it seems like more work. I started with a conda package for the moment.

timtebeek commented 7 years ago

Great to hear you're looking into it! :) Anything I can help out with for the moment?

timtebeek commented 7 years ago

Hi @bgruening, interesting to read the discussion in #1230; I'd failed to pick up on that as GitHub doesn't notify for issue mentions. Thanks for advocating for not-yet-perfect tools ;) For this particular pipeline I'd also be more than willing to adopt a template external repository if that makes it almost as easy to install in a newer Galaxy instance.

To the point for this project: I think I could more easily kickstart development if there were an easy way to select / download bacterial genomes from NCBI again, in a non-hacky, non-breaking way. I'd already been referred to https://github.com/kblin/ncbi-genome-download, which seems perfect for the job initially. All it would need is proper wrapping in a 2017-style Galaxy XML tool with proper output. A collection of DNA FASTA files maybe? With that in place I could try to align all the downstream tools, making better use of Galaxy tools added in the past five years.

Is the above genome selection tool something you could help with? I could see this having broader applicability than the pipeline as a whole, so it might more easily pass the bar for inclusion. Hope to hear from you!

andrewsanchez commented 7 years ago

https://github.com/andrewsanchez/NCBITK makes downloading bacterial genomes effortless and quick. I run a cron job that keeps me up to date with the latest assembly versions. It has tons of room for improvement but gets the job done for now.

HassanAmr commented 7 years ago

Hi @timtebeek,

ncbi-genome-download is available in bioconda and can be downloaded using conda install -c bioconda ncbi-genome-download.

It can also be used directly in a Galaxy tool as follows:

<requirements>
    <requirement type="package" version="0.2.4">ncbi-genome-download</requirement>
</requirements>

Then simply follow its usage instructions in the command section. If you have a project up already, I would be happy to contribute to it. If not, I can start one and communicate with you directly regarding how you wish to proceed with it.

timtebeek commented 7 years ago

@HassanAmr Awesome thanks! I'd wanted to look into it this weekend, but it's hard to find the time these days.. The current project is up here: https://github.com/ODoSE

I'm not yet entirely sure what you mean by:

Then simply follow its usage instructions in the command section.

What's needed most right now, I think, is an easy way for users to select 5~20 genomes from a list. In the old days I used dynamic_options combined with parsing an NCBI FTP file, but that has both been deprecated in Galaxy and broken because NCBI updated their FTP layout.

I'm not entirely sure what the best user interface would be at this point: Ideally users will select one or two groups of genomes from NCBI, optionally add in some of their own DNA FASTA files, and start off the pipeline with that. I think it would be best to adopt the new Galaxy multi-file-output format. From there on we/I can try to recompose the workflow making better use of now-provided Galaxy tools.

Is any of this something you could help out with? I'll try to get the original investigator to respond here with regard to the best way to select genomes for his use case. Looking at ncbi-genome-download, the options are by user-provided name, NCBI species taxonomy ID, NCBI taxonomy ID, and possibly, like before, parsing a full list of NCBI genomes from taxdump.tar.gz for instance and showing text selection boxes based on that. Any comment here Michiel?
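
As a rough illustration of how those selection modes could map onto a command line, here is a minimal sketch. The helper `build_download_args` is hypothetical, and the short flags (`-F` format, `-o` output folder, `-g` genus, `-T` species taxonomy ID, `-t` taxonomy ID, followed by the section name) are assumptions based on the ncbi-genome-download README; they should be checked against `ncbi-genome-download --help` for the installed version.

```python
def build_download_args(section="bacteria", genus=None, species_taxid=None,
                        taxid=None, output_dir="genomes"):
    """Build an argument list for one selection mode.

    Flag names are assumptions to verify against the installed version.
    """
    selectors = [("-g", genus), ("-T", species_taxid), ("-t", taxid)]
    chosen = [(flag, str(value)) for flag, value in selectors if value is not None]
    # Refuse to run without a filter, and refuse ambiguous combinations.
    if len(chosen) != 1:
        raise ValueError("choose exactly one of genus, species_taxid or taxid")
    flag, value = chosen[0]
    return ["ncbi-genome-download", "-F", "fasta", "-o", output_dir,
            flag, value, section]
```

Returning a list rather than a shell string means the result can be handed to `subprocess.run` without quoting concerns, and requiring exactly one selector keeps an empty form from turning into a request for a whole NCBI section.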

Edit: Turns out Michiel is on holiday until August 7th, so might take him a while to respond.

HassanAmr commented 7 years ago

@timtebeek, Yes, of course, I will help. I am currently experimenting with ncbi-genome-download in a new galaxy tool. I will update you when I manage to achieve some progress. And when Michiel is back, I can accommodate the requirements as necessary.

I meant that since ncbi-genome-download and odose are already on Bioconda, using them will be like running them in your own terminal, for example (as long as they are in the tool requirements).

timtebeek commented 7 years ago

@HassanAmr that's awesome! Should be great to have a separate tool to select genomes, as I could see that having a wider applicability than just the ODoSE pipeline. Is there anywhere I can follow along to progress on this? That way I'll get up to speed a bit more on recent Galaxy usage and best practices.

bgruening commented 7 years ago

@timtebeek did you have a chance to look at these tools: https://github.com/galaxyproject/tools-iuc/tree/master/tools/ncbi_entrez_eutils

They can download genomes, and I think we can use these before going into the ODoSE pipeline.

HassanAmr commented 7 years ago

@timtebeek, Currently, there isn't much to show. But I will share any significant progress as soon as possible. Could you please list the steps, with examples, of how the pipeline should work in both cases, with Inter-Taxon Recombination Filtering and without?

timtebeek commented 7 years ago

@bgruening I've set up a local instance with the Entrez tools and tried to use them, but so far failed to find an option to download complete genomes as coding DNA FASTA files. Perhaps I'm not using the tools right; I've no experience with Entrez to know for sure. Back in the day I just downloaded RefSeq/GenBank entries directly. Can you tell me which tools in particular you thought could be of use and how to use them?

timtebeek commented 7 years ago

@HassanAmr had you seen the galaxy workflows stored along the galaxy tools? https://github.com/ODoSE/galaxy/tree/master/workflow These give a hint as to what each step produces and how it feeds back into the next. I now regret not exporting a screenshot of the workflow before the server was migrated, as these are kind of hard to read. Maybe they can still be imported these days?

In addition there's the manual from two years ago: https://github.com/ODoSE/odose.nl/blob/master/docs/ODoSE%20manual%202015.doc And potentially some helpful graphics in: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0062447 Do you need more to help you get started?

peterjc commented 7 years ago

Entrez efetch ought to be able to return complete genomes as DNA files, but for this task something purpose-built like (the previously mentioned) https://github.com/kblin/ncbi-genome-download using ftp.ncbi.nih.gov/genomes might be more reliable? Integrating that into Galaxy might be a challenge, given that some clever use of collections for the outputs might be best.

michielvos010 commented 7 years ago

Hi all,

great to read your comments and please bear with me as a non-(bio)informatician. Our pipeline serves a simple goal: to allow microbiologists to perform a variety of population genetic tests on a set of prokaryote genomes of interest using a GUI. There are three main usage scenarios: 1) users have their own set of sequenced genomes, 2) users analyze NCBI-deposited genomes or 3) a mixture of the two. Probably scenarios 1 and 3 (users adding a reference genome to their new data) are most common.

The main problem we had was the compatibility of the odose pipeline with the (ever-changing) NCBI database. If we stick to NCBI as the reference genome repository (probably best), some way has to be devised of reliably extracting fasta files from NCBI and uploading them to odose. We had a list of all NCBI genomes where appropriate genomes could be ticked and added to the pipeline, as well as a step where users would upload their own fasta files, after which the two sets of data were analyzed together.

I am not sure which specific queries you have at this point but I am eager to help wherever I can. I will check back here or email me directly at m.vos@exeter.ac.uk. Many thanks for helping to kickstart our pipeline; I am eager to start using it again and I know many other people are also very interested!

Best Wishes, Michiel

peterjc commented 7 years ago

Versioning of NCBI genome databases is hard - we have the same problem with the NCBI BLAST databases, and gave up and just have a local mirror which is updated periodically to match the latest one.

michielvos010 commented 7 years ago

That sounds very sensible, especially when this mirror is already in place for other projects.

peterjc commented 7 years ago

@michielvos010 Exactly - we accepted a loss of reproducibility due to the impracticalities of keeping snapshot versions of the BLAST databases. At least with genomes, each NCBI genome is strictly versioned - but I'm not sure what to suggest for your use case.

michielvos010 commented 7 years ago

The associated loss of reproducibility is completely acceptable! In my layman's terms: the mirrored database would have to show up as a selectable list of genomes in odose ('hiding the annotated multi-FASTA files'). Minor point: as genomes sequenced as part of different studies can be annotated differently, we used the most recently annotated genome as a reference to 'transfer' annotations to all other genomes.

timtebeek commented 7 years ago

Hi all; Small question on the current state of affairs.

I see there are now Bioconda recipes for ncbi-genome-download and genomepy. Both appear to be able to download genomic data given the correct arguments.

Is there any (partial) tool or wrapper in Galaxy that will allow a user to enter, for example, a name or series of accessions, and that will put FASTA sequence files in the history for further analysis with ODoSE?

bgruening commented 7 years ago

@timtebeek nice find. Have you tested those? Which do you recommend? We can get a wrapper for it I think. I would like just to be sure where to spend time :)

timtebeek commented 6 years ago

Hi @bgruening, ncbi-genome-download seems a good fit for this project, although it might need a bit of input sanitization to prevent users from downloading all NCBI genomes inadvertently (or even intentionally). If it were to use a shared local cache that might not even be an issue, although the pipeline doesn't handle anything over a couple dozen genomes very well.
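
A minimal sketch of what such input checks could look like, under stated assumptions: the names `validate_genus` and `check_selection_size` are hypothetical, the cap of 24 is only a stand-in for "a couple dozen", and the name pattern is deliberately strict (a capitalized genus with an optional lowercase species epithet), so it would reject some legitimate names.

```python
import re

MAX_GENOMES = 24  # hypothetical cap standing in for "a couple dozen"

# Deliberately strict: "Genus" or "Genus species", letters only.
_NAME_RE = re.compile(r"[A-Z][a-z]+(?: [a-z]+)?")


def validate_genus(text):
    """Return the trimmed name, or raise if it is empty or looks unsafe."""
    text = text.strip()
    if not _NAME_RE.fullmatch(text):
        raise ValueError(f"not a plausible genus or species name: {text!r}")
    return text


def check_selection_size(accessions, limit=MAX_GENOMES):
    """Deduplicate the selection and reject empty or oversized requests."""
    unique = sorted(set(accessions))
    if not unique:
        raise ValueError("no genomes selected")
    if len(unique) > limit:
        raise ValueError(f"{len(unique)} genomes selected; the limit is {limit}")
    return unique
```

Rejecting empty input matters as much as rejecting odd characters here, since an unfiltered request is exactly the "download everything" case described above.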

genomepy meanwhile seems more broadly applicable and might be able to download the same NCBI genomes, although I have not tested it. Some functionality there might duplicate the various genome browsers, which provide a better user experience.

The only question mark right now is the interface to expose to the user. Previously we showed a full list of genomes, from which users could select their relevant genomes. I'm guessing @michielvos010 would prefer something similar, although we could live with entering one genus and filtering out irrelevant genomes later.

michielvos010 commented 6 years ago

Hey Tim, I am just speaking with Hassan and we agreed that he will a) check out ROARY (https://sanger-pathogens.github.io/Roary/) to lift a reciprocal blast step from and b) get started with ncbi-genome-download.

timtebeek commented 6 years ago

Looks like an awesome tool and good replacement for OrthoMCL.. so much easier to use and incorporate too! Good thing the rest of the pipeline is nice and modular, so hope it's easy to slot in there. Great work @HassanAmr !

galaxyproject / tools-iuc

Adopt & adapt ODoSE pipeline #1089