bio-raum / FooDMe2

A nextflow pipeline for the identification of species from mixed samples based on mitochondrial amplicons
https://bio-raum.github.io/FooDMe2/
GNU General Public License v3.0

Create a Bioconda package for taxIdtools #1

Closed marchoeppner closed 4 months ago

marchoeppner commented 6 months ago

Our framework must be able to provision every tool as either a Conda package or a (Docker) container. At the moment, taxidtools is only available on conda-forge. This means no container is available.

Two options:

marchoeppner commented 6 months ago

Would taxopy be a suitable alternative? https://github.com/apcamargo/taxopy

It does not require the NCBI taxonomy to be available locally and has a majority-rule function:

import taxopy

# No local NCBI taxonomy needed; TaxDb() fetches the taxdump itself
taxdb = taxopy.TaxDb()

# PARSING BLAST OUTPUT HERE
# (expected result: a pandas DataFrame `dfout` with "query" and "subject" columns)

for query, hits in dfout.groupby("query"):
    print(query)

    # The taxid is the last "_"-separated field of each subject ID
    taxa = [
        taxopy.Taxon(int(subject.split("_")[-1]), taxdb)
        for subject in hits["subject"]
    ]

    if len(taxa) > 1:
        # Majority-vote consensus across all hits of this query
        lca = taxopy.find_majority_vote(taxa, taxdb)
        print(lca.name)
    elif len(taxa) == 1:
        print(taxa[0].name)
    else:
        print("No hit")
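
To make the majority-rule idea concrete without needing the taxonomy download, here is a minimal, library-free sketch. It is not necessarily how taxopy implements `find_majority_vote`; the function name `majority_lineage`, the `fraction` parameter, and the shortened example lineages are assumptions made for illustration only.

```python
from collections import Counter


def majority_lineage(lineages, fraction=0.5):
    """Return the deepest lineage prefix backed by more than `fraction` of all hits.

    Each lineage is a list of taxon names ordered from root to leaf.
    (Hypothetical helper, not part of taxopy.)
    """
    consensus = []
    candidates = list(lineages)
    total = len(lineages)
    depth = 0
    while candidates:
        # Count the names seen at this depth among lineages still in agreement
        names = Counter(lin[depth] for lin in candidates if len(lin) > depth)
        if not names:
            break  # every remaining lineage has ended
        name, count = names.most_common(1)[0]
        if count <= total * fraction:
            break  # no name at this depth is supported by a majority
        consensus.append(name)
        candidates = [
            lin for lin in candidates if len(lin) > depth and lin[depth] == name
        ]
        depth += 1
    return consensus


# Simplified example lineages (ranks shortened for readability)
hits = [
    ["Eukaryota", "Chordata", "Mammalia", "Bovidae", "Bos", "Bos taurus"],
    ["Eukaryota", "Chordata", "Mammalia", "Bovidae", "Bos", "Bos taurus"],
    ["Eukaryota", "Chordata", "Mammalia", "Bovidae", "Bos", "Bos indicus"],
    ["Eukaryota", "Chordata", "Mammalia", "Suidae", "Sus", "Sus scrofa"],
]
print(majority_lineage(hits))
```

With these four hits, three agree down to the genus Bos but only two agree on the species, so the consensus stops at the genus.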

gregdenay commented 6 months ago

Taxopy is missing tree operations (pruning, normalization, filtering), which can have a serious impact on performance. Loading the whole taxonomy takes a few minutes each time, and pooling all samples kind of defeats the point of parallelization. The TaxidTools strategy was to load the taxdump files once, prune the tree to the minimum, and export it as JSON, which can then be loaded quickly for each sample separately. I'd prefer pushing TaxidTools to Bioconda. It shouldn't be too much work since it's a native Python package.
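
The load-once, prune, export-as-JSON strategy can be sketched roughly as follows. This is a toy illustration using only the standard library, not the TaxidTools API: the helper names (`prune`, `save_tree`, `load_tree`), the child-to-parent dictionary layout, and the heavily shortened lineage are all assumptions.

```python
import json
import os
import tempfile

# Toy child -> parent map with a few NCBI taxids; intermediate ranks are
# omitted, so this is NOT the real lineage. The root points at itself.
FULL = {
    1: 1,          # root
    2: 1,          # Bacteria
    562: 2,        # Escherichia coli
    2759: 1,       # Eukaryota
    40674: 2759,   # Mammalia
    9903: 40674,   # Bos
    9913: 9903,    # Bos taurus
    9822: 40674,   # Sus
    9823: 9822,    # Sus scrofa
}


def prune(parents, root):
    """Keep only `root` and its descendants (hypothetical helper)."""
    kept = {root: None}
    for node in parents:
        path = []
        current = node
        # Walk upwards until we reach a kept node or fall off the top
        while current is not None and current not in kept:
            path.append(current)
            nxt = parents.get(current)
            current = None if nxt == current else nxt  # stop at a self-parented root
        if current is not None:
            for n in path:
                kept[n] = parents[n]
    return kept


def save_tree(tree, path):
    with open(path, "w") as fh:
        json.dump(tree, fh)


def load_tree(path):
    # JSON object keys are strings, so convert them back to integer taxids
    with open(path) as fh:
        return {int(k): v for k, v in json.load(fh).items()}


# Done once: prune to Mammalia and cache the result
pruned = prune(FULL, 40674)
cache = os.path.join(tempfile.mkdtemp(), "mammalia.json")
save_tree(pruned, cache)

# Done per sample: load the small cached tree instead of the full taxonomy
print(sorted(load_tree(cache)))
```

The point of the design is that the expensive step runs once, while each parallel per-sample job only reads a small JSON file.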

Taxonkit (https://bioinf.shenwei.me/taxonkit/) would be a good alternative performance-wise (it's written in Go), but it is missing some functionality too.

gregdenay commented 6 months ago

https://github.com/bioconda/bioconda-recipes/pull/47556

gregdenay commented 4 months ago

After some digging it appears that it is not possible to just move a package from conda-forge to bioconda:

There are a couple of options:

If we plan to expand or modify the tools in the future, we could also deprecate the CVUA-RRW/taxidtools repo in favor of a new repo here that could be published to Bioconda.

Thoughts?

marchoeppner commented 4 months ago

Right, thanks for following up on this!

So, the issue with using GitHub Actions to build a Docker container is that we cannot (easily) guarantee that the conda package and the container are 100% the same package. At least this would be much easier if we did it via Bioconda.

Bummer about conda-forge being so difficult though. I'd be happy to just use "taxonomy" for the time being, since there we don't have to do any work to get it containerized. But I understand if you want to continue using your own tool.

In that case, I would suggest putting a GitHub action into that repo to build a companion container that is linked to each release version (rather than building it here). If you need some inspiration on how that would work: https://github.com/marchoeppner/nf-template/tree/TEMPLATE/dot_github/workflows

Basically, you need a Docker Hub project plus a Docker Hub token and username, which you configure as repository secrets in the taxidtools repo, and then slightly adjust the GitHub workflows I linked.

Happy to discuss offline, if you like.

gregdenay commented 4 months ago

docker pull gregdenay/taxidtools:2.3.1

Still have to keep an eye on the conda-forge feedstock to see if it follows PyPI releases. This might take a day or two though; I am not sure how often the automation runs.
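
One way to check this without watching the feedstock by hand is to compare the latest versions reported by the two indexes. The sketch below assumes the PyPI JSON endpoint (`info.version`) and the anaconda.org package endpoint (`latest_version`) respond as described, that the package is listed as `taxidtools` on both, and that versions are plain dotted integers; the function names are made up for illustration.

```python
import json
from urllib.request import urlopen


def pypi_version(name):
    """Latest release reported by PyPI (assumes the JSON API layout)."""
    with urlopen(f"https://pypi.org/pypi/{name}/json", timeout=30) as resp:
        return json.load(resp)["info"]["version"]


def conda_forge_version(name):
    """Latest version on the conda-forge channel (assumes a `latest_version` field)."""
    url = f"https://api.anaconda.org/package/conda-forge/{name}"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)["latest_version"]


def version_key(version):
    # Only handles plain dotted integers such as "2.3.1"
    return tuple(int(part) for part in version.split("."))


def feedstock_is_current(pypi, conda):
    """True if the conda-forge version has caught up with the PyPI version."""
    return version_key(conda) >= version_key(pypi)
```

Calling `feedstock_is_current(pypi_version("taxidtools"), conda_forge_version("taxidtools"))` would then report whether the feedstock has caught up; the two fetch functions need network access.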

gregdenay commented 4 months ago

works