cmks / DAS_Tool


[Feature Request] Custom single copy marker gene sets #69

Open jolespin opened 3 years ago

jolespin commented 3 years ago

I want to use DAS Tool for Eukaryotic and Candidate Phyla Radiation (CPR) bin refinement. I planned on using the Protista_83 and CPR_43 marker sets.

Is this possible already? If not, can this be a feature in the future?

Would this be difficult?

jolespin commented 3 years ago

If this isn't in the cards, do you have any suggestions on how to hack the code so this will work?

cmks commented 3 years ago

Hi @jolespin, after some major code refactoring, I have started building a prototype that can handle custom marker gene sets. It still needs some more work, but I think I'll have a version ready for you to take for a spin by next week or so.

jolespin commented 3 years ago

That's awesome! More than willing to help test this out if you need some outside prototyping.

cmks commented 3 years ago

If you check out the dev_customSCG branch, you can try running DAS Tool using your own single copy gene sets. I've included a description and some examples in the README:customSCG. Note that the command line and dependencies (HMMER3 and some R packages; see README:dependencies) have changed a bit. As of now, the file formats of the SCG sets in the ANVIO repo are supported. However, I have not tested how well the bin selection performs using these sets, so you may want to verify the results for yourself.

Also, it is probably not a good idea to use a CPR-specific marker gene set, as these genes should already be covered by the default set of bacterial SCGs. Using them without, or even in combination with, the bacterial set could lead to completeness overestimation for non-CPR species.

In any case, I'm curious to hear how well the protist-bin selection works! Also, let me know if you find any bugs.

jolespin commented 3 years ago

I'm going to try it out tomorrow morning and am getting it installed right now. I made a conda environment w/ DAS_Tool and just downloaded the dev version. Which files do I need to replace to install in my environment from the repository?

Here's my command:

./DAS_Tool -i 47-Drifterexpttime4punches_S40/intermediate/binning_metabat2_output/scaffolds_to_bins.tsv, \
                47-Drifterexpttime4punches_S40/intermediate/binning_concoct_output/scaffolds_to_bins.tsv \
             -l metabat2,concoct \
             -c scaffolds.fasta \
             -o DASToolCustomScgRun02 \
             --threads 16 \
             --customDbDir ../../../db/HMMER/Bacteria_71,../../../db/HMMER/Archaea_76,../../../db/HMMER/Protista_83 \
             --useCustomDbOnly \
             --search_engine diamond

Here's the error:

(dastool_dev_env) jespinoz@jespinozlt2-osx DAS_Tool-dev_customSCG % bash cmd.sh
Error: DAS Tool

Usage:
  DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>
  DAS_Tool -i <contig2bin> -c <contigs_fasta> -o <outputbasename> [--labels=<labels>] [--proteins=<proteins_fasta>] [--threads=<threads>] [--search_engine=<search_engine>] [--score_threshold=<score_threshold>] [--dbDirectory=<dbDirectory> ] [--useCustomDbOnly] [--customDbFormat] [--customDbDir=<customDbDir>] [--megabin_penalty=<megabin_penalty>] [--duplicate_penalty=<duplicate_penalty>] [--write_bin_evals] [--create_plots] [--write_bins] [--write_unbinned] [--resume] [--debug]
  DAS_Tool [--version]
  DAS_Tool [--help]

Options:
   -i --bins=<contig2bin>                   Comma separated list of tab separated contigs to bin tables.
   -c --contigs=<contigs>                   Contigs in fasta format.
   -o --outputbasename=<outputbasename>     Basename of output files.
   -l --labels=<labels>                     Comma separated list of binning prediction names.
   --search_engine=<search_
Execution halted

cmks commented 3 years ago

You need to replace DAS_Tool in the main dir and all files in src. Also make sure to install the required R packages:

R -e "install.packages(c('data.table','magrittr','docopt','rhmmer'), repos='http://cran.us.r-project.org')"

The DAS_Tool R-package is not needed anymore.

The above error indicates an issue with the command line, e.g. a misspelled option. Unfortunately, there is a bug in the current version of the command-line parser R package, which does not return any information about what the problem is. However, your command looks good to me. Can you check that you're actually using the dev_customSCG branch and not the dev branch? The latter does not have the options --customDbDir and --useCustomDbOnly.
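
One more thing that may be worth ruling out, offered as an assumption rather than a confirmed diagnosis: in the command posted earlier, the -i value has a space and a line continuation after the comma, so the shell hands over the two tables as separate arguments instead of one comma-separated list. A minimal shell sketch of that behaviour, using a hypothetical helper `count_args` and placeholder file names (neither is part of DAS Tool):

```shell
# Hypothetical helper: print how many arguments the shell actually passes.
count_args() { echo "$#"; }

# A space (plus line continuation) after the comma splits the list in two.
split=$(count_args a.tsv, \
    b.tsv)

# With no whitespace around the comma, the list stays a single argument.
joined=$(count_args a.tsv,b.tsv)

echo "with space: $split, without: $joined"
```

If this turns out to be relevant, writing the two paths back to back with only a comma between them keeps the -i value as one argument.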

jolespin commented 2 years ago

Is this production ready?

cmks commented 2 years ago

Not quite. There are still a few bugs to be fixed and tests to be run. The plan is to merge by the end of Feb.

AmaliT commented 1 month ago

Hi, Is there an update on this?