faircloth-lab / phyluce

software for UCE (and general) phylogenomics
http://phyluce.readthedocs.org/
Other
78 stars 49 forks source link

Missing executables from snp phasing workflow #101

Closed ericcrandall closed 6 years ago

ericcrandall commented 6 years ago

I am quite interested in phasing the SNPs in my dataset per Andermann et al. 2018 but as I've followed the instructions I found that phyluce_snp_bwa_multiple_align and phyluce_snp_phase_uces were missing from the /bin folder. I copied these from GitHub and got the first one to work, but phylum_snp_phase_uces is trying to import seqtk and not finding it.

Has this portion of the workflow been abandoned or deprecated?

brantfaircloth commented 6 years ago

The workflow is present in the repository, but not entirely integrated to the anaconda package, yet, because of some issues i've been (slowly) trying to work through. So, you may need to download seqtk and add it's location to ~/.phyluce.conf like:

[seqtk]
seqtk:$CONDA/bin/seqtk
ericcrandall commented 6 years ago

OK - it's the ~/.phyluce.conf that is throwing me. I can't find it in my home directory (even searching for invisible files), nor when I search the entire anaconda2 directory.

Thanks again for your help.

brantfaircloth commented 6 years ago

you should be able to create the file. in your $HOME.

ericcrandall commented 6 years ago

OK - I made that file with the suggested text, and it is still not finding seqtk. I have placed seqtk executables into ~/anaconda2/envs/phyluce/bin and ~/anaconda2/pkgs/phyluce-1.5.0-py27_0/bin Is there some other place I need to put it, or another config file I can edit? Sorry for the hassle.

brantfaircloth commented 6 years ago

What do the contents of the file look like (it should be at ~/.phyluce.conf)? If you used the tilda (~) in the path to seqtk, go ahead and expand the paths (or use an environment variable like seqtk:$HOME/anaconda2/envs/phyluce/bin/seqtk).

ericcrandall commented 6 years ago

Contents are just what you gave to me above.

(phyluce) M053-303-26932:~ cran5048$ pwd
/Users/cran5048
(phyluce) M053-303-26932:~ cran5048$ cat .phyluce.conf
[seqtk]
seqtk:$CONDA/bin/seqtk

I tried changing it to

[seqtk]
seqtk:/Users/cran5048/anaconda2/envs/phyluce/bin/seqtk

to no avail.

ericcrandall commented 6 years ago

Specific error is this:

(phyluce) M053-303-26932:phyluce cran5048$ phyluce_snp_phase_uces     --config ./sebova/sebova_phasing_conf.txt     --bams ./sebova/mapping     --output ./sebova/phased_reads
Traceback (most recent call last):
  File "/Users/cran5048/anaconda2/envs/phyluce/bin/phyluce_snp_phase_uces", line 26, in <module>
    from phyluce import seqtk
ImportError: cannot import name seqtk
brantfaircloth commented 6 years ago

Ok. Gotcha. This is a different issue than the binary for seqtk being found. I think what's happened is that you have the package for phyluce, but that's an older version. You added in the "binaries" for the needed functions from github by copying the files as you mentioned (i glanced over this part). But, you're not getting all of the bits, because there are also a number of other, phyluce modules that go along with those scripts, e.g. changes to phyluce/phyluce/samtools as well as the creation of phyluce/phyluce/seqtk. You don't have these files because they're not in the conda version that's currently available. I also realize I need to update what's on conda, but there are some blocking changes that I haven't entirely figured out how to work around (e.g. Trinity not being buildable on osx).

Your best bet to get all of the phyluce pieces that you need is to checkout, using git, the entire phyluce repository somewhere in your $HOME. You will probably need to alter your $PYTHONPATH to get this to work, e.g.:

export PYTHONPATH=$HOME/git/phyluce

Then, you should be able to run git using

python ~/path/to/phyluce/bin/snps/phyluce_snp_phase_uces

ericcrandall commented 6 years ago

OK - thanks that worked. Yeah, I was having difficulty installing Trinity to my OSX - glad its not just me.

But now there is a new issue. Let me know if these aren't issues you are quite ready to tackle yet. I understand this is still under development. Error looks like:

(phyluce) M053-303-26932:phyluce cran5048$ python ~/github/phyluce/bin/snps/phyluce_snp_phase_uces \
>     --config ./sebova/sebova_phasing_conf.txt \
>     --bams ./sebova/mapping \
>     --output ./sebova/phased_reads
2018-04-10 13:07:44,723 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-04-10 13:07:44,723 - phyluce_snp_phase_uces - INFO - Version: git e3e2f09
2018-04-10 13:07:44,723 - phyluce_snp_phase_uces - INFO - Argument --bams: /Users/cran5048/BaseSpace/UCE_Fish_2/phyluce/sebova/mapping
2018-04-10 13:07:44,723 - phyluce_snp_phase_uces - INFO - Argument --config: /Users/cran5048/BaseSpace/UCE_Fish_2/phyluce/sebova/sebova_phasing_conf.txt
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - Argument --conservative: False
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - Argument --cores: 1
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - Argument --log_path: None
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - Argument --output: /Users/cran5048/BaseSpace/UCE_Fish_2/phyluce/sebova/phased_reads
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - Argument --verbosity: INFO
2018-04-10 13:07:44,724 - phyluce_snp_phase_uces - INFO - ================ Starting phyluce_snp_phase_uces ================
2018-04-10 13:07:44,745 - phyluce_snp_phase_uces - INFO - Getting input filenames and creating output directories
2018-04-10 13:07:44,867 - phyluce_snp_phase_uces - INFO - ----------------- Processing sebova_sbc_R010966 -----------------
2018-04-10 13:07:44,867 - phyluce_snp_phase_uces - INFO - Phasing BAM file for sebova_sbc_R010966
2018-04-10 13:07:45,986 - phyluce_snp_phase_uces - INFO - Sorting BAM for sebova_sbc_R010966
2018-04-10 13:07:46,017 - phyluce_snp_phase_uces - INFO - Sorting BAM for sebova_sbc_R010966
2018-04-10 13:07:46,025 - phyluce_snp_phase_uces - INFO - Creating REF/ALT allele FASTQ file 0
Traceback (most recent call last):
  File "/Users/cran5048/github/phyluce/bin/snps/phyluce_snp_phase_uces", line 289, in <module>
    main()
  File "/Users/cran5048/github/phyluce/bin/snps/phyluce_snp_phase_uces", line 271, in main
    fastas = phase_bam(log, individual, reference, bam, args.output)
  File "/Users/cran5048/github/phyluce/bin/snps/phyluce_snp_phase_uces", line 99, in phase_bam
    make_cons_0 = samtools.call(log, sample, phasing_out_dir, reference, sort_phased_0, "0")
  File "/Users/cran5048/github/phyluce/phyluce/samtools.py", line 78, in call
    get_user_path("samtools", "bcftools"),
  File "/Users/cran5048/github/phyluce/phyluce/pth.py", line 29, in get_user_path
    pth = config.get(program, binary)
  File "/Users/cran5048/anaconda2/envs/phyluce/lib/python2.7/ConfigParser.py", line 618, in get
    raise NoOptionError(option, section)
ConfigParser.NoOptionError: No option 'bcftools' in section: 'samtools'
brantfaircloth commented 6 years ago

Yes, your config is now also missing this option. Here's what my ~/.phyluce.conf looks like (and most of the binaries you need should already be in place by installing the phyluce dependencies).

Also not a problem to help get working (and I apologize for difficulty). I need to rejigger for a number of things that have changed about conda and also to start using biopython for dependencies, but I don't have lots of time to do that right now (and bugaboos like missing trinity will inevitable break the workflow for some folks - so i'm also working on implementing another, hopefully better assembly option).

[abyss]
abyss:$CONDA/bin/ABYSS
abyss-pe:$CONDA/bin/abyss-pe

[bowtie]
bowtie:$CONDA/bin/bowtie

[bwa]
bwa:$CONDA/bin/bwa

[gblocks]
gblocks:$CONDA/bin/gblocks

[trimal]
trimal:$HOME/git/trimal/source/trimal

[java]
executable:java
mem:-Xmx46g
jar:$CONDA/jar
gatk:GenomeAnalysisTKLite.jar

[lastz]
lastz:$CONDA/bin/lastz

[mafft]
mafft:$CONDA/bin/mafft

[muscle]
muscle:$CONDA/bin/muscle

[raxml]
raxmlHPC-SSE3:$CONDA/bin/raxmlHPC-SSE3
raxmlHPC-PTHREADS-SSE3:$CONDA/bin/raxmlHPC-PTHREADS-SSE3

[samtools]
samtools:$CONDA/bin/samtools
bcftools:$CONDA/bin/bcftools
vcfutils:$CONDA/bin/vcfutils.pl

[seqtk]
seqtk:$HOME/bin/seqtk

[trinity]
#trinity:$HOME/bin/trinity
trinity:$HOME/src/trinityrnaseq_r2013-02-25/Trinity.pl
#max_memory:46G
kmer_coverage:2
jellyfish_memory:36G

[velvet]
velvetg:$CONDA/bin/velvetg
velveth:$CONDA/bin/velveth

#----------------
#    Advanced
#----------------

[headers]
trinity:comp\d+_c\d+_seq\d+|c\d+_g\d+_i\d+|TR\d+\|c\d+_g\d+_i\d+
velvet:node_\d+
ericcrandall commented 6 years ago

OK - that worked. Thanks for your help and all your work on the pipeline!

One last question or questions - what exactly is being phased? Did the pipeline call SNPs at any point, and if so where? My output joined_allele_sequences_all_samples.fasta is a small subset of the loci that were mapped, and most of the phased haplotypes do not have any variant sites.

If I wanted to call SNPs myself, I could start with the .bam files that are produced at the end of the mapping right? And the reference fasta would be the one produced by phyluce_assembly_get_fastas_from_match_counts?

brantfaircloth commented 6 years ago

Error generally fixed in v1.6.2 and build that will be up on bioconda shortly. To answer @ericcrandall - the SNPs are called jointly in the samtools phase part of the code, then output as two alleles. As to why you have so few phased loci, I am not sure. As I work on general fixes (and porting to python3), I'll try to generate a variety of test cases for this workflow.

Oh, and to call SNPs yourself, you use the BAM files generated during read-mapping (e.g. the bwa_multiple_align). Then, you can pick your SNP calling approach (e.g. samtools or GATK, etc.). The reference fasta would be the assemblies you choose to map your reads against.

gushiro commented 4 years ago

I do not get how the reads are called if the coverage is low. Let's say the coverage is 2X, each read with a different mutation for the same site, what would be the nucleotide in the resulting site of the loci?

brantfaircloth commented 4 years ago

There are several descriptions on google regarding how samtools phases - please search there.