adamjorr / somatic-variation

Utilities for identifying somatic variants, even in reference-less species
7 stars 7 forks source link

"Fasta parsing error, RAxML expects an alignment with" in filtervcf.sh #39

Closed AlejandroMS17 closed 6 years ago

AlejandroMS17 commented 7 years ago

Hey Adam

I am trying to run filtervcf.sh and I have two problems:

Problem 1.-

datamash: transpose input error: line 2 has 24 fields (previous lines had 25);
see --help to disable strict mode
Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization
Warning, you specified a working directory via "-w"
Keep in mind that RAxML only accepts absolute path names, not relative ones!

RAxML can't, parse the alignment file as phylip file 
it will now try to parse it as FASTA file

TOO FEW SPECIES

Solution: adding “--no-strict” in line 19

cat $GTAB | tr -d '/' | datamash transpose --no-strict | awk '{print ">"$1; for (i=2; i<NF; i++) printf $i; print "\n"'} > $SITESFA

In this correct? or should we be using the "strict" setting?

AlejandroMS17 commented 7 years ago

Problem 2.-

Option -T does not have any effect with the sequential or parallel MPI version.
It is used to specify the number of threads for the Pthreads-based parallelization
Warning, you specified a working directory via "-w"
Keep in mind that RAxML only accepts absolute path names, not relative ones!

RAxML can't, parse the alignment file as phylip file 
it will now try to parse it as FASTA file

Fasta parsing error, RAxML expects an alignment.

I am trying to get the FASTA file only, but I had problem setting a TMP directory with the script as it is. Could you please help me to get the FASTA file first and then I can try to run the RAxML step.

David ran it in steps an managed to get the FASTA file, but when he ran the RAxML step he found out this:

"there is a problem in the fasta file. It contains '.' characters (i.e. no genotype called). Something was not quite right with the filtering"

adamjorr commented 7 years ago

This script is still under construction. I think using --no-strict should be fine. I need to edit this script or add it onto the end of another one, it doesn't work well as is.

Instead of running the script, I would instead do: bcftools filter -g 50 -i 'DP <= 500 && ExcessHet <=40' input.vcf | bcftools view -m2 -M2 -v snps -o output.vcf

Then I would use the vcf2tree to run RAxML on the output.

If you want just a fasta file without the other stuff vcf2tree does, use vcf2fa

AlejandroMS17 commented 7 years ago

Hey Adam,

Thank you for your answer. I have done this part now:

bcftools filter -g 50 -i 'DP <= 500 && ExcessHet <=40' input.vcf | bcftools view -m2 -M2 -v snps -o output.vcf

It worked well.

I have been trying to run vcf2tree the whole day, but I am having issues with the Bio module from Biopython. Any advice how to properly install Biopython, I have been trying many ways, but it's not working yet. Take a look please:

amorales@dacelo:/disks/dacelo/data/analyses/RB7_data_analysis/call_variants_RB7$ bash vcf2tree_20_06_17.sh -t 40 -r raxmlHPC -i RB7_call_variants_2.vcf -o RB7_tree.nwk -g 3
Traceback (most recent call last):
  File "diploidify.py", line 12, in <module>
    from Bio import AlignIO
ImportError: No module named Bio
Number of columns:    27

amorales@dacelo:/disks/dacelo/data/analyses/RB7_data_analysis/call_variants_RB7$ python -c "import sys; print('\n'.join(sys.path))"

/home/amorales/anaconda3/lib/python35.zip
/home/amorales/anaconda3/lib/python3.5
/home/amorales/anaconda3/lib/python3.5/plat-linux
/home/amorales/anaconda3/lib/python3.5/lib-dynload
/home/amorales/anaconda3/lib/python3.5/site-packages
/home/amorales/anaconda3/lib/python3.5/site-packages/Sphinx-1.4.6-py3.5.egg
/home/amorales/anaconda3/lib/python3.5/site-packages/multiqc-0.8-py3.5.egg
/home/amorales/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg

amorales@dacelo:/disks/dacelo/data/analyses/RB7_data_analysis/call_variants_RB7$ PythonSiteDir=$(python -c "import site; site._script()" --user-site) echo $PythonSiteDir
/home/amorales/.local/lib/python3.5/site-packages
AlejandroMS17 commented 7 years ago

Hey Adam,

I installed Biopython again, but using anaconda:

amorales@dacelo:~/anaconda3/lib/python3.5/site-packages/Bio$ ls -lh
total 740K
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Affy
drwxrwxr-x 4 amorales amorales 4.0K Jun 20 16:56 Align
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 AlignIO
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Alphabet
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Application
-rw-rw-r-- 2 amorales amorales  34K Aug 25  2016 bgzf.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Blast
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 CAPS
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Cluster
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 codonalign
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Compass
-rwxrwxr-x 2 amorales amorales  47K Oct  6  2016 cpairwise2.so
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Crystal
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Data
-rw-rw-r-- 2 amorales amorales 6.0K Aug 25  2016 DocSQL.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Emboss
drwxrwxr-x 4 amorales amorales 4.0K Jun 20 16:56 Entrez
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 ExPASy
-rw-rw-r-- 2 amorales amorales  29K Aug 25  2016 File.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 FSSP
drwxrwxr-x 7 amorales amorales 4.0K Jun 20 16:56 GA
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 GenBank
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Geo
drwxrwxr-x 4 amorales amorales 4.0K Jun 20 16:56 Graphics
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 HMM
-rw-rw-r-- 2 amorales amorales 4.9K Aug 25  2016 Index.py
-rw-rw-r-- 2 amorales amorales 3.7K Aug 25  2016 __init__.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 KDTree
drwxrwxr-x 7 amorales amorales 4.0K Jun 20 16:56 KEGG
-rw-rw-r-- 2 amorales amorales 4.2K Aug 25  2016 kNN.py
-rw-rw-r-- 2 amorales amorales 4.3K Aug 25  2016 LogisticRegression.py
-rw-rw-r-- 2 amorales amorales  23K Aug 25  2016 MarkovModel.py
-rw-rw-r-- 2 amorales amorales  11K Aug 25  2016 MaxEntropy.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Medline
drwxrwxr-x 5 amorales amorales 4.0K Jun 20 16:56 motifs
-rw-rw-r-- 2 amorales amorales 7.7K Aug 25  2016 NaiveBayes.py
drwxrwxr-x 5 amorales amorales 4.0K Jun 20 16:56 NeuralNetwork
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Nexus
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 NMR
-rw-rw-r-- 2 amorales amorales  40K Aug 25  2016 pairwise2.py
-rw-rw-r-- 2 amorales amorales 8.5K Aug 25  2016 ParserSupport.py
drwxrwxr-x 4 amorales amorales 4.0K Jun 20 16:56 Pathway
drwxrwxr-x 5 amorales amorales 4.0K Jun 20 16:56 PDB
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 phenotype
drwxrwxr-x 5 amorales amorales 4.0K Jun 20 16:56 Phylo
drwxrwxr-x 7 amorales amorales 4.0K Jun 20 16:56 PopGen
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 _py3k
drwxrwxr-x 2 amorales amorales 4.0K Jun 20 16:56 __pycache__
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Restriction
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SCOP
drwxrwxr-x 7 amorales amorales 4.0K Jun 20 16:56 SearchIO
-rw-rw-r-- 2 amorales amorales  64K Aug 25  2016 SeqFeature.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SeqIO
-rw-rw-r-- 2 amorales amorales  87K Aug 25  2016 Seq.py
-rw-rw-r-- 2 amorales amorales  47K Aug 25  2016 SeqRecord.py
drwxrwxr-x 4 amorales amorales 4.0K Jun 20 16:56 Sequencing
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SeqUtils
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Statistics
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SubsMat
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SVDSuperimposer
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 SwissProt
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 TogoWS
-rw-rw-r-- 2 amorales amorales 2.9K Aug 25  2016 triefind.py
-rwxrwxr-x 2 amorales amorales  85K Oct  6  2016 trie.so
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 UniGene
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 UniProt
-rw-rw-r-- 2 amorales amorales 3.5K Aug 25  2016 _utils.py
drwxrwxr-x 3 amorales amorales 4.0K Jun 20 16:56 Wise
AlejandroMS17 commented 7 years ago

But I still has not been able to make it work:

amorales@dacelo:/disks/dacelo/data/analyses/RB7_data_analysis/call_variants_RB7$ bash vcf2tree_20_06_17.sh -t 40 -r raxmlHPC -i RB7_call_variants_2.vcf -o RB7_tree.nwk -g 3
Traceback (most recent call last):
  File "diploidify.py", line 12, in <module>
    from Bio import AlignIO
ImportError: No module named Bio
Number of columns:  27
roblanf commented 7 years ago

looks like a Python2/3 issue to me. Have a look on Slack Alejandro - I added some tips there...

closer every day...

On 20 June 2017 at 17:28, AlejandroMS17 notifications@github.com wrote:

But still I has not been able to make it work:

amorales@dacelo:/disks/dacelo/data/analyses/RB7_data_analysis/call_variants_RB7$ bash vcf2tree_20_06_17.sh -t 40 -r raxmlHPC -i RB7_call_variants_2.vcf -o RB7_tree.nwk -g 3 Traceback (most recent call last): File "diploidify.py", line 12, in from Bio import AlignIO ImportError: No module named Bio Number of columns: 27

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/adamjorr/somatic-variation/issues/39#issuecomment-309668530, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2pE761hMMnP9-9OFf70gmlmiyy66FTks5sF3SKgaJpZM4N-A1q .

-- Rob Lanfear Ecology, Evolution, and Genetics, The Australian National University, Canberra

www.robertlanfear.com

AlejandroMS17 commented 7 years ago

Rob was right, I was using Python 3.5.2 instead of Python 2.7

adamjorr commented 7 years ago

Reopening this to remind myself to fix filtervcf.sh

adamjorr commented 6 years ago

Closing and deprecating this script