J35P312 / FindSV

Structural variant pipeline
15 stars 5 forks source link

Problems with annotation #2

Closed geocarvalho closed 6 years ago

geocarvalho commented 7 years ago

Hello, I would like to know with you could help with some tips to start debug this problem, here is my log file:

N E X T F L O W  ~  version 0.23.3
Launching `FindSV_core.nf` [elegant_heisenberg] - revision: 0828787848
/media/Lucas-Braga/"sample-name".bam.bai
[warm up] executor > local
[b9/44c00d] Submitted process > TIDDIT ("sample-name".bam)
[f8/d97806] Submitted process > CNVnator ("sample-name".bam)
[a1/394b60] Submitted process > combine ("sample-name".bam)
[6b/22ac6b] Submitted process > annotate ("sample-name".bam)
WARN: Process `annotate ("sample-name".bam)` terminated with an error exit status (2)
 -- Error is ignored

the command was:

python FindSV.py --bam /media/sample/"sample-name".bam --output /media/cnv-out/sample/work-2/ --config ./FindSV.conf

my FindSV.conf:

process {
    //the executor, visit the nextflow website for more info
    executor = 'local'
    cpus = 1
    time = "1d"

    clusterOptions = {
        //your acount, you need not change this if you use local executor
        '-A local'
    }
}

params {
    //the output directory
    working_dir='/media/cnv-out'

    //----TIDDIT----------
    //the path of the TIDDIT excutable file
    TIDDIT_path='/home/lab/FindSV/TIDDIT/bin/TIDDIT'
    //minimum number of discordant pairs for calling a variant
    TIDDIT_pairs=6
    //lowest mapping quality of a discordant pair
    TIDDIT_q=10

    //---------CNVnator--------
    //path to the cnvnator executable
    CNVnator_path='/home/lab/CNVnator_v0.3.3/src/cnvnator'
    //path to the thisroot.sh script, you will find it in the bin directory of the compiled root library
    //skip this variable if rootsys already is added to path
    thisroot_path='/home/lab/root-6.08.00/bin/thisroot.sh'
    //path to the folder containing reference fasta files split per chromosmoe
    CNVnator_reference_dir_path='/home/lab/exome-ref'
    //the path to the cnvnator2vcf script
    CNVnator2vcf_path='/home/lab/CNVnator_v0.3.3/cnvnator2VCF.pl'
    //bin size of cnvnator, generally, small bin size leads to high sensitivity and worse precision, and the other way around
    CNVnator_bin_size='1000'
    //--------manta--------
    //fermikit files
    configManta='/bin/configManta.py'
    RunManta='TRUE'

    //-----internal----------
    //contig sort path, this script i located in the FindSV internal_scripts folder
    contig_sort_path='/home/lab/FindSV/internal_scripts/contigSort.py'
    clear_vep_path='/home/lab/FindSV/internal_scripts/clear_vep.py'
    cleanVCF_path='/home/lab/FindSV/internal_scripts/cleanVCF.py'
    the_annotator_path='/home/lab/FindSV/internal_scripts/the_annotator.py'
    gene_keys_dir_path='/home/lab/FindSV/gene_keys'
    frequency_filter_path='/home/lab/FindSV/internal_scripts/frequency_filter.py'
    FindSV_home='/home/lab/FindSV'

    //----------reference--------
    //path to reference fasta file, indexed using bwa, and samtools 0.19
    genome='/media/hg19/ucsc.hg19.fasta'

    //-----VEP_-------------
    //path to the vep script
    VEP_path='variant_effect_predictor.pl'

    //------SVDB---------
    //The path to the SVDB script
    SVDB_script_path='/home/lab/FindSV/SVDB/SVDB.py'
    //path to the sqlite database of SVDB
    SVDB_path='""'

    //vcf files containing known pathogenic/benign variants
    pathogenic_db_path='""'
    pathogenic_db_overlap="0.6"
    pathogenic_db_distance="50000"
    benign_db_path='""'
    benign_db_overlap="0.6"
    benign_db_distance="50000"

    //overlap to consider two variants the same
    SVDB_overlap='0.6'
    //maximum distance between two breakpoints
    SVDB_distance='10000'
    //All variants above this frequencies will be cleared from the final output vcf
    SVDB_limit='0.1'

    //-------GENMOD------------
    //the path to the gnemod ini file

    genmod_rank_model_path='/home/lab/FindSV/genmod_SV.txt'

}

trace {
    fields = 'task_id,hash,name,status,tag'
}

Another problem, I changed the RunManta variable to "TRUE" to use Manta, but it didn't work. How can I run Manta? Sorry if my problem is basic, but I can't see what is happening in the error message.

Best regards, George.

J35P312 commented 7 years ago

Hello! Sorry about the Manta issue, I noticed there was a bug making it impossible to turn on the manta module. The Manta module is still undergoing development, but it shoudl atleast run now.

If you pull the latest commit and restart your sample you should get similar output:

[2c/1cf7d3] Submitted process > Manta (cancer_sample.bam) [e0/bec4a4] Submitted process > TIDDIT (cancer_sample.bam) [7e/b1854f] Submitted process > CNVnator (cancer_sample.bam) [8f/17e966] Submitted process > combine (cancer_sample.bam) [10/744fd6] Submitted process > annotate (cancer_sample.bam)

To debug the annotation problem, you can enter the FindSV directory, from there, you should have a "work" directory, which is generated from nextflow. like this:

/home/jesperei/FindSV-1_2017/FindSV /home/jesperei/FindSV-1_2017/FindSV/work

Within the work directory, you will find a directory for each job submited, to find the annotation job directory, you can use the hash from the output log: [10/744fd6] Submitted process > annotate (cancer_sample.bam) the 10/744fd6 is part of the directory name, if you cd 10/744fd6 and complete it using tab, you will find the directory: /home/jesperei/FindSV-1_2017/FindSV/work/10/744fd64de5c4fb6d7917455afb6111 within that directory you will find an error log file: .command.err and command output file: .command.out

The .command.err file should contain the error. Feel free to post it here! Good luck!

geocarvalho commented 7 years ago

Hey! Thanks for the help. I searched for the "command.out" and "command.err" for my case, no files..

N E X T F L O W  ~  version 0.23.3
Launching `FindSV_core.nf` [elegant_heisenberg] - revision: 0828787848
/media/"sample-name"/"sample-name".bam.bai
[warm up] executor > local
[b9/44c00d] Submitted process > TIDDIT ("sample-name".bam)
[f8/d97806] Submitted process > CNVnator ("sample-name".bam)
[a1/394b60] Submitted process > combine ("sample-name".bam)
[6b/22ac6b] Submitted process > annotate ("sample-name".bam)
WARN: Process `annotate ("sample-name".bam)` terminated with an error exit status (2)
 -- Error is ignored
lab@6a8a718405b0:~/FindSV/work/6b/22ac6bdd64ddad4911768417fa46f7$ ls
"sample-name".bam  "sample-name".bam.bai  "sample-name"_CombinedCalls.vcf

Frist I reinstalled VEP using git instead conda and after I tried another run that I had the same problem but now I can run Manta:

(FindSV_env) lab@6a8a718405b0:~/FindSV$ python FindSV.py --bam /media/"sample-name"/"sample-name".bam --output /media/cnv-out/"sample-name"/work-7 --config ./FindSV.conf
DONE
Processing, please do not turn FindSV off
SAMPLE_ID:/media/"sample-name"/"sample-name".bam
N E X T F L O W  ~  version 0.23.3
Launching `FindSV_core.nf` [evil_dubinsky] - revision: efecb129cf
/media/"sample-name"/"sample-name".bam.bai
[warm up] executor > local
[b9/6b6cc9] Submitted process > TIDDIT ("sample-name".bam)
[a9/900055] Submitted process > Manta ("sample-name".bam)
[e4/4d9031] Submitted process > CNVnator ("sample-name".bam)
ls
[88/e1f6b9] Submitted process > combine ("sample-name".bam)
[80/e4be23] Submitted process > annotate ("sample-name".bam)
WARN: Process `annotate ("sample-name".bam)` terminated with an error exit status (1) -- Error is ignored
"sample-name"
FAILED:ANNOTATION
DONE
(FindSV_env) lab@6a8a718405b0:~/FindSV$ ls work/80/e4be23c77fae458402ebaff0bab67f/"sample-name".bam  "sample-name".bam.bai  "sample-name"_CombinedCalls.vcf

And still without errors' log :/ PS.: The final vcf combined is awesome! Another question: Can I use your script for exome ? Cheers, George.

J35P312 commented 7 years ago

No problems, and happy to hear that Manta is working =P.

The error log is a bit tricky, it's a hidden file:

[jesperei@milou2 FindSV]$ ls work/dd/ab0691319e81436e2a3364a9a0361a/: P4855_102.clean.dedup.bai P4855_102.clean.dedup_CombinedCalls.vcf P4855_102.clean.dedup_CombinedCalls.vcf.tmp_warnings.txt P4855_102.clean.dedup.bam P4855_102.clean.dedup_CombinedCalls.vcf.tmp_summary.html P4855_102.clean.dedup_FindSV.vcf

however, if you type ls like this: ls work/dd/ab0691319e81436e2a3364a9a0361a/.* you will find it:

[jesperei@milou2 FindSV]$ ls work/dd/ab0691319e81436e2a3364a9a0361a/.* work/dd/ab0691319e81436e2a3364a9a0361a/.command.begin work/dd/ab0691319e81436e2a3364a9a0361a/.command.out work/dd/ab0691319e81436e2a3364a9a0361a/.command.sh work/dd/ab0691319e81436e2a3364a9a0361a/.command.err work/dd/ab0691319e81436e2a3364a9a0361a/.command.run work/dd/ab0691319e81436e2a3364a9a0361a/.command.trace

work/dd/ab0691319e81436e2a3364a9a0361a/.command.log work/dd/ab0691319e81436e2a3364a9a0361a/.command.run.1 work/dd/ab0691319e81436e2a3364a9a0361a/.exitcode

work/dd/ab0691319e81436e2a3364a9a0361a/.: P4855_102.clean.dedup.bai P4855_102.clean.dedup_CombinedCalls.vcf P4855_102.clean.dedup_CombinedCalls.vcf.tmp_warnings.txt P4855_102.clean.dedup.bam P4855_102.clean.dedup_CombinedCalls.vcf.tmp_summary.html P4855_102.clean.dedup_FindSV.vcf

or try to open it directly using nano: nano work/dd/ab0691319e81436e2a3364a9a0361a/.command.err

From your pasted ls output, it seems that it is vep casuing the troubles. Could it be the VEP version? I have been using version 84, according to the log version 87 should be mostly backwards compatible, but you never now.. Maybe it is time I change the command line(I have used the same command since release 81 more or less =P).

Also, FindSV issues this command to run VEP: ${VEP_exec_file} --cache --force_overwrite --poly b -i ${vcf_file} -o ${vcf_file}.tmp --buffer_size 5 --port 3337 --vcf --per_gene --format vcf -q

if you activate the FindSV_env and run it on your combined vcf file, you should get the same results as FindSV(use the path/variable of the vepvscript instead of ${VEP_exec_file}, and replace ${vcf_file} to your vcf file). I would not recommend running it on exome data. I have been running FindSV on some some panels and extracted bam regions, I found the variants i wanted to find, but I got ugly results, CNVnator will call deletions across every region not within the panel, and TIDDITs filters will act a bit crazy. If you try using it on exome data you would need to exclude regions manually, and skip using the filter flag of the vcf.

Happy to hear that you got the combined vcf file, and good luck with the analysis! Have a nice weekend!

geocarvalho commented 7 years ago

Thanks a lot, I think it will work well now that I could see the error message. When the run finish I come back make my report about the annotation.

Again, thank you for the great work here. Have a nice carnival!

geocarvalho commented 7 years ago

Hello again! I had this problem:

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
        LANGUAGE = (unset),
        LC_ALL = (unset),
        LC_CTYPE = "pt_BR.UTF-8",
        LANG = "pt_BR.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

-------------------- EXCEPTION --------------------
MSG: ERROR: Cannot index bgzipped FASTA file with Bio::DB::Fasta

STACK Bio::EnsEMBL::Variation::Utils::FastaSequence::setup_fasta /home/lab/ensembl-vep/Bio/EnsEMBL/Variation/Utils/FastaSequence.pm:210
STACK Bio::EnsEMBL::VEP::BaseVEP::fasta_db /home/lab/ensembl-vep/modules/Bio/EnsEMBL/VEP/BaseVEP.pm:288
STACK Bio::EnsEMBL::VEP::Runner::init /home/lab/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:83
STACK Bio::EnsEMBL::VEP::Runner::run /home/lab/ensembl-vep/modules/Bio/EnsEMBL/VEP/Runner.pm:111
STACK toplevel /home/lab/ensembl-vep/vep.pl:193
Date (localtime)    = Fri Feb 24 20:44:41 2017
Ensembl API version = 87
---------------------------------------------------

I saw this conversation and I think that my reference genome is the problem, what genome are you using? My genome is from Broad (my samtools is 1.3)

Another question, what script has this code: ${VEP_exec_file} --cache --force_overwrite --poly b -i ${vcf_file} -o ${vcf_file}.tmp --buffer_size 5 --port 3337 --vcf --per_gene --format vcf -q I would like to add my reference genome (--fasta) and some maf informations (like: --af, --af_1kg, --af_exac). You think that add this maf information is unnecessary or you add some information like that (I'm considering the "frequency_filter.py" script) ? Cause I want CNVs relevants.

Best, George.

dnil commented 7 years ago

That's likely a VEP issue. Looks like something I've got when the reference genome is unindexed and located somewhere where I don't have write permissions. Fixing either should solve the problem! Cheers!

J35P312 commented 7 years ago

Hello there! We have been using hg19 the standard ucsc reference, as well as the thousand genome GRCh37 reference (with and without decoy). I have never seen such error myself! I think Daniels comment sounds reasonable.

I made a quick fix of the VEP command( It was needed anyway to allow different species etc): ${VEP_exec_file} -i ${vcf_file} -o ${vcf_file}.tmp ${params.vep_args}

where ${params.vep_args} is all the optional arguments for vep: vep_args="--cache --force_overwrite --poly b --buffer_size 5 --port 3337 --vcf --per_gene --format vcf -q"

If you pull the newest version of findSV and paste that line on line 60 of your config file, you can add or remove any vep option as you like. Like this:

original: //-----VEP_------------- //path to the vep script VEP_path='variant_effect_predictor.pl'

updated: //-----VEP_------------- //path to the vep script VEP_path='variant_effect_predictor.pl' vep_args="--cache --force_overwrite --poly b --buffer_size 5 --port 3337 --vcf --per_gene --format vcf -q"

you can also rerun the setup.py script, but this will probably take more time than pasting the line yourself =P. Are you using hg38? I think you should remove the --port 3337 then.

I have never tried the --af options, I'm not sure it will work on structural variants, but it could be worth trying! We have made our own internal database using SVDB https://github.com/J35P312/SVDB (set the SVDB_path='""' to your exported vcf database file). You could also try using the thousand genomes SV vcf, but I have found it to be not as efficient as our own samples.

Good luck! //Jesper

geocarvalho commented 7 years ago

Thanks for the help guys! My problem with vep was solved when I gunzip the genome "$HOME/.vep/homo_sapiens/86_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz", after that everything happened as expected (I don't know if was my perl ou samtools version). Nice update, now I can try new args. I tried args for population databases but didn't work for this variants. I will see SVDB and how it could be used here.

PS.: Is there any chance that your genome database be available in the future?

Thanks for the work! Best, George.

dnil commented 6 years ago

A db with frequencies for 1000 swedes is now available for download. See https://swefreq.nbis.se. You will need to register & apply for access - all clickable.. Cheers!