TheJacksonLaboratory / SVE


FusorSV for GRCh38 #10

Closed: lh12565 closed this issue 5 years ago

lh12565 commented 6 years ago

Hi, I see that your truth sets use GRCh37 as the reference genome, but I mapped my reads and generated my VCF files against GRCh38. Can I still use your model (default.pickle) to detect SVs and merge calls?

Thanks! Luo Hao

lslochov commented 6 years ago

Hi, the model file's data is independent of the reference genome of the VCFs used to train the model. You should be fine to proceed with the default model file.

lslochov commented 6 years ago

If your input VCFs to FusorSV used any reference other than hg19, then you should use the development version of FusorSV, as the master branch only works with hg19. Also, not all the SV callers included with SVE will work with hg38, because the developers of those tools have not implemented that feature yet.

MaestSi commented 6 years ago

Dear lslochov, I tried the FusorSV version in the dev branch, but the output VCF file it produced was empty. This is the command I ran:

python /mnt/cifs01/simone/software/SVE/scripts/FusorSV/FusorSV.py -f /mnt/cifs01/simone/software/SVE/scripts/FusorSV/data/models/default.pickle -c chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM -r /mnt/cifs01/simone/NA12878/test_FusorSV/Homo_sapiens_assembly38.fasta -L /mnt/cifs01/simone/software/SVE/data/hg19ToHg38.over.chain.gz -i /mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files/ -p 24 -o /mnt/cifs01/simone/NA12878/test_FusorSV/FusorSV_output

And this was the standard output:

/mnt/cifs01/simone/software/miniconda2/bin/python /mnt/cifs01/simone/software/SVE/scripts/FusorSV/FusorSV.py -f /mnt/cifs01/simone/software/SVE/scripts/FusorSV/data/models/default.pickle -c chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM -r /mnt/cifs01/simone/NA12878/test_FusorSV/Homo_sapiens_assembly38.fasta -L /mnt/cifs01/simone/software/SVE/data/hg19ToHg38.over.chain.gz -i /mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files/ -p 24 -o /mnt/cifs01/simone/NA12878/test_FusorSV/FusorSV_output
no contig directory specified
using default stage id exclude list:[1, 36]
processing samples ['/mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files/NA12878'] for chroms ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
merging the svmask regions
svmask regions merged in 0.0 sec
reading, parsing, partitioning and writing sample VCFs
reading sample NA12878
finished reading 1 out of 1 samples generating 0 partitions in 4.44 sec
starting posterior estimate on partition: t=0 b=0
starting posterior estimate on partition: t=0 b=1
starting posterior estimate on partition: t=0 b=2
starting posterior estimate on partition: t=0 b=3
starting posterior estimate on partition: t=0 b=4
starting posterior estimate on partition: t=0 b=5
starting posterior estimate on partition: t=0 b=6
starting posterior estimate on partition: t=0 b=7
starting posterior estimate on partition: t=0 b=8
starting posterior estimate on partition: t=1 b=0
starting posterior estimate on partition: t=1 b=1
starting posterior estimate on partition: t=1 b=2
starting posterior estimate on partition: t=1 b=3
starting posterior estimate on partition: t=2 b=0
starting posterior estimate on partition: t=2 b=1
starting posterior estimate on partition: t=2 b=2
starting posterior estimate on partition: t=2 b=3
starting posterior estimate on partition: t=2 b=4
starting posterior estimate on partition: t=2 b=5
starting posterior estimate on partition: t=2 b=6
starting posterior estimate on partition: t=2 b=7
starting posterior estimate on partition: t=2 b=8
starting posterior estimate on partition: t=2 b=9
starting posterior estimate on partition: t=2 b=10
posterior estimate on partition: t=2 b=7 169.1 sec alpha=0.518215864037
posterior estimate on partition: t=0 b=0 176.56 sec alpha=1.0
posterior estimate on partition: t=0 b=6 175.28 sec alpha=1.0
posterior estimate on partition: t=0 b=3 177.36 sec alpha=1.0
posterior estimate on partition: t=1 b=3 175.42 sec alpha=1.0
posterior estimate on partition: t=2 b=0 176.92 sec alpha=1.0
posterior estimate on partition: t=0 b=4 180.1 sec alpha=1.0
posterior estimate on partition: t=1 b=1 180.0 sec alpha=1.0
posterior estimate on partition: t=0 b=1 184.54 sec alpha=1.0
posterior estimate on partition: t=2 b=4 181.39 sec alpha=0.52542302131
posterior estimate on partition: t=0 b=5 185.27 sec alpha=1.0
posterior estimate on partition: t=2 b=9 181.95 sec alpha=0.519560482193
posterior estimate on partition: t=0 b=2 187.68 sec alpha=1.0
posterior estimate on partition: t=0 b=8 186.24 sec alpha=1.0
posterior estimate on partition: t=2 b=2 185.34 sec alpha=0.0293786326107
posterior estimate on partition: t=1 b=0 187.01 sec alpha=3.4341838662e-05
posterior estimate on partition: t=2 b=8 185.04 sec alpha=0.502344730447
posterior estimate on partition: t=0 b=7 188.62 sec alpha=1.0
posterior estimate on partition: t=2 b=10 185.26 sec alpha=0.613613159937
posterior estimate on partition: t=2 b=1 187.75 sec alpha=1.0
posterior estimate on partition: t=2 b=5 186.95 sec alpha=0.499862231954
posterior estimate on partition: t=2 b=6 187.78 sec alpha=0.383344526714
posterior estimate on partition: t=1 b=2 191.58 sec alpha=1.0
starting posterior estimate on partition: t=2 b=11
posterior estimate on partition: t=2 b=3 190.66 sec alpha=0.34937267559
starting posterior estimate on partition: t=2 b=12
starting posterior estimate on partition: t=2 b=13
starting posterior estimate on partition: t=2 b=14
starting posterior estimate on partition: t=2 b=15
starting posterior estimate on partition: t=2 b=16
starting posterior estimate on partition: t=3 b=0
posterior estimate on partition: t=2 b=11 91.24 sec alpha=0.617913396395
starting posterior estimate on partition: t=3 b=1
posterior estimate on partition: t=2 b=12 95.53 sec alpha=0.556665831188
starting posterior estimate on partition: t=3 b=2
posterior estimate on partition: t=2 b=13 99.47 sec alpha=0.719264343745
starting posterior estimate on partition: t=3 b=3
posterior estimate on partition: t=2 b=14 98.53 sec alpha=0.733482953972
starting posterior estimate on partition: t=3 b=4
posterior estimate on partition: t=2 b=15 100.54 sec alpha=0.403955199071
starting posterior estimate on partition: t=3 b=5
posterior estimate on partition: t=2 b=16 97.95 sec alpha=0.248096526166
starting posterior estimate on partition: t=3 b=6
posterior estimate on partition: t=3 b=0 96.47 sec alpha=1.0
starting posterior estimate on partition: t=3 b=7
posterior estimate on partition: t=3 b=1 100.83 sec alpha=1.0
starting posterior estimate on partition: t=4 b=0
posterior estimate on partition: t=3 b=2 99.62 sec alpha=1.0
starting posterior estimate on partition: t=4 b=1
posterior estimate on partition: t=3 b=3 100.57 sec alpha=0.0655191344612
starting posterior estimate on partition: t=4 b=2
posterior estimate on partition: t=3 b=4 98.36 sec alpha=0.129861687173
starting posterior estimate on partition: t=4 b=3
posterior estimate on partition: t=3 b=5 97.8 sec alpha=0.128995129137
starting posterior estimate on partition: t=4 b=4
posterior estimate on partition: t=3 b=6 100.16 sec alpha=0.102543074863
starting posterior estimate on partition: t=4 b=5
posterior estimate on partition: t=3 b=7 100.56 sec alpha=0.0339046220974
starting posterior estimate on partition: t=4 b=6
posterior estimate on partition: t=4 b=0 101.12 sec alpha=1.0
starting posterior estimate on partition: t=4 b=7
posterior estimate on partition: t=4 b=1 100.34 sec alpha=1.0
posterior estimate on partition: t=4 b=2 95.76 sec alpha=1.0
starting posterior estimate on partition: t=4 b=8
starting posterior estimate on partition: t=5 b=0
posterior estimate on partition: t=4 b=3 99.06 sec alpha=1.0
posterior estimate on partition: t=4 b=4 96.19 sec alpha=1.0
starting posterior estimate on partition: t=5 b=1
posterior estimate on partition: t=4 b=5 96.97 sec alpha=1.0
starting posterior estimate on partition: t=5 b=2
posterior estimate on partition: t=4 b=6 97.35 sec alpha=1.0
starting posterior estimate on partition: t=5 b=3
posterior estimate on partition: t=4 b=7 98.43 sec alpha=1.0
starting posterior estimate on partition: t=5 b=4
posterior estimate on partition: t=4 b=8 97.38 sec alpha=1.0
starting posterior estimate on partition: t=5 b=5
posterior estimate on partition: t=5 b=0 97.38 sec alpha=1.0
starting posterior estimate on partition: t=5 b=6
posterior estimate on partition: t=5 b=1 95.11 sec alpha=1.0
starting posterior estimate on partition: t=5 b=7
posterior estimate on partition: t=5 b=2 94.47 sec alpha=1.0
starting posterior estimate on partition: t=5 b=8
posterior estimate on partition: t=5 b=3 95.04 sec alpha=1.0
posterior estimate on partition: t=5 b=4 93.25 sec alpha=0.155508169481
starting posterior estimate on partition: t=5 b=9
posterior estimate on partition: t=5 b=5 91.99 sec alpha=0.37861822495
starting posterior estimate on partition: t=5 b=10
starting posterior estimate on partition: t=5 b=11
posterior estimate on partition: t=5 b=6 92.77 sec alpha=1.0
starting posterior estimate on partition: t=5 b=12
posterior estimate on partition: t=5 b=7 92.65 sec alpha=0.117773965802
posterior estimate on partition: t=5 b=8 91.54 sec alpha=1.0
starting posterior estimate on partition: t=5 b=13
posterior estimate on partition: t=5 b=9 91.92 sec alpha=0.531572093233
starting posterior estimate on partition: t=5 b=14
posterior estimate on partition: t=5 b=10 93.34 sec alpha=0.999245007297
posterior estimate on partition: t=5 b=11 92.37 sec alpha=1.0
starting posterior estimate on partition: t=6 b=0
posterior estimate on partition: t=5 b=12 91.84 sec alpha=1.0
starting posterior estimate on partition: t=6 b=1
starting posterior estimate on partition: t=6 b=2
posterior estimate on partition: t=5 b=13 91.16 sec alpha=1.0
posterior estimate on partition: t=5 b=14 89.84 sec alpha=1.0
posterior estimate on partition: t=6 b=0 87.6 sec alpha=1.0
starting posterior estimate on partition: t=6 b=3
posterior estimate on partition: t=6 b=1 85.8 sec alpha=1.0
starting posterior estimate on partition: t=6 b=4
posterior estimate on partition: t=6 b=2 83.61 sec alpha=1.0
starting posterior estimate on partition: t=6 b=5
posterior estimate on partition: t=6 b=3 91.75 sec alpha=1.0
posterior estimate on partition: t=6 b=4 83.89 sec alpha=1.0
starting posterior estimate on partition: t=6 b=6
starting posterior estimate on partition: t=6 b=7
posterior estimate on partition: t=6 b=5 86.15 sec alpha=1.0
posterior estimate on partition: t=6 b=6 92.48 sec alpha=1.0
posterior estimate on partition: t=6 b=7 82.96 sec alpha=1.0
starting posterior estimate on partition: t=6 b=8
posterior estimate on partition: t=6 b=8 77.83 sec alpha=1.0
starting posterior estimate on partition: t=7 b=0
starting posterior estimate on partition: t=7 b=1
posterior estimate on partition: t=7 b=0 84.92 sec alpha=1.0
starting posterior estimate on partition: t=7 b=2
posterior estimate on partition: t=7 b=1 82.34 sec alpha=1.0
starting posterior estimate on partition: t=7 b=3
posterior estimate on partition: t=7 b=2 83.08 sec alpha=1.0
posterior estimate on partition: t=7 b=3 76.81 sec alpha=1.0
starting posterior estimate on partition: t=7 b=4
posterior estimate on partition: t=7 b=4 79.8 sec alpha=1.0
starting posterior estimate on partition: t=7 b=5
starting posterior estimate on partition: t=7 b=6
posterior estimate on partition: t=7 b=5 96.8 sec alpha=1.0
starting posterior estimate on partition: t=7 b=7
posterior estimate on partition: t=7 b=6 89.0 sec alpha=1.0
posterior estimate on partition: t=7 b=7 83.11 sec alpha=1.0
starting posterior estimate on partition: t=7 b=8
posterior estimate on partition: t=7 b=8 80.92 sec alpha=1.0
finished estimation in 5095.83 sec
apply fusion model to sample inputs and generating fusorSV ouput
starting fusorSV discovery on sample NA12878
loading base and posterior estimate partitions for NA12878
writing VCF for NA12878
scoring completed for NA12878 in 0.09 sec
finished reading samples in 234.32 sec
G1K-P3--------------------------------------------------------------
MetaSV--------------------------------------------------------------
BreakSeq--------------------------------------------------------------
Pindel--------------------------------------------------------------
Tigra--------------------------------------------------------------
cnMOPS--------------------------------------------------------------
CNVnator--------------------------------------------------------------
Delly--------------------------------------------------------------
GATK--------------------------------------------------------------
GenomeSTRiP--------------------------------------------------------------
Hydra--------------------------------------------------------------
Lumpy--------------------------------------------------------------
BreakDancer--------------------------------------------------------------
fusorSV--------------------------------------------------------------
run 0 in 5954.49 sec
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

As a test, I passed to -i a directory containing only VCFs from CNVnator, Lumpy and BreakDancer. If I don't run all the callers, do I need to change anything to tell FusorSV which callers I ran? Do you have any advice? P.S.: as suggested in another post, I changed "chroms = args.chrom.split(',')" to "chroms = args.chroms.split(',')" in the FusorSV.py script.

lslochov commented 6 years ago

Hi, the dev branch requires additional command line parameters that are not required on the master branch. In order to adapt FusorSV for use with reference genomes other than hg19, we made it possible to specify not just a reference genome, but also the corresponding coordinate offset map and SV mask file. These files for hg19 are provided in the data bundle with FusorSV. When the dev branch is ready to be merged into the master branch we will update the data bundle to include hg38 files. In the meantime, I can send you the hg38 files we're using in our own analysis runs. You can use them with FusorSV by modifying your command line to include:

--coor path_to_coordinate_offset_file --sv_mask path_to_sv_mask_file
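A minimal sketch of how these two extra arguments slot into a full invocation, written as a small Python helper. The flags (-f, -c, -r, -L, --coor, --sv_mask, -i, -p, -o) are the ones that appear in the commands posted in this thread; the helper names `build_fusorsv_command` and `missing_inputs` are hypothetical and not part of FusorSV.

```python
import os


def build_fusorsv_command(python_bin, fusorsv_py, model, chroms, reference,
                          liftover, coor, sv_mask, vcf_dir, out_dir, threads=24):
    """Assemble a dev-branch FusorSV command as an argument list.

    Hypothetical helper: only the flag names are taken from this thread.
    """
    return [
        python_bin, fusorsv_py,
        "-f", model,
        "-c", ",".join(chroms),
        "-r", reference,
        "-L", liftover,
        "--coor", coor,        # coordinate offset map for the chosen reference
        "--sv_mask", sv_mask,  # SV mask file for the chosen reference
        "-i", vcf_dir,
        "-p", str(threads),
        "-o", out_dir,
    ]


def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]
```

Checking `missing_inputs([...])` for the reference, liftover chain, coordinate offset and SV mask files before passing the list to `subprocess.call` can catch a mistyped path before a run that takes well over an hour.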

Please let me know the best way to send these files to you.

MaestSi commented 6 years ago

Dear lslochov, it would be great if you could send them to simone.maestri@univr.it. I am looking forward to trying it out! Thanks

MaestSi commented 6 years ago

Unfortunately, I could not get it to work. In $VCF_FILES_DIR I have a directory NA12878 containing the files NA12878_S10.vcf, NA12878_S11.vcf, NA12878_S18.vcf, NA12878_S35.vcf and NA12878_S4.vcf. This is the full command I ran, using the dev branch of SVE and the additional input files:

GENOME_VER=hg38
WORKING_DIR=/mnt/cifs01/simone/NA12878
BAM=NA12878.bam
OUTPUT_DIR=$WORKING_DIR/test_FusorSV
FusorSV_OUTPUT_DIR=$OUTPUT_DIR/FusorSV_output_VCF_dir
VCF_FILES_DIR=$OUTPUT_DIR/VCF_files
SAMPLE_DIR=$VCF_FILES_DIR/NA12878
SVE_HOME=/mnt/cifs01/simone/software/SVE
SVE=$SVE_HOME/bin/sve
FusorSV=$SVE_HOME/scripts/FusorSV/FusorSV.py
PATH=/mnt/cifs01/simone/software/miniconda2/bin:$PATH
PATH=$SVE_HOME/:$PATH
PYTHON=/mnt/cifs01/simone/software/miniconda2/bin/python
REFERENCE=Homo_sapiens_assembly38.fasta
CHROMS='chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY,chrM'
LIFTOVER_PATH=$SVE_HOME/scripts/FusorSV/data/liftover/hg19ToHg38.over.chain.gz
COOR_OFFSET=$SVE_HOME/scripts/FusorSV/data/Homo_sapiens_assembly38_coordinates.json
SV_MASK=$SVE_HOME/scripts/FusorSV/data/Homo_sapiens_assembly38.svmask.fasta_svmask.json
EXCLUDE='1,9,13,14,17,36,38'

$PYTHON $FusorSV -f $SVE_HOME/scripts/FusorSV/data/models/default.pickle -c $CHROMS -r $WORKING_DIR"/"$REFERENCE -L $LIFTOVER_PATH \
--coor $COOR_OFFSET --sv_mask $SV_MASK -i $VCF_FILES_DIR -p 24 -E $EXCLUDE -o $FusorSV_OUTPUT_DIR

And this is only a part of the output:

no contig directory specified
error parsing comma seperated list, using defaults
defaults stage exclude list is: [1, 36]
processing samples ['/mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files'] for chroms ['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22', 'chrX', 'chrY', 'chrM']
merging the svmask regions
svmask regions merged in 0.0 sec
reading, parsing, partitioning and writing sample VCFs
reading sample VCF_files
finished reading 1 out of 1 samples generating 0 partitions in 0.52 sec
[...]
finished estimation in 5716.33 sec
apply fusion model to sample inputs and generating fusorSV ouput
starting fusorSV discovery on sample VCF_files
loading base and posterior estimate partitions for VCF_files
writing VCF for VCF_files
scoring completed for VCF_files in 0.37 sec
finished reading samples in 728.5 sec
G1K-P3--------------------------------------------------------------
MetaSV--------------------------------------------------------------
BreakSeq--------------------------------------------------------------
Pindel--------------------------------------------------------------
Tigra--------------------------------------------------------------
cnMOPS--------------------------------------------------------------
CNVnator--------------------------------------------------------------
Delly--------------------------------------------------------------
GATK--------------------------------------------------------------
GenomeSTRiP--------------------------------------------------------------
Hydra--------------------------------------------------------------
Lumpy--------------------------------------------------------------
BreakDancer--------------------------------------------------------------
fusorSV--------------------------------------------------------------
run 0 in 6706.31 sec

No errors are reported, but the output VCF file is empty apart from the header. Moreover, I noticed that if I put a '/' at the end of $VCF_FILES_DIR, an error is raised ("Partition thread crash"), so I am really clueless. If you spot any discrepancies with the command you use, I could try again. Thanks
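One detail in the two logs in this thread may be relevant. The first run, with -i ending in '/', reports processing samples ['/mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files/NA12878']. This run, without the trailing '/', reports ['/mnt/cifs01/simone/NA12878/test_FusorSV/VCF_files'] and treats VCF_files itself as the sample. The sketch below shows one pattern that would produce exactly this difference. It is an assumption for illustration only, not taken from the FusorSV source, and `discover_samples` is a hypothetical name.

```python
import glob


def discover_samples(in_dir):
    """List candidate sample directories by appending '*' to the input path.

    ASSUMPTION: this mimics a naive glob(in_dir + '*'); it is not confirmed
    to be what FusorSV does internally.
    """
    return sorted(glob.glob(in_dir + "*"))
```

Under that assumption, '.../VCF_files/' expands to the sample directories inside it, while '.../VCF_files' matches only the directory itself, so the per-sample VCFs would never be read.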

lslochov commented 5 years ago

For future reference, we resolved this issue when we discovered that a small number of HLA variants in the input VCFs had malformed chromosome names—the name was split over the first 2 columns, leading to invalid "chr" and "pos" values.

For example, what should have been

HLA-A*01[tab]123456

ended up becoming

HLA[tab]A*01

A new version of FusorSV is under development that will produce helpful error messages when input VCFs are malformed.
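Until such messages are available, a short pre-flight check along these lines could flag the kind of record described above. This is a minimal sketch, assuming tab-delimited VCF text and treating any non-header line whose second column is not a plain integer as malformed. The function name `find_malformed_records` is hypothetical and not part of FusorSV.

```python
import re

# POS must consist of ASCII digits only.
_POS_PATTERN = re.compile(r"[0-9]+")


def find_malformed_records(lines):
    """Return (line_number, chrom, pos) for records whose POS is not an integer.

    Header lines (starting with '#') and blank lines are skipped.
    """
    bad = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom = fields[0]
        pos = fields[1] if len(fields) > 1 else ""
        if not _POS_PATTERN.fullmatch(pos):
            bad.append((line_no, chrom, pos))
    return bad
```

It accepts any iterable of lines, so an open file handle works directly: `with open(path) as fh: print(find_malformed_records(fh))`. For compressed input, `gzip.open(path, "rt")` can be used in place of `open`.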

lslochov commented 5 years ago

Hi Jie, for clarification purposes, are you referring to the hg38 coordinate offset map and SV mask file?

--

Lucas Lochovsky Postdoctoral Associate The Jackson Laboratory for Genomic Medicine Ten Discovery Dr Farmington, CT 06032 860-837-2155 lucas.lochovsky@jax.org www.jax.org


On Thursday, January 3, 2019 at 3:06 PM, Jie Wang wrote:

Would you please send me the two hg38 files? jessie.wangjie@gmail.com Thanks.


MateuszChilinski commented 5 years ago

@lslochov Could you please provide me with HG38 coordinate offset file as well as sv mask file? I am having very similar issues to Simone and I believe those files could help me fix it. Email - got it, thanks :-)

m081429 commented 4 years ago

Could you please provide me with HG38 coordinate offset file as well as sv mask file? napr4836@gmail.com

klitgord commented 4 years ago

Hello @lslochov, I'd also be interested in using the hg38 coordinate offset and SV mask files. Would you mind sending them to me **** or perhaps pointing me to a repo or other resource where I might fetch them? Many thanks in advance. -niels

Update: Just got them. Mateusz Chilinski was kind enough to share them with me (thank you again).