WenjinGudaisy / SearcHPV

An HPV integration point detection tool for targeted capture sequencing data
MIT License
4 stars 2 forks source link

Empty outputs generated #1

Open Ismail-Curie opened 2 years ago

Ismail-Curie commented 2 years ago

Hello,

I get empty files HPVfusionPointContig and ContigSequence.fa.

Also, you left your paths to the tools in the scripts and some of the tools you are using are very old versions (GATK, Pear).

Thanks, Ismail

WenjinGudaisy commented 2 years ago

Hi Ismail,

Thanks for letting me know the issue! A latest version with a conda environment setting will be uploaded soon this week. I'll let you know when it's available.

Thanks, Wenjin

WenjinGudaisy commented 2 years ago

Hi,

I've updated the latest version and we can set up a conda environment. Please try again and let me know if there are any issues.

Thanks, Wenjin

Ismail-Curie commented 2 years ago

thank you @WenjinGudaisy, I have the same issue reported by jonbra in issues 2 and 3.

WenjinGudaisy commented 2 years ago

Hi,

I just updated and fixed this problem. Please use the latest version of "environment.yml" file.

Thanks, Wenjin

Turstein commented 2 years ago

Hi! I tried the latest version and everything seems fine, except the output files are empty. I guess this would make sense if I don't have any virus integration, or could there be another explanation? Do you have an "integration-positive" test file?

Ismail-Curie commented 2 years ago

Hello,

The latest version works without any errors, but I still get empty outputs, even if I am sure of the existence of some virus integration sites. I've tried other tools that confirm that.

Thank you, Ismail

WenjinGudaisy commented 2 years ago

Hi,

Do you have any results under the second step folder "call_fusion"? If there are integration sites, the file "[name of virus].genome_fusion.txt" should have some integration sites, like these:

chrm pos single_evidence paired_evidence 1 28126740 2 2

But if the numbers of supportive reads are lower than the cutoff, the integration sites won't be reported on the final step since this method is designed for targeted capture data which requires higher background read depth than WGS.

For the "integration positive" test file, you can use the WES cervical cancer data from TCGA cohort. For example, "TCGA-C5-A2LX" is an HPV16 positive sample and has integrations. I've called lots of integrations on it by SearcHPV.

Thanks, Wenjin

Ismail-Curie commented 2 years ago

hi @WenjinGudaisy, Thank you for your reply. My .genome_fusion.txt file is empty, I have no integration sites, I also have both targeted capture and WGS data. thank you, Ismail

WenjinGudaisy commented 2 years ago

Hi,

Did you check the bam file under first step and make sure it's correct? For the other caller that have reported some integrations, are they using WGS data or targeted capture data? May I know the name of the caller?

Thanks, Wenjin

Turstein commented 2 years ago

Thank you for the help! I only found TCGA-C5-A2LX in single read fastq so far, but I have the following in my own data in the virus.genome_fusion.txt:

chrm pos single_evidence paired_evidence chr5 169145783 1 1

Call_fusion_virus-files are empty/only headers.

I couldn't find the explanation of single_evidence/paired_evidence. What do they mean? And what is the actual cut-off based on? Number of supporting reads? How many is that and can it be adjusted?

I am sorry to Ismail for hijacking your post. I think I had the same problem, but it might have been just the number of reads that were too low, or because I named the fasta-files .fasta. I changed that to .fa. And I have been going step-by-step and made sure I had the full path in the output. Not sure if that matters: -output /home/turstein/sample

WenjinGudaisy commented 2 years ago

Hi,

You could find the explanation of single_evidence/paired_evidence in the SearcHPV manuscript “Method" part, https://pubmed.ncbi.nlm.nih.gov/34160069/. Single evidence is the split read that support the integration, paired evidence is the paired end read that support the integration. The cut-off criteria for identifying the fusion points were based on empirical practice. If there are more than 2 split reads and 2 paired end reads or split reads + paired end reads > 5, then the integration will be report and go to step3 and step4.

For your sample, I suggest you to look at that location in your bam file using IGV, etc. It large likely seems not to be a signal I guess.

Thanks, Wenjin

WenjinGudaisy commented 2 years ago

Naming the fasta file ".fasta" won't influence the results. If you didn't see any error message and there are no results files in the folders of step3 and step4, it means the number of supportive reads is too low.

Ismail-Curie commented 2 years ago

Hi @WenjinGudaisy, The other caller was SurVirus. Ismail

WenjinGudaisy commented 2 years ago

Hi Ismail,

Thanks for you reply! Just one more question, did you run SurVirus on WGS or targeted sequencing data? If there were integrations reported from WGS data but fell into intron or un-targeted region, it's likely there were no results for targeted sequencing data.

WenjinGudaisy commented 2 years ago

For the TCGA dataset, you could find the WES bam file for CESC cohort. And then you can convert the bam file to fastq files, by "samtools fastq" command to try with SearcHPV.

Turstein commented 2 years ago

Thank you! I don't have access, but I tried SRR1610997 which does have integration (this is just one line from call_fusion/HPV16.genome_fusion: "chr13 73657301 166 347". ) This looks nice in the BAM-file and the literature also supports there being an integration.

However, the assemble directory contains only three files: cap3

!/bin/bash

echo 'CAP3 done'

extractReadSequence

!/bin/bash

echo 'extract informative sequences done'

pear

!/bin/bash

And the call_fusion_virus directory contains the following files only with the text "#!/bin/bash": alignContigsToGenome alignContigsToRef alignContigsToVirus

An empty file: contigsSequence.fa

And filteredSelectedContig.pickle (1kb) and the VirusFusionPointContig which only says: "contigName genomeInsertionChromosome genomeInsertionPoint hpvInsertionPoint SiteConfidence ContigConfidence"

WenjinGudaisy commented 2 years ago

Hi,

Thanks for this information. I’ll run this sample on my end and let you know. Did you see any error message? It seems step2 worked well but there were some issues in step3.

Thanks, Wenjin

Wenjin Gu PhD candidate Department of Computational Medicine & Bioinformatics, University of Michigan, Mills Lab, Room 2055, Palmer Commons, 100 Washtenaw Ave, Ann Arbor, 48109

On Mar 14, 2022, at 8:02 PM, Turstein @.***> wrote:

Thank you! I don't have access, but I tried SRR1610997 which does have integration (this is just one line from call_fusion/HPV16.genome_fusion: "chr13 73657301 166 347". ) This looks nice in the BAM-file and the literature also supports there being an integration.

However, the assemble directory contains only three files: cap3

!/bin/bash

echo 'CAP3 done'

extractReadSequence

!/bin/bash

echo 'extract informative sequences done'

pear

!/bin/bash

And the call_fusion_virus directory contains the following files only with the text "#!/bin/bash": alignContigsToGenome alignContigsToRef alignContigsToVirus

An empty file: contigsSequence.fa

And filteredSelectedContig.pickle (1kb) and the VirusFusionPointContig which only says: "contigName genomeInsertionChromosome genomeInsertionPoint hpvInsertionPoint SiteConfidence ContigConfidence"

— Reply to this email directly, view it on GitHub https://github.com/WenjinGudaisy/SearcHPV/issues/1#issuecomment-1067425104, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKTREBCZMHYCKIVFDRWZ7JDU77HSPANCNFSM5LHYSUMQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub. You are receiving this because you were mentioned.

Turstein commented 2 years ago

Just the ones related to Picard, but they seem to fix themselves.

** NOTE: Picard's command line syntax is changing.


** For more information, please see: ** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


** The command line looks like this in the new syntax: ...

No error messages in step 3 or 4:

searcHPV -assemble -fastq1 ../SRR1610997_1.fastq -fastq2 ../SRR1610997_2.fastq -humRef ../../GRCh38p12.fa -virRef ../../HPV16A1.fa extract informative sequences done CAP3 done

searcHPV -hpvFusion -fastq1 ../SRR1610997_1.fastq -fastq2 ../SRR1610997_2.fastq -humRef ../../GRCh38p12.fa -virRef ../../HPV16A1.fa

WenjinGudaisy commented 2 years ago

Hi, I've applied SearcHPV on your datasets and it worked well. Here are my results: VirusFusionPointContig.txt: contigName genomeInsertionChromosome genomeInsertionPoint hpvInsertionPoint SiteConfidence ContigConfidence 13.74231438.Contig1 13 74231438 1871 High Confidence Site High Confidence Contig

13.74230820.Contig1 13 74230820 4523 High Confidence Site High Confidence Contig

ContigsSequence.fa:

13.74231438.Contig1 CATTATTTGGTATGAGTTTAATAAAATTTCTGCAAGGGTCTGTAATATGTTTTGTAAATTCTAAAAGCCATTTTTGGTTACAACCATTAGCAGATGCCAAAATAGGTATGTTAGATGATGCTACAGTGCCCTGTTGGAACTACATAGATGACAATTTAAGAAATGCATTGGATGGAAATTTAGTTTCTATGGATGTAAAGCATAGACCATTGGTACAACTAAAATGCCCTCCATTATTAATTACATCTAACATTAATGCTGGTACAGATTCTAGGTGGCCTTATTTACATAATAGATTGGTGGTGTTTACATTTCCTAATGAGTTTCCATTTGACGAAAACGGAAATCCAGTGTATGAGCTTAATGATAAGAACTGGAAATCCTTTTTCTCAAGGACGTGGTCCAGATTAAGTTTGCACGATGGAACAAGGGAGAGAGGGAGGAGGTGGCAGCCTCTTTTAAACAACTAGACCATGTATGAACTGGTAGAGTGAGAGCTCACTCATTATTATGAGGAGGGCACCACACCATTCATAAGGGATCCACCCCCATGACCCAAACACCTCCCAATAGGTCCCATCTCCAACTTTGGGGATCACATTTTGACAAACTATAAACCTCCAAACTATATCAACAGGCCTTAATCATTGGATGGGGCAACC

13.74230820.Contig1 TCATCCGTGCTTACAACCTGAGATACTGGGACAGGAGGCAAGTAGACAGTGGCCTCACTAGGCAGCCAAAGAGACATCTGAAAAAAAATATGGTAAACGTTTACGTCGTTTTCGTAACATGTAATAACTAGGATGTAAATAAAAGTCACCTGCATCAGCAATAATTGTATATTGTGGAGACCCTGGAACTATAGGAAGTAACGAAGGAGCTTGGTCAGTTATATTAATGGGTATATCAGGACCTGATACTAAAGGAATATTGTATGCACCACCAAAAGGAATTGTTGTATTTGCAGGAATATAACCTGATAAAGATGTAGAGGGTACATCGTGTATTTCAGAGTTTTTGCTTTAAAGGTTAGGTATTATTAATACAGTGCTGAAGAGCAGATATTTCTTTTTTTTGCTAACAGAGGAGAACTAAAACCAACTCTGAGTGAAGAAAGGAGTGTTTAAAAGGAGAGAATCACAGGCAGTCATTGTAAAAACTGGTACCTGCCTGGCTTTAAGGCTCTATACCTTGTAATATAGTATAGTATAGTAATTTAAAACATGCTACTAGTATGTTTATATTACTATAGTAC

If you would like me to help you debug on your end, you could share me the detailed file lists. And your conda package lists.

Thanks, Wenjin

Turstein commented 2 years ago

Thank you! :)

Do you mean like this? ls -l main folder: total 8787112 drwxr-xr-x 2 togf togf 4096 Mar 20 16:12 alignment drwxr-xr-x 2 togf togf 4096 Mar 20 16:12 assemble drwxr-xr-x 2 togf togf 4096 Mar 20 16:12 call_fusion drwxr-xr-x 2 togf togf 4096 Mar 20 16:28 call_fusion_virus -rw-r--r-- 1 togf togf 69194 Mar 20 14:38 hg_hpv.dict -rw-r--r-- 1 togf togf 3306436400 Mar 20 13:48 hg_hpv.fa -rw-r--r-- 1 togf togf 21521 Mar 20 14:20 hg_hpv.fa.amb -rw-r--r-- 1 togf togf 27934 Mar 20 14:20 hg_hpv.fa.ann -rw-r--r-- 1 togf togf 3252216888 Mar 20 14:20 hg_hpv.fa.bwt -rw-r--r-- 1 togf togf 20462 Mar 20 14:38 hg_hpv.fa.fai -rw-r--r-- 1 togf togf 813054201 Mar 20 14:20 hg_hpv.fa.pac -rw-r--r-- 1 togf togf 1626108448 Mar 20 14:38 hg_hpv.fa.sa

Alignment folder: total 1232932 -rw-r--r-- 1 togf togf 1256315675 Mar 20 16:12 alignment.RG.indelre.mkdup.sort.bam -rw-r--r-- 1 togf togf 5279184 Mar 20 16:12 alignment.RG.indelre.mkdup.sort.bam.bai -rw-r--r-- 1 togf togf 3748 Mar 20 16:10 alignment.RG.indelre.mkdup.txt -rw-r--r-- 1 togf togf 888993 Mar 20 16:01 alignment.RG.intervals -rwxr-xr-x 1 togf togf 345 Mar 20 13:48 alignment.sh -rw-r--r-- 1 togf togf 1096 Mar 20 13:48 indel.alignment.sh -rw-r--r-- 1 togf togf 243 Mar 20 13:48 index.sh -rw-r--r-- 1 togf togf 771 Mar 20 13:48 mkdup.alignment.sh -rw-r--r-- 1 togf togf 614 Mar 20 13:48 orignal.alignment.sh -rw-r--r-- 1 togf togf 573 Mar 20 13:48 rm_inter.sh

Assemble folder: total 12 -rwxr-xr-x 1 togf togf 28 Mar 20 16:20 cap3.sh -rwxr-xr-x 1 togf togf 53 Mar 20 16:20 extractReadSequence.sh -rwxr-xr-x 1 togf togf 12 Mar 20 16:20 pear.sh

call_fusion folder: total 8 -rw-r--r-- 1 togf togf 0 Mar 20 16:12 HPV16A1.all.clustered.result -rw-r--r-- 1 togf togf 0 Mar 20 16:12 HPV16A1.all.filtered.clustered.result -rw-r--r-- 1 togf togf 0 Mar 20 16:12 HPV16A1.all.result -rw-r--r-- 1 togf togf 1711 Mar 20 16:12 HPV16A1.genome_fusion.sort.txt -rw-r--r-- 1 togf togf 1711 Mar 20 16:12 HPV16A1.genome_fusion.txt

call_fusion_virus folder: total 20 -rw-r--r-- 1 togf togf 0 Mar 20 16:28 ContigsSequence.fa -rw-r--r-- 1 togf togf 108 Mar 20 16:28 VirusFusionPointContig.txt -rwxr-xr-x 1 togf togf 12 Mar 20 16:28 alignContigsToGenome.sh -rwxr-xr-x 1 togf togf 12 Mar 20 16:28 alignContigsToRef.sh -rwxr-xr-x 1 togf togf 12 Mar 20 16:28 alignContigsToVirus.sh -rw-r--r-- 1 togf togf 6 Mar 20 16:28 filteredSelectedContig.pickle

Conda list:

packages in environment at /home/togf/miniconda3/envs/searcHPV:

#

Name Version Build Channel

_libgcc_mutex 0.1 main _openmp_mutex 4.5 1_gnu _r-mutex 1.0.0 anacondar_1 argparse 1.4.0 pypi_0 pypi blas 1.0 mkl bwa 0.7.15 1 bioconda bzip2 1.0.8 h7b6447c_0 ca-certificates 2021.10.26 h06a4308_2 cairo 1.14.12 h8948797_3 cap3 10.2011 h779adbc_3 bioconda certifi 2021.10.8 py37h06a4308_2 curl 7.69.1 hbc83047_0 fontconfig 2.13.1 h6c09931_0 freetype 2.11.0 h70c0345_0 fribidi 1.0.10 h7b6447c_0 gatk 3.8 hdfd78af_11 bioconda glib 2.63.1 h5a9c865_0 graphite2 1.3.14 h23475e2_0 harfbuzz 1.8.8 hffaf4a1_0 htslib 1.9 h244ad75_9 bioconda icu 58.2 he6710b0_3 intel-openmp 2021.4.0 h06a4308_3561 java-jdk 8.0.112 1 bioconda jpeg 9d h7f8727e_0 krb5 1.17.1 h173b8e3_0 libcurl 7.69.1 h20c2e04_0 libdeflate 1.3 h516909a_0 conda-forge libedit 3.1.20181209 hc058e9b_0 libffi 3.2.1 hf484d3e_1007 libgcc 7.2.0 h69d50b8_2 libgcc-ng 9.3.0 h5101ec6_17 libgomp 9.3.0 h5101ec6_17 libpng 1.6.37 hbc83047_0 libssh2 1.9.0 h1ba5d50_1 libstdcxx-ng 9.3.0 hd4cf53a_17 libtiff 4.2.0 h85742a9_0 libuuid 1.0.3 h7f8727e_2 libwebp-base 1.2.0 h27cfd23_0 libxcb 1.14 h7b6447c_0 libxml2 2.9.12 h03d6c58_0 lz4-c 1.9.3 h295c915_1 mkl 2021.4.0 h06a4308_640 mkl-service 2.4.0 py37h7f8727e_0 mkl_fft 1.3.1 py37hd3c417c_0 mkl_random 1.2.2 py37h51133e4_0 ncurses 6.1 he6710b0_1 numpy 1.21.2 py37h20f2e39_0 numpy-base 1.21.2 py37h79a1101_0 openjdk 8.0.112 zulu8.19.0.1_3 conda-forge openssl 1.1.1m h7f8727e_0 pandas 1.2.4 py37h2531618_0 pango 1.42.4 h049681c_0 pcre 8.45 h295c915_0 pear 0.9.6 h36cd882_7 bioconda picard 2.23.8 0 bioconda pip 21.2.2 py37h06a4308_0 pixman 0.40.0 h7f8727e_1 pysam 0.15.3 py37hbcae180_3 bioconda python 3.7.3 h0371630_0 python-dateutil 2.8.2 pyhd3eb1b0_0 pytz 2021.3 pyhd3eb1b0_0 r 3.2.2 0 r-base 3.2.2 0 r-bitops 1.0_6 r3.2.2_1a r-boot 1.3_17 r3.2.2_0a r-catools 1.17.1 r3.2.2_0 bioconda r-class 7.3_14 r3.2.2_0a r-cluster 2.0.3 r3.2.2_0a r-codetools 0.2_14 r3.2.2_0a r-colorspace 1.2_6 r3.2.2_0a r-dichromat 2.0_0 r3.2.2_2a r-digest 0.6.9 r3.2.2_0 bioconda r-foreign 0.8_66 r3.2.2_0a r-gdata 2.17.0 r3.2.2_0a r-ggplot2 2.1.0 r3.2.2_0 bioconda r-gplots 2.17.0 r3.2.2_0a r-gsalib 2.1 r3.2.2_0 bioconda r-gtable 0.1.2 r3.2.2_2a r-gtools 3.5.0 r3.2.2_0a r-kernsmooth 2.23_15 r3.2.2_0a r-labeling 0.3 r3.2.2_1a r-lattice 0.20_33 r3.2.2_0a r-magrittr 1.5 r3.2.2_0 bioconda r-mass 7.3_45 r3.2.2_0a r-matrix 1.2_2 r3.2.2_0a r-mgcv 1.8_9 r3.2.2_0a r-munsell 0.4.2 r3.2.2_1a r-nlme 3.1_122 r3.2.2_0a r-nnet 7.3_11 r3.2.2_0a r-plyr 1.8.3 r3.2.2_0a r-rcolorbrewer 1.1_2 r3.2.2_2a r-rcpp 0.12.2 r3.2.2_0a r-recommended 3.2.2 r3.2.2_0 r-reshape 0.8.5 r3.2.2_1a r-reshape2 1.4.1 r3.2.2_1a r-rpart 4.1_10 r3.2.2_0a r-scales 0.3.0 r3.2.2_0a r-spatial 7.3_11 r3.2.2_0a r-stringi 1.0_1 r3.2.2_0a r-stringr 1.0.0 r3.2.2_0a r-survival 2.38_3 r3.2.2_0a readline 7.0 h7b6447c_5 samtools 1.9 h10a08f8_12 bioconda searchpv 1.0.10 pypi_0 pypi setuptools 58.0.4 py37h06a4308_0 six 1.16.0 pyhd3eb1b0_0 sqlite 3.31.1 h7b6447c_0 tk 8.6.11 h1ccaba5_0 wheel 0.37.1 pyhd3eb1b0_0 xz 5.2.5 h7b6447c_0 zlib 1.2.11 h7f8727e_4 zstd 1.4.9 haebb681_0

Thank you, again :) Torstein

meisproject commented 1 month ago

I have met the same case. And in mine, it seems like caused by the chromosome name in the genome file. The chromosome names are all starting with 'chr', and in the fusion calling step, there is a chromosome filter, and that is without 'chr'. So the filter result {out_dir}/{virus_chrm_file_name}.all.result is empty, and also the assemble result is empty.