bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

Issue with Varscan modify #29

Closed tyler5huang closed 7 years ago

tyler5huang commented 7 years ago

Error and change logs:

1. Traceback (most recent call last):
     File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 116, in <module>
       with genome.open_textfile(right_files[0]) as vcf:
     File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile
       return gzip.open(file_name, 'rt')
     File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 46, in open
       return GzipFile(filename, mode, compresslevel)
     File "/mnt/software/src/Python-3.2.3/Lib/gzip.py", line 156, in __init__
       raise IOError("Mode " + mode + " not supported")
   IOError: Mode rt not supported

Changed:

File "/home/huangwt/Codes/somaticseq/genomic_file_handlers.py", line 224, in open_textfile

return gzip.open(file_name, 'rt') to return gzip.open(file_name, 'r')

2. Traceback (most recent call last):
     File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in <module>
       while line_i.startswith('#'):
   TypeError: startswith first arg must be bytes or a tuple of bytes, not str

Changed:

File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 126, in <module>

while line_i.startswith('#'): to while line_i.startswith(b'#'):

3. Traceback (most recent call last):
     File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 128, in <module>
       if re.match(r'##fileformat=', line_i):
     File "/mnt/software/src/Python-3.2.3/Lib/re.py", line 153, in match
       return _compile(pattern, flags).match(string)
   TypeError: can't use a string pattern on a bytes-like object
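
The three tracebacks above appear to share one cause: text mode ('rt') in gzip.open was only added in Python 3.3, so Python 3.2.3 rejects it, and falling back to 'r' opens the file in binary mode, which yields bytes lines that code written for str cannot handle. A minimal sketch of a reader that avoids the problem on both old and new interpreters is to always open in binary mode and decode each line. The helper names iter_text_lines and read_header are hypothetical and are not part of SomaticSeq, and UTF-8 is an assumed encoding.

```python
import gzip


def iter_text_lines(path, encoding="utf-8"):
    """Yield decoded text lines from a plain or gzip-compressed file.

    Hypothetical helper for illustration; not part of SomaticSeq.
    """
    # Binary mode is accepted by every Python 3 gzip module, whereas
    # text mode ('rt') only exists from Python 3.3 onwards.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as handle:
        for raw in handle:
            # Decode here so callers always receive str, never bytes.
            yield raw.decode(encoding)


def read_header(path):
    """Collect the leading '#' lines of a VCF file as str."""
    header = []
    for line in iter_text_lines(path):
        if not line.startswith("#"):
            break
        header.append(line.rstrip("\n"))
    return header
```

Because every line reaches the caller as str, checks such as line.startswith('#') and re.match(r'##fileformat=', line) can stay unchanged for both .vcf and .vcf.gz inputs.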

litaifang commented 7 years ago

What was the file name of your VarScan's VCF file?

tyler5huang commented 7 years ago

icgc_cll-varscan-annotated.vcf.gz


litaifang commented 7 years ago

Could you send me the vcf.gz file so I can check? I'm also wondering whether the gzip library in Python 3.2.3 differs from the one in later versions.

Also, can you run Docker on your end? If so, we've just dockerized SomaticSeq: https://hub.docker.com/r/lethalfang/somaticseq/

litaifang commented 7 years ago

Actually, why don't you unpack the bgzip'ed VCF file, and see if that fixes your problem.

tyler5huang commented 7 years ago

Hi, my vcf.gz files are >500 MB each, so I created a smaller file in plain .vcf format (not vcf.gz). The error I get is this:

[huangwt@n006 Real-bcbio103-truth]$ $myCodes/SomaticSeq.Wrapper.sh
> --mutect2 $myDir/mutect.vcf
> --varscan-snv $myDir/varscan.vcf
> --vardict $myDir/vardict.vcf
> --ada-r-script ada_model_builder.R
> --truth-snv $myResults/R1.truth.snv.vcf
> --output-dir $myResults/somaticseq
--mutect2 '/mnt/projects/huangwt/wgs/smurf/test/mutect2.vcf' --varscan-snv '/mnt/projects/huangwt/wgs/smurf/test/varscan.vcf' --vardict '/mnt/projects/huangwt/wgs/smurf/test/vardict.vcf' --ada-r-script 'ada_model_builder.R' --truth-snv '/mnt/projects/huangwt/wgs/Results-SMuRF/Real-bcbio103-truth/R1.truth.snv.vcf' --output-dir '/mnt/projects/huangwt/wgs/Results-SMuRF/Real-bcbio103-truth/somaticseq' --
Traceback (most recent call last):
  File "/home/huangwt/Codes/somaticseq/modify_VJSD.py", line 129, in <module>
    while line_i.startswith(b'#'):
TypeError: startswith first arg must be str or a tuple of str, not bytes
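
This error is the mirror image of the earlier one: an uncompressed .vcf is opened as text, so its lines are str and the patched b'#' prefix no longer matches the type. A short sketch of the mismatch, together with a type-tolerant check, is given here. The function is_header is a hypothetical name used only for illustration, and UTF-8 is an assumed encoding.

```python
def is_header(line):
    """Return True for VCF header lines given as either str or bytes."""
    if isinstance(line, bytes):
        line = line.decode("utf-8")
    return line.startswith("#")


text_line = "##fileformat=VCFv4.1\n"
bytes_line = text_line.encode("utf-8")

# In Python 3, startswith() refuses to compare str with bytes in either
# direction, which is what both TypeErrors in this thread report.
for value, prefix in ((text_line, b"#"), (bytes_line, "#")):
    try:
        value.startswith(prefix)
    except TypeError as err:
        print(type(value).__name__, "->", err)
```

Normalizing to str once, at the point where the file is read, is usually simpler than making each comparison accept both types.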

HUANG Weitai


tyler5huang commented 7 years ago

Hi, I tried again with your original line, while line_i.startswith('#'): and it now fails with a different error: Error: Unable to access jarfile CombineVariants



litaifang commented 7 years ago

The script uses GATK to combine all the VCF files from different callers (i.e., GATK CombineVariants). You can point to the location of the GATK.jar file by --gatk $PATH/TO/GATK/GenomeAnalysis.jar

Alternatively, you can download the latest version 2.2.5. There, without --gatk, it'll just use cat and the vcfsorter.pl script to combine and sort those VCF files.
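
As a rough illustration of the "concatenate, then sort" idea mentioned above, the following sketch merges uncompressed VCF files and orders the records by contig and position. It is not the vcfsorter.pl script and not GATK CombineVariants, which does more than concatenate; the function name combine_and_sort_vcfs and the contig_order argument are assumptions made for this example.

```python
def combine_and_sort_vcfs(paths, contig_order):
    """Concatenate uncompressed VCFs and sort records by contig, then position.

    Hypothetical sketch: only the header of the first file is kept, and
    records found at the same site in several files are not merged.
    """
    # Map each contig name to its rank so records sort in reference order.
    rank = {name: i for i, name in enumerate(contig_order)}
    header, records = [], []
    for n, path in enumerate(paths):
        with open(path) as vcf:
            for line in vcf:
                if line.startswith("#"):
                    if n == 0:
                        header.append(line)
                else:
                    chrom, pos = line.split("\t", 2)[:2]
                    # A contig missing from contig_order raises KeyError.
                    records.append((rank[chrom], int(pos), line))
    # list.sort is stable, so ties keep the order of the input files.
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return header + [rec[2] for rec in records]
```

The returned list of lines can be written out with writelines(); the contig order would normally come from the reference genome's sequence dictionary.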

tyler5huang commented 7 years ago

I provided the path to --gatk but it returned with this error:

Picked up _JAVA_OPTIONS: -XX:+UseSerialGC

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.0-39-gd091f72):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Invalid command line: Failed to parse value null for argument referenceFile. This is most commonly caused by providing an incorrect data type (e.g. a double when an int is required)
ERROR ------------------------------------------------------------------------------------------

litaifang commented 7 years ago

I have not tried GATK version 2 before. Can you give GATK3 a try? GATK4 beta doesn't work for now.

tyler5huang commented 7 years ago

Trying with GATK3.7:

Picked up _JAVA_OPTIONS: -XX:+UseSerialGC
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 52.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
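
"Unsupported major.minor version 52.0" means the class was compiled for class-file version 52, which corresponds to Java 8, while the JVM running it is older. As a sketch of how that number is read, the snippet below parses the first eight bytes of a .class file (magic number, minor version, major version) and converts the major version to a Java release. Both function names are hypothetical, and the "major minus 44" rule is applied only to major versions 49 and above.

```python
import struct


def class_file_version(data):
    """Return (major, minor) from the first 8 bytes of a Java class file."""
    # Layout: u4 magic 0xCAFEBABE, u2 minor_version, u2 major_version,
    # all big-endian.
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


def java_release_for_major(major):
    """Map a class-file major version to a Java release (49 -> 5, 52 -> 8)."""
    if major < 49:
        raise ValueError("mapping only applied to Java 5 and later here")
    return major - 44
```

For the error above, java_release_for_major(52) gives 8, so running the same jar under a Java 8 runtime is the usual remedy.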

tyler5huang commented 7 years ago

Is there a specific GATK version I should use?

litaifang commented 7 years ago

I've tried most versions of GATK 3, including 3.7, and haven't had a problem so far. For a detailed description of how each step in the script works, see the documentation in the docs folder: https://github.com/bioinform/somaticseq/blob/master/docs/Manual.pdf

The step-by-step guide to the pipeline starts on page 4.

tyler5huang commented 7 years ago

Hi, I tried with Python 3.6 (instead of Python 3.2) and the corresponding dependencies, and was able to consolidate the calls. I also ran the R scripts to train and predict on my own, and they work as well. Thanks.