Fail to do variant calling

PanZiwei commented 3 years ago

Hi, I have several issues when doing the variant calling and would really appreciate it if you can help:

I am trying to install your packages but got the error when install parallel package with pip3 install parallel/pip install parallel/pip install parallel==0.2.5

Here is the error information:

\Collecting parallel==0.2.5
  Using cached parallel-0.2.5.tar.gz (57 kB)
    ERROR: Command errored out with exit status 1:
     command: /projects/li-lab/Ziwei/Anaconda3/envs/NanoCaller/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-2t3yafw8/parallel_3f9d07c107ca42d2bba5ddd63ac5efed/setup.py'"'"'; __file__='"'"'/tmp/pip-install-2t3yafw8/parallel_3f9d07c107ca42d2bba5ddd63ac5efed/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-md6zu29q
         cwd: /tmp/pip-install-2t3yafw8/parallel_3f9d07c107ca42d2bba5ddd63ac5efed/
    Complete output (8 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-2t3yafw8/parallel_3f9d07c107ca42d2bba5ddd63ac5efed/setup.py", line 5, in <module>
        import pprocess
      File "/tmp/pip-install-2t3yafw8/parallel_3f9d07c107ca42d2bba5ddd63ac5efed/pprocess.py", line 255
        raise AcknowledgementError, obj
                                  ^
    SyntaxError: invalid syntax
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/31/5b/66966fb4d103191b7cbc92730db6a335986fbdb3d9f55cbb54b7ba87e9d4/parallel-0.2.5.tar.gz#sha256=2c9f08dac392859c83e43c9f0e38fb4b4e1516b54cdd5fda8da20a3cf5f75498 (from https://pypi.org/simple/parallel/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Could not find a version that satisfies the requirement parallel==0.2.5
ERROR: No matching distribution found for parallel==0.2.5

Any idea on this issue?

I tried to call variants with the command below:

python $nanocaller_dir/scripts/NanoCaller_WGS.py \
-bam $output_dir/data.bam \
-ref $ref_dir/GRCh37_new.fa \
-prefix data \
-o $output_dir \
-cpu $CPU \
-seq ont

But I got the error:

2021-02-02 00:09:39.582079: Starting NanoCaller.

Running arguments are saved in the following file: /fastscratch/c-panz/HL60/args

2021-02-02 00:09:39.591100: Commands for running NanoCaller on contigs in whole genome are saved in the file /fastscratch/c-panz/HL60/wg_commands
Running 300 jobs using 16 workers in parallel.

/bin/sh: parallel: command not found

ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*snps.vcf.gz: No such file or directory
Writing to /tmp/bcftools-sort.DM8H4p
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

Writing to /tmp/bcftools-sort.Qfioho
ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*snps.phased.vcf.gz: No such file or directory
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

Writing to /tmp/bcftools-sort.pGwRmr
ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*indels.vcf.gz: No such file or directory
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

Writing to /tmp/bcftools-sort.0RSJtu
ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*final.vcf.gz: No such file or directory
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

2021-02-02 00:09:39.713361: Total Time Elapsed: 0.14 seconds
Nanocaller is done!

How can I solve the problem?

In Usage.md you mentioned that "you can run NanoCaller for whole genome variant calling is to use NanoCaller.py or NanoCaller_WGS.py on each chromosome separately by setting chrom argument" with the following command:
```
for i in {1..22};do echo "python PATH_TO_NANOCALLER_REPOSITORY/scripts/NanoCaller.py
-chrom chr$i
-bam YOUR_BAM \
-ref YOUR_REF \
-prefix PREFIX \
-o OUTPUT_DIRECTORY \
-cpu 16
-seq SEQUENCING_TYPE" |qsub -V -cwd -pe smp 16 -N chr$i -e chr$i.log -o chr$i.log; done
```
However, I think qsub is SGE job? Do you have the command version for SLURM job? Can I directly replace qsub with sbatch for SLURM job?

Thank you so much for your help!

umahsn commented 3 years ago

Hi PanZiwei, thank you for bringing this to our attention.

Your second question is related to your first question. Since parallel is not installed properly, as shown by the error /bin/sh: parallel: command not found, NanoCaller_WGS.py is not able to use parallel command to call variants using multiple processors, therefore in the later steps where we use BCFtools to combine VCF files created by earlier step, we get an error because no VCF files were generated ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*snps.vcf.gz: No such file or directory. So fixing the first issue will get rid of this problem.

Regarding the error for parallel command installation, I can suggest a few fixes.

Use conda to install parallel, via the command conda install -c conda-forge parallel.
Use docker image for NanoCaller, it is tested to ensure all packages can be installed without conflict.
Use xargs command instead of parallel, which is included in most Linux distributions by default. You would only need to change line 142 in NanoCaller_WGS.py to run_cmd('''awk '{print "\\""$0"\\""}' %s | xargs -P %d -L1 bash -c''' %(os.path.join(args.output,'wg_commands'), args.cpu))

For SLURM, I believe the following would be the equivalent: sbatch -n 16 --job-name=chr$i -e %x.log -o %x.log

umahsn commented 3 years ago

Also, just wanted to mention that using NanoCaller.py in the way you mentioned in your third question does not require the parallel command. So even if you are having issues with installing parallel, NanoCaller.py can still work.

PanZiwei commented 3 years ago

Hi, Sorry to bother you again. I fixed the parallel command installation issue by using conda install -c conda-forge parallel, but I still got the error ls: cannot access /fastscratch/c-panz/HL60/intermediate_files/*/*snps.vcf.gz: No such file or directory in the log file. Any idea on this issue?

BTWI saw a similar issue https://github.com/WGLab/NanoCaller/issues/12, but I can't use docker to run the package for some reason. So I would really appreciate it if I can get help on conda env.

My wg_commands file:

python /home/c-panz/software/NanoCaller/scripts/NanoCaller.py -chrom chr1 --mode both --sequencing ont --model NanoCaller1 --min_allele_freq 0.15 --min_nbr_sites 1 --bam /fastscratch/c-panz/HL60/bam_cluster/HL60.merged.bam --ref /projects/li-lab/Ziwei/Nanopore/data/reference/hg38.fa --sample SAMPLE --mincov 8 --maxcov 160 --neighbor_threshold 0.4,0.6 --ins_threshold 0.4 --del_threshold 0.6  -cpu 1 --output /fastscratch/c-panz/HL60/NanoCaller_cluster/intermediate_files/chr1_1_10000000 -start 1 -end 10000000 -prefix chr1_1_10000000

Running command:

date
source /projects/li-lab/Ziwei/Anaconda3/etc/profile.d/conda.sh
conda activate NanoCaller

nanocaller_dir=/home/c-panz/software/NanoCaller
ref_dir=/projects/li-lab/Ziwei/eference
bam_dir=/fastscratch/c-panz/HL60/bam
output_dir=/fastscratch/c-panz/HL60/NanoCaller
# number of threads or parallel jobs. Please set this number higher or lower according to the number of processes allowed in your system.
CPU=16
# run NanoCaller
python $nanocaller_dir/scripts/NanoCaller_WGS.py \
-bam $bam_dir/HL60.merged.sorted.bam \
-ref $ref_dir/hg38.fa \
-prefix HL60 \
-o $output_dir \
-cpu $CPU \
-seq ont > $output_dir/HL60.NanoCaller.log
echo "Nanocaller is done!"

Error:

2021-03-08 23:48:01.505590: Starting NanoCaller.

Running arguments are saved in the following file: /fastscratch/c-panz/HL60/NanoCaller_cluster/args

2021-03-08 23:48:01.511063: Commands for running NanoCaller on contigs in whole genome are saved in the file /fastscratch/c-panz/HL60/NanoCaller_cluster/wg_commands
Running 300 jobs using 32 workers in parallel.

ls: cannot access /fastscratch/c-panz/HL60/NanoCaller_cluster/intermediate_files/*/*snps.vcf.gz: No such file or directory
Writing to /tmp/bcftools-sort.vgPfzh
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

"NanoCaller_cluster/HL60.NanoCaller.cluster.log" 31L, 1436C                                                                            1,0-1         Top
(NanoCaller) [c-panz@sumner031 HL60]$ ls -lh NanoCaller_cluster
total 384K
-rw-r--r--   1 c-panz jaxuser  565 Mar  8 23:48 args
-rw-r--r--   1 c-panz jaxuser   28 Mar  8 23:56 HL60.final.vcf.gz
-rw-r--r--   1 c-panz jaxuser   72 Mar  8 23:56 HL60.final.vcf.gz.tbi
-rw-r--r--   1 c-panz jaxuser   28 Mar  8 23:56 HL60.indels.vcf.gz

2021-03-08 23:48:01.505590: Starting NanoCaller.

Running arguments are saved in the following file: /fastscratch/c-panz/HL60/NanoCaller_cluster/args

2021-03-08 23:48:01.511063: Commands for running NanoCaller on contigs in whole genome are saved in the file /fastscratch/c-panz/HL60/NanoCaller_cluster/wg_commands
Running 300 jobs using 32 workers in parallel.

ls: cannot access /fastscratch/c-panz/HL60/NanoCaller_cluster/intermediate_files/*/*snps.vcf.gz: No such file or directory
Writing to /tmp/bcftools-sort.vgPfzh
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

umahsn commented 3 years ago

Hi,

I have added some functionality for detailed logging in commit 39fcb5b. Can you rerun with the updated code and see if it helps you figure out what's going wrong?

PanZiwei commented 3 years ago

@umahsn Hi, I submit my new job with the latest code and it is still running. I will let you know once it is done. BTW I found you forget to add " \ " in your Whole Genome Variant Calling part in your usage.md file, the current code will raise the error saying

NanoCaller_WGS.py: error: the following arguments are required: -bam/--bam, -ref/--ref, -prefix/--prefix
#/cm/local/apps/slurm/var/spool/job6627251/slurm_script: line 29: -bam: command not found

I fixed the error by changing the script with the below command:

python PATH_TO_NANOCALLER_REPOSITORY/scripts/NanoCaller_WGS.py \    # "\" is important here
-bam YOUR_BAM \
-ref YOUR_REF \
-prefix PREFIX \
-o OUTPUT_DIRECTORY \
-cpu NUMBER_OF_CPUS_TO_USE \   # "\" is important here
-seq SEQUENCING_TYPE

Also, in the case_study file you showed that --exclude_bed hg38 in the script, but in the usage.me it didn't.

What is --exclude_bed hg38 used for? Is it necessary in the script?

PanZiwei commented 3 years ago

Update: I still got the same error:

Writing to /tmp/bcftools-sort.lOj54d
ls: cannot access /fastscratch/c-panz/HL60/NanoCaller/intermediate_files/*/*final.vcf.gz: No such file or directory
Could not read the file: -
[E::bcf_hdr_read] Input is not detected as bcf or vcf format

VARIANT CALLING FAILED. Please check any errors printed above, or in the log files here: /fastscratch/c-panz/HL60/NanoCaller/logs.
2021-03-10 16:50:06.318414: Total Time Elapsed: 357.39 seconds

Here is the log files also(I use the chr1_1_10000000 file as an example). So it seems that there is something wrong with the coverage?

2021-03-10 16:44:10.266475: Starting NanoCaller.

Running arguments are saved in the following file: /fastscratch/c-panz/HL60/NanoCaller/intermediate_files/chr1_1_10000000/args

2021-03-10 16:44:26.339259: SNP calling started.
depth: invalid option -- 'G'

Usage: samtools depth [options] in1.bam [in2.bam [...]]
Options:
   -a                  output all positions (including zero depth)
   -a -a (or -aa)      output absolutely all positions, including unused ref. sequences
   -b <bed>            list of positions or regions
   -f <list>           list of input BAM filenames, one per line [null]
   -l <int>            read length threshold (ignore reads shorter than <int>) [0]
   -d/-m <int>         maximum coverage depth [8000]. If 0, depth is set to the maximum
                       integer value, effectively removing any depth limit.
   -q <int>            base quality threshold [0]
   -Q <int>            mapping quality threshold [0]
   -r <chr:from-to>    region
      --input-fmt-option OPT[=VAL]
               Specify a single input file format option in the form
               of OPTION or OPTION=VALUE
      --reference FILE
               Reference sequence FASTA FILE [null]

The output is a simple tab-separated table with three columns: reference name,
position, and coverage depth.  Note that positions with zero coverage may be
omitted by default; see the -a option.

2021-03-10 16:44:26.495367: Coverage=0.00x.
2021-03-10 16:44:26.495402: No coverage found for the contig chr1.

2021-03-10 16:44:26.495421: SNP calling completed for contig chr1. Time taken= 0.1562

2021-03-10 16:44:26.495453: Total Time Elapsed: 16.26 seconds

umahsn commented 3 years ago

What version of samtools are you using? Samtools >=1.10 has parameter -G for samtools depth (samtools manual) but versions <= 1.9 do not have this.

PanZiwei commented 3 years ago

I am using Samtools 1.9. So I should switch to Samtools 1.10 according to environment.yml?

umahsn commented 3 years ago

Yes, that should solve the problem.

PanZiwei commented 3 years ago

Yes, that should solve the problem.

I can't install the environment by using conda env create -f environment.yml with normal conda, it will raise a lot of Package conflicts error. Instead, I am installing the environment by manually install the package in the environment.yml by creating an environment with python 3.6, install samtools v1.10. But I can't install the specific version since it is extremely slow with a lot of effort on Solving conflicts, and finally raise the UnsatisfiableError. But everything is fine with samtools v1.9... Any other way to install NanoCaller?

conda create -n NanoCaller python=3.6
conda activate NanoCaller

conda install -y -c bioconda samtools==1.10 #can't work 
conda install  -y -c bioconda samtools==1.9 #work

umahsn commented 3 years ago

Here is a quick fix. Remove -G option from the samtools command which calculates coverage. -G option's only purpose is to include/exclude supplementary reads from coverage calculation. Removing this option doesn't change the coverage calculation by much (in the case study example I checked on chr22 where the coverage changes from 51.7 to 53.4 by removing the option).

So you can replace line 33 in snpCaller.py with the following:

stream = run_cmd("samtools depth %s -r %s:%d-%d %s|awk '{if ($3>=%d){sum+=$3; cnt+=1}}END{if(cnt>0){print sum/cnt}else{print 0}}'" %(params['sam_path'], chrom, start, end, regions, params['mincov']), output= True)

After this, you can just use samtools=1.9.

In the meanwhile, I am looking into using singularity for NanoCaller installation.

PanZiwei commented 3 years ago

@umahsn I tried to switch to singularity installation and it is running now. See the command below(May be helpful if you want to add singularity version):

# load singularity
module load singularity
# Pull images from Docker Hub (Convert docker image to Singularity format)
# command: singularity pull docker://[owner]/[image]:[tag]
singularity pull docker://genomicslab/nanocaller:0.3.3
singularity run nanocaller_0.3.3.sif python NanoCaller_WGS.py

~~BTW the conda version on my end is extremely slow. I will wait for the singularity one and keep you posted.~~

Update: I finally successfully run it with the latest code(Yeah!!)

Since there are multiple result files. Can you explain more about the following files? Which one should I use if I want to check SNP in a specific cell line? Thanks!

(base) [c-panz@sumner-log2 NanoCaller]$ ls
test.final.vcf.gz.tbi  test.indels.vcf.gz.tbi  test.snps.phased.vcf.gz        test.snps.vcf.gz   
test.final.vcf.gz      test.indels.vcf.gz      test.snps.phased.vcf.gz.tbi    test.snps.vcf.gz.tbi

umahsn commented 3 years ago

*snps.vcf.gz contains unphased SNP calls made by NanoCaller using a deep learning model.
snps.phased.vcf.gz contains SNP calls from snps.vcf.gz that are phase with WhatsHap. By default they have the same genotype as in the *snps.vcf.gz file, unless --enable_whatshap flag is set which can allow WhatsHap to change genotypes.
*indels.vcf.gz contains indel calls made by NanoCaller using multiple sequence alignment. Some of these calls might be indels combined with nearby substitutions or multi-nucleotide substitutions.
final.vcf.gz contains SNP calls from snps.phased.vcf.gz and indel calls from indels.vcf.gz. Before combining variants from the two files, some variants from indels.vcf.gz file are decomposed to extract SNP calls that are within 'indel' calls. The reason for this is that some SNPs cannot be detected during alignment pileup if they are located around an indel, and are thus only visible when the indel sequence is resolved properly.

I would recommend using *final.vcf.gz since it contains the largest number of SNPs. If you want to remove indel calls from this file, simply use the following command: rtg vcffilter -i test.final.vcf.gz --snps-only -o test.final.SNPs_only.vcf.gz

Also, if you are mostly interested in SNP calls, try running NanoCaller in --mode snps_unphased or --mode snps modes. These modes are several times faster than the default mode --mode both which does both SNP and indel calling. There is an insignificant change in SNP calling accuracy if you don't obtain additional SNPs from indel calling (since doing so can introduce more true positives and false positives) but you might miss out on a few SNPs.

This table shows NanoCaller runtime (in hours) for different modes using 16 CPUs.

umahsn commented 3 years ago

And thank you very much for showing how NanoCaller would work with Singularity, I'll definitely add this as well.

PanZiwei commented 3 years ago

Thank you so much for your help! Is there a default cutoff for coverage for SNP detection? I tried to check the final result in IGV and found the peak height is different. So the higher peak stands for more coverage? Screen Shot 2021-03-12 at 8 50 16 AM

Also, do you have any dataset/website in your mind to check specific SNPs based on GRCh38 to identify cell line? I found CCLE provides SNP information, but they are based on GRCh37.

Or maybe a more general question, can I use NanoCaller to identify cell-specific SNPs for cell line identification? If so, how?

Actually, our Nanopore data is a little wired, so I am using NanoCaller with GRCh38/GRCh37 alignment to make sure our cell line identification is correct. So right now my strategy is to use Minimap2 and Samtools to generate bam files, then use NanoCaller to call SNP, and check SNP result to see whether cell line-specific SNPs appears or not. Do you think it will work?

I tried NanoCaller with GRCh37(the link I used: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.fna.gz), but it seems that NanoCaller failed to deal with the chr-prefixes with the error VARIANT CALLING FAILED due to lack of suitable contigs. Please check if the contig names specified are consistent and present in the reference genome, BAM file and any --include_bed file.. However, most GRCh37 versions did not use chr-prefixes(See below). Update: I found that NanoCaller can deal with 1-22 directly, I will try to find a version that begin with 1-22 instead of NC_.

(base) [c-panz@sumner-log1 reference]$ head GRCh37.fa
>NC_000001.10 Homo sapiens chromosome 1, GRCh37.p13 Primary Assembly
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

WGLab / NanoCaller

Fail to do variant calling #10