WGLab / NanoCaller

Variant calling tool for long-read sequencing data
MIT License
90 stars 8 forks source link

what do you want to fill in the “--chrom” option? I read the instructions and there is no document. #35

Open oranges7 opened 1 year ago

oranges7 commented 1 year ago

I want to call on all chromosomes. How should I write them?

umahsn commented 1 year ago

Hi,

What version of NanoCaller are you using? From v3 onwards the default it all chromosomes. For versions before that you needed a separate command NanoCaller_WGS for whole genome.

oranges7 commented 1 year ago

Hi,

I should be using the latest version, because I installed it with bioconda this morning, but it requires-- chrom

usage: NanoCaller [-h] [-mode MODE] [-seq SEQUENCING] [-cpu CPU] [-mincov MINCOV] [-maxcov MAXCOV] [-keep_bam] [-o OUTPUT] [-prefix PREFIX] [-sample SAMPLE] [-include_bed INCLUDE_BED] [-exclude_bed EXCLUDE_BED] [-start START] [-end END] [-p PRESET] -bam BAM -ref REF -chrom CHROM [-snp_model SNP_MODEL] [-min_allele_freq MIN_ALLELE_FREQ] [-min_nbr_sites MIN_NBR_SITES] [-nbr_t NEIGHBOR_THRESHOLD] [-sup] [-indel_model INDEL_MODEL] [-ins_t INS_THRESHOLD] [-del_t DEL_THRESHOLD] [-win_size WIN_SIZE] [-small_win_size SMALL_WIN_SIZE] [-impute_indel_phase] [-phase_bam] [-enable_whatshap] NanoCaller: error: the following arguments are required: -bam/--bam, -ref/--ref, -chrom/--chrom

I want to calling on the whole genome, how should I do it? Thank you for your reply.

oranges7 commented 1 year ago

Hi,

If possible, can you give me the code of the training process? I want to use more data to train the model. Look forward to your reply

umahsn commented 1 year ago

Hi,

It is possible the conda is still installing the older version by default. You can specify a version for conda to install: conda install -c bioconda nanocaller=3.0.1. You can check the version of NanoCaller installed via NanoCaller --version or conda list.

If you are using NanoCaller v3 and above, please use NanoCaller --bam BAM --ref REF with the rest of the options for whole genome variant calling. You do not need to specify any chromosomes or regions for whole genome.

If you want to use the NanoCaller version below 3, which it seems like you have installed, please use NanoCaller_WGS --bam BAM --ref REF with the rest of the options. In version 2 of NanoCaller, we have a command NanoCaller_WGS for whole genome variant calling which calls NanoCaller command on each chromosome.

But in version 3 of NanoCaller, we have combined the functionality into just one entry point so now in v3+, NanoCaller command does whole genome variant calling by default, and will only do chromosome specific variant calling if --regions option is used.

We will release the training code in the next week or two. I will post an update here when I do so.

oranges7 commented 1 year ago

Hi,

Thank you very much for your reply. I have updated the version. Thank you for your patient answer. I look forward to your training code.

oranges7 commented 1 year ago

Hi,

I encountered another problem, snp calls will be completed directly, generating an empty file. Because I'm running other programs, is it because I don't have enough memory? How can I correct it? Looking forward to your reply.

2023-03-24 15:58:07.022608: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /media/usb3/gyc/vl/nano_hg004_30x/args

SNP Calling Progress: 0%| | 0/6362 [00:38<?, ?it/s]

2023-03-24 15:59:01.451540: Combining SNP calls.

2023-03-24 15:59:01.453847: Compressing and indexing SNP calls. Writing to /tmp/bcftools.fnzF04 Merging 0 temporary files Cleaning Done

oranges7 commented 1 year ago

Hi,

I changed the server and can run normally, but why is the result compared with the benchmark file with hap.py, the final result will be very bad?

NanoCaller --bam /home/gyc/ont/fastq/HG001_NBT2018_Guppy_4.2.2.fastq.gz.bam \ --ref /home/gyc/ont/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \ --cpu 40 \ --output /home/gyc/ont/nano \ --preset ont

Looking forward to your reply.

oranges7 commented 1 year ago

Hi,

I am still looking forward to your process of raising features and training models. Thank you very much for your hard work.

umahsn commented 1 year ago

Hi,

Sorry for the delay, we are working on releasing the training code, and will need a couple of weeks to get it done because we are trying to finish some other projects at the moment.

Best, Umair

oranges7 commented 1 year ago

Hi,

Thank you for your reply in your busy schedule! I have encountered some questions. I wonder if you can answer them. “”” (nano) [gyc@six170sever hg004]$ NanoCaller --bam hg004_v4_30x.bam --ref /home/gyc/ont/seq/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output /media/usb3/gyc/vl/nano_try --cpu 10 --mode snps --phase --sequencing ont

2023-05-04 09:05:21.702487: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /media/usb3/gyc/vl/nano_try/args

SNP Calling Progress: 0%| | 0/6362 [00:30<?, ?it/s]

2023-05-04 09:05:54.798367: Combining SNP calls.

2023-05-04 09:05:54.801669: Compressing and indexing SNP calls. Writing to /tmp/bcftools.CyNo4Y Merging 0 temporary files Cleaning Done

2023-05-04 09:05:54.975464: SNP calling completed. Time taken= 30.9631 “””

The steps “snp calling” when I ran nanocalller did not seem to run and were completed directly. Do you know what the possible reason is?

oranges7 commented 1 year ago

Hi,

I changed the server and ran it with the following error. “”” (nano) [ywenjing@localhost hg004]$ NanoCaller --bam hg004_v4_30x.bam --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output /mnt/ywenjing/gyc/hg004/out/ --cpu 20 --preset ont

2023-05-04 10:56:38.121256: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /mnt/ywenjing/gyc/hg004/out/args

SNP Calling Progress: 39%|????????????????????????????????? | 2493/6362 [29:50<29:41, 2.17it/s]Process Process-8: Traceback (most recent call last): File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/snpCaller.py", line 80, in caller pos, test_ref, x_test, dp, freq, coverage = get_snp_testing_candidates(params, chunk) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_SNP_pileups.py", line 136, in get_snp_testing_candidates r=ref_dict[v_pos] KeyError: 21950001 “”” Then it will quit snp calling and run indel calling, but will also report an error.

“”” Process Process-31: Traceback (most recent call last): File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 200, in caller indel_run(params, indel_dict, job_Q, counter_Q, indel_files_list) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 59, in indel_run pos, x0_test, x1_test, x2_test, alleles_seq, phase = get_indel_testing_candidates(params, chunk) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_indel_pileups.py", line 174, in get_indel_testing_candidates ref_dict={j:s.upper() if s in 'AGTC' else 'N' for j,s in zip(range(max(1,start-200),end+400+1),fastafile.fetch(chrom,max(1,start-200)-1,end+400)) } File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch KeyError: "sequence 'chr14' not present"

“”” Is it because of my bam file problem? I use HG004 subsampled to 30x.

umahsn commented 1 year ago

It seems like the problem is that the BAM file has different names for chromosomes than the reference FASTA file. Can you check the header of the BAM file and see if the reference chromosome names agree with the ones in FASTA file or the FASTA index file?

On Wed, May 3, 2023, 10:53 PM oranges7 @.***> wrote:

Hi,

I changed the server and ran it with the following error. “”” (nano) @.*** hg004]$ NanoCaller --bam hg004_v4_30x.bam --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output /mnt/ywenjing/gyc/hg004/out/ --cpu 20 --preset ont

2023-05-04 10:56:38.121256: Starting NanoCaller.

NanoCaller command and arguments are saved in the following file: /mnt/ywenjing/gyc/hg004/out/args

SNP Calling Progress: 39%|????????????????????????????????? | 2493/6362 [29:50<29:41, 2.17it/s]Process Process-8: Traceback (most recent call last): File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/snpCaller.py", line 80, in caller pos, test_ref, x_test, dp, freq, coverage = get_snp_testing_candidates(params, chunk) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_SNP_pileups.py", line 136, in get_snp_testing_candidates r=ref_dict[v_pos] KeyError: 21950001 “”” Then it will quit snp calling and run indel calling, but will also report an error.

“”” Process Process-31: Traceback (most recent call last): File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap self.run() File "/home/ywenjing/anaconda3/envs/nano/lib/python3.8/multiprocessing/process.py", line 108, in run self._target(*self._args, **self._kwargs) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 200, in caller indel_run(params, indel_dict, job_Q, counter_Q, indel_files_list) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/indelCaller.py", line 59, in indel_run pos, x0_test, x1_test, x2_test, alleles_seq, phase = get_indel_testing_candidates(params, chunk) File "/home/ywenjing/anaconda3/envs/nano/bin/nanocaller_src/generate_indel_pileups.py", line 174, in get_indel_testing_candidates ref_dict={j:s.upper() if s in 'AGTC' else 'N' for j,s in zip(range(max(1,start-200),end+400+1),fastafile.fetch(chrom,max(1,start-200)-1,end+400)) } File "pysam/libcfaidx.pyx", line 301, in pysam.libcfaidx.FastaFile.fetch KeyError: "sequence 'chr14' not present"

“”” Is it because of my bam file problem? I use HG004 subsampled to 30x.

— Reply to this email directly, view it on GitHub https://github.com/WGLab/NanoCaller/issues/35#issuecomment-1534061640, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIRI4S3TXJOYM5AAYEWFT5LXEMR3LANCNFSM6AAAAAAV57VKMI . You are receiving this because you commented.Message ID: @.***>

oranges7 commented 1 year ago

Yes, I re-checked the index file and found that the file was corrupted. I fixed the file and the program could run normally. But the file on the other server is intact, but it doesn't work properly. I'm looking for a solution. Thank you for your reply!

oranges7 commented 1 year ago

Hi, I'm sorry to bother you again, but I'd like to know what kind of label you use in your training model. Can you give me a brief description?

oranges7 commented 1 year ago

Another question is, is the v_pos in the generate_indel_pileups.py file the last detected site?

umahsn commented 1 year ago

Hi,

I have added the training code here: https://github.com/WGLab/NanoCaller/tree/master/misc/training This code uses tensorflow 1.13 for training, but later we switched to tensorflow 2 and I only move the inference code to tf2 and not the training code. I will make another update with tf2 code for training. Use ground truth SNP VCF file for SNP model training and ground truth indel VCF file for indel training.

In the inference code, v_pos is the current position being scanned, prev is the last candidate site plus some buffer. Please note that the equivalent code for training is a little different.

oranges7 commented 1 year ago

Hi,

I used nanocaller to run out of results, using hap.py for analysis, found that the precision is very poor. I am using Guppy4.2.2-based hg004 data, down to 50x, using the following command NanoCaller --bam hg004_50x.bam --ref GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --output nano --cpu 20 --preset ont What adjustments or attempts should I make to improve precision? Am I going to add the "--exclude_bed" command?

umahsn commented 1 year ago

Hi,

Can you give some information about how you are evaluating the performance, such as: 1) what benchmark dataset are you using? 2) are you using a high confidence interval for evaluation? 3) are you separating SNP and indel performance? If yes, is the precision poor for SNP or indels?

Using --exclude_bed with hg19/hg38 parameter will skip telomere and centromere regions from variant calling so that could improve precision, but usually high confidence intervals in benchmarking dont include these regions for evaluation anyway. So it depends upon whether you are using a high confidence interval for evaluation or not.

You can also try to increase '--min_allele_freq' parameter to 0.2 or 0.25 (instead of default of 0.15), and --mincov parameter to 8 (instead of default of 4); both of these will improve precision.

oranges7 commented 1 year ago

Hi,

1) The benchmark dataset I use is HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz. 2) I didn't use a high confidence interval. 3) According to hap.py, the accuracy of SNP is about 55%. The accuracy of indel is about 0.06%. I'll try to run the program again with the '--min_allele_freq' parameter.

Thank you for your reply.

umahsn commented 1 year ago

You should use a high confidence interval such as: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark_noinconsistent.bed. Without high-confidence interval, you can end up with regions that do not have reliable SNP calls, and SNP calls from short-read sequencing for such regions are typically removed from GIAB benchmark datasets (hence you'll get false positives but not a lot of false negatives from NanoCaller).

Whereas for indels, it is ideal to remove homopolymer regions from evaluation since Nanopore sequencing is not reliable for homopolymer region (especially R9.4.1 datasets).

Can you try using the evaluation strategy shown here (https://github.com/WGLab/NanoCaller/blob/master/docs/ONT%20Case%20Study.md) which uses RTG-tool's vcfeval function? You can just replace HG002 links with HG004 links from here: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG004_NA24143_mother/NISTv4.2.1/GRCh38/

oranges7 commented 1 year ago

Hi,

I tried to use a high confidence interval. The results of rtgtools and hap.py are similar and are consistent with yours. The key is to look at the data with a quality value above 120. Thank you very much for your guidance.