bioinform / neusomatic

NeuSomatic: Deep convolutional neural networks for accurate somatic mutation detection

AssertionError on fast_file.fetch() #55

Closed. GodloveD closed this issue 4 years ago

GodloveD commented 4 years ago

I'm trying to run the following command:

preprocess.py \
    --mode train \
    --reference ${refGenome} \
    --region_bed ${bed} \
    --tumor_bam ${input_dir}/syntheticTumor.bam \
    --normal_bam ${input_dir}/syntheticNormal.bam \
    --work ${input_dir}/work_train_2 \
    --truth_vcf ${input_dir}/synthetic_snvs.vcf \
    --min_mapq 10 \
    --num_threads 20 \
    --scan_alignments_binary ${NEUSOMATIC_SCAN_ALIGNMENTS}

And it is giving me the error messages below. I'm assuming this is because the data that I'm using are not yielding any results and the fasta files are not being created? But maybe it is something totally different. Could someone comment?

It also might be useful to know that this is running using this container from DockerHub and I'm running it with Singularity.

Any help is appreciated. Thanks!

[...snip]
INFO 2020-02-04 14:03:29,122 find_records (ForkPoolWorker-59) Start find_records for worker 18
INFO 2020-02-04 14:03:29,125 find_records (ForkPoolWorker-60) Start find_records for worker 19
ERROR 2020-02-04 14:03:29,780 find_records (ForkPoolWorker-45) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,780 find_records (ForkPoolWorker-45) 
ERROR 2020-02-04 14:03:29,788 find_records (ForkPoolWorker-47) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,788 find_records (ForkPoolWorker-47) 
ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-49) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-49) 
ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-60) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,790 find_records (ForkPoolWorker-60) 
ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-55) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-55) 
ERROR 2020-02-04 14:03:29,796 find_records (ForkPoolWorker-42) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,797 find_records (ForkPoolWorker-42) 
ERROR 2020-02-04 14:03:29,798 find_records (ForkPoolWorker-41) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,799 find_records (ForkPoolWorker-41) 
ERROR 2020-02-04 14:03:29,803 find_records (ForkPoolWorker-51) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,803 find_records (ForkPoolWorker-51) 
ERROR 2020-02-04 14:03:29,805 find_records (ForkPoolWorker-57) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,805 find_records (ForkPoolWorker-57) 
ERROR 2020-02-04 14:03:29,808 find_records (ForkPoolWorker-58) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-58) 
ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-48) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,809 find_records (ForkPoolWorker-56) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,810 find_records (ForkPoolWorker-56) 
ERROR 2020-02-04 14:03:29,810 find_records (ForkPoolWorker-48) 
ERROR 2020-02-04 14:03:29,812 find_records (ForkPoolWorker-46) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,813 find_records (ForkPoolWorker-46) 
INFO 2020-02-04 14:03:29,845 find_records (ForkPoolWorker-53) N_none: 263 
INFO 2020-02-04 14:03:29,845 find_records (ForkPoolWorker-54) N_none: 239 
INFO 2020-02-04 14:03:29,846 find_records (ForkPoolWorker-50) N_none: 272 
INFO 2020-02-04 14:03:29,847 find_records (ForkPoolWorker-43) N_none: 250 
ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-44) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-44) 
ERROR 2020-02-04 14:03:29,855 find_records (ForkPoolWorker-59) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,856 find_records (ForkPoolWorker-59) 
ERROR 2020-02-04 14:03:29,929 find_records (ForkPoolWorker-52) Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1055, in find_records
    mt2, eqs2 = push_lr(fasta_file, mt, 2)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 728, in push_lr
    assert(fasta_file.fetch((c), p - 1, p - 1 + len(r)).upper() == r)
AssertionError

ERROR 2020-02-04 14:03:29,930 find_records (ForkPoolWorker-52) 
ERROR 2020-02-04 14:03:29,931 __main__             Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!

ERROR 2020-02-04 14:03:29,931 __main__             Aborting!
ERROR 2020-02-04 14:03:29,931 __main__             preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, filter_duplicate=False, first_do_without_qual=False, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, max_dp=100000, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=1, mode='train', normal_bam='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/syntheticNormal.bam', num_threads=20, reference='/data/godlovedc/slurm-job/hg38.fa', region_bed='/data/godlovedc/slurm-job/broad_MDA_mocha_overlap_cds.bed', restart=False, scan_alignments_binary='/opt/neusomatic/neusomatic/bin/scan_alignments', scan_maf=0.01, scan_window_size=2000, snp_min_af=0.05, snp_min_ao=3, snp_min_bq=10, truth_vcf='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/synthetic_snvs.vcf', tsv_batch_size=50000, tumor_bam='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/syntheticTumor.bam', work='/data/godlovedc/slurm-job/hapmap_output_multi_mda_snv/work_train_2')
Traceback (most recent call last):
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 441, in <module>
    raise e
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 435, in <module>
    args.scan_alignments_binary)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 335, in preprocess
    ensemble_beds[i] if ensemble_tsv else None, tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python//preprocess.py", line 129, in generate_dataset_region
    tsv_batch_size)
  File "/opt/neusomatic/neusomatic/python/generate_dataset.py", line 1461, in generate_dataset
    raise Exception("find_records failed!")
Exception: find_records failed!

msahraeian commented 4 years ago

@GodloveD Happy to see your interest in NeuSomatic. What is the reference fasta you are using, here?

msahraeian commented 4 years ago

Also, can you check whether your truth_vcf file has lowercase letters in the REF or ALT fields?

GodloveD commented 4 years ago

Thanks for the speedy reply @msahraeian! I'm actually a staff scientist and I'm working to help debug this on behalf of another scientist. The reference that we are using is hg38.fa. I will need to ask where it was obtained. As for the truth_vcf file, the REF field appears to contain nothing but dots (.) while the ALT field does have some lowercase letters as you suspected. Is this an issue? Thanks!

msahraeian commented 4 years ago

@GodloveD Yes, that's the issue. In the truth VCF you should have the actual reference and alternative bases in the REF and ALT columns. You need to fix the VCF. For instance:
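
The failing assertion in the traceback compares the upper-cased reference sequence at a record's position with the record's reference allele, so a REF that is lowercase or is not the true reference base would trip it. Below is a minimal sketch of the same comparison applied to a truth VCF. The function name `find_ref_mismatches`, the `fetch` callable, and the toy reference are hypothetical illustrations, not part of NeuSomatic; `fetch(chrom, start, end)` is assumed to behave like `pysam.FastaFile.fetch` with a 0-based, half-open interval.

```python
def find_ref_mismatches(vcf_lines, fetch):
    """Yield (chrom, pos, ref, expected) for records whose REF differs from the reference.

    `fetch(chrom, start, end)` is assumed to return the reference sequence for a
    0-based, half-open interval (hypothetical stand-in for a FASTA reader).
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        # Same shape as the assertion in the traceback: the fetched sequence is
        # upper-cased, the VCF REF is used as-is, so a lowercase REF mismatches.
        expected = fetch(chrom, pos - 1, pos - 1 + len(ref)).upper()
        if expected != ref:
            yield chrom, pos, ref, expected


# Toy in-memory reference standing in for a real FASTA file (illustration only).
toy_reference = {"chr1": "ACGTACGT"}


def toy_fetch(chrom, start, end):
    return toy_reference[chrom][start:end]


records = [
    "##fileformat=VCFv4.2\n",
    "chr1\t2\t.\tC\tT\t.\t.\t.\n",  # REF matches the reference
    "chr1\t3\t.\tg\tA\t.\t.\t.\n",  # lowercase REF: reported as a mismatch
]
mismatches = list(find_ref_mismatches(records, toy_fetch))
print(mismatches)
```

With a real reference, `fetch` could be the `fetch` method of a `pysam.FastaFile` opened on the indexed FASTA, and `vcf_lines` an open handle on the uncompressed truth VCF.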

GodloveD commented 4 years ago

Thanks again @msahraeian. We've changed all of the lowercase letters in the REF and ALT columns to upper case and it seems to be running now.
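
For anyone hitting the same error, the upper-casing step can be sketched roughly as below. This is not the exact script we used; the function name `uppercase_ref_alt` is hypothetical, and the sketch assumes a plain-text, tab-separated VCF with simple SNV/indel alleles (it would also upper-case contig names inside breakend ALT strings, which is probably not wanted for structural variants).

```python
def uppercase_ref_alt(vcf_lines):
    """Yield VCF lines with the REF and ALT columns converted to upper case."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            yield line  # not a complete record; leave it untouched
            continue
        fields[3] = fields[3].upper()  # REF is the 4th column
        fields[4] = fields[4].upper()  # ALT is the 5th column
        yield "\t".join(fields) + "\n"


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\ta\tg\t.\t.\t.\n",
]
fixed = list(uppercase_ref_alt(example))
print("".join(fixed), end="")
```

To apply it to a file, iterate over an open input handle and write each yielded line to a new output file, then pass that file to `--truth_vcf`.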

I was (obviously) mistaken earlier: the dots appear in the ID field, not the REF field.

FWIW, we are running this analysis on whole exome sequencing data instead of whole genome sequencing data.

Thanks again for the help! 😺