
NeuSomatic: Deep convolutional neural networks for accurate somatic mutation detection

scan_alignment fails #10

Closed · vendig closed this issue 5 years ago

vendig commented 6 years ago

Hi. I am running DREAM Challenge dataset 3 and scan_alignments fails. What could be the issue? Thank you.

Input:

%sh
export PYTHONPATH="/databricks/python/local/lib/python2.7/site-packages:$PYTHONPATH"  # this is just a hack, don't use this code
cd /local_disk0/neu2/neusomatic-master/test/example

Stand-alone NeuSomatic test

/usr/bin/python ../../neusomatic/python/preprocess.py \
  --mode call \
  --reference /local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa \
  --region_bed /local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.region.bed \
  --tumor_bam /dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam \
  --normal_bam /dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.normal.bam \
  --work work_standalone \
  --scan_maf 0.05 \
  --min_mapq 10 \
  --snp_min_af 0.05 \
  --snp_min_bq 20 \
  --snp_min_ao 10 \
  --ins_min_af 0.05 \
  --del_min_af 0.05 \
  --num_threads 1 \
  --scan_alignments_binary ../../neusomatic/bin/scan_alignments

Output:

ERROR 2018-09-21 11:28:03,988 run_scan_alignments (PoolWorker-1) Command '['../../neusomatic/bin/scan_alignments', '--ref', '/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa', '-b', '/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam', '-L', 'work_standalone/work_tumor_without_q/region_3731.bed', '--out_vcf_file', 'work_standalone/work_tumor_without_q/work.3731/candidates.vcf', '--out_count_file', 'work_standalone/work_tumor_without_q/work.3731/count.bed', '--window_size', '2000', '--min_af', '0.05', '--min_mapq', '10', '--num_thread', '1']' returned non-zero exit status -11
ERROR 2018-09-21 11:28:03,989 run_scan_alignments (PoolWorker-1) Please check error log at work_standalone/work_tumor_without_q/work.3731/scan.err
ERROR 2018-09-21 11:28:03,989 main Traceback (most recent call last):
  File "../../neusomatic/python/preprocess.py", line 389, in <module>
    args.scan_alignments_binary)
  File "../../neusomatic/python/preprocess.py", line 234, in preprocess
    calc_qual=False, dbsnp_regions=[])
  File "../../neusomatic/python/preprocess.py", line 50, in process_split_region
    calc_qual=calc_qual)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/scan_alignments.py", line 136, in scan_alignments
    raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

ERROR 2018-09-21 11:28:03,989 main Aborting!
ERROR 2018-09-21 11:28:03,990 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.normal.bam', num_threads=1, reference='/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.fa', region_bed='/local_disk0/neu2/neusomatic-master/dream_dataset/Homo_sapiens.GRCh38.dna.toplevel.region.bed', restart=False, scan_alignments_binary='../../neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/dbfs/mnt/shenli9/neu-vendi/synthetic.challenge.set3.tumor.bam', work='work_standalone')
Traceback (most recent call last):
  File "../../neusomatic/python/preprocess.py", line 395, in <module>
    raise e
Exception: scan_alignments failed!

Attachment: log.docx

roger-liu-bina commented 5 years ago

@vendig Thanks for reporting. Can you share the log with us, e.g. work_standalone/work_tumor_without_q/work.0/scan.err?

vendig commented 5 years ago

Output from work_standalone/work_tumor_without_q/work.0/scan.err:

../../neusomatic/bin/scan_alignments --ref /local_disk0/neu2/dream_dataset/chr19.fa -b /dbfs/mnt/shenli9/neu-vendi/chr19.tumor.bam -L work_standalone/work_tumor_without_q/region_2.bed --out_vcf_file work_standalone/work_tumor_without_q/work.2/candidates.vcf --out_count_file work_standalone/work_tumor_without_q/work.2/count.bed --window_size 2000 --min_af 0.05 --min_mapq 10 --num_thread 1

the bam file and the reference do not match. exit.. please check the bam header and the reference file.

OK, the error is clear now. I tried two kinds of datasets and always got the same error. Might the problem be that the BAM and reference file headers sometimes name chromosomes differently, e.g. chr1 vs. chr01?

roger-liu-bina commented 5 years ago

You can use samtools view -H to check the BAM header. Among the things to check: the contig order and names in the BAM have to match those in the reference. The bottom line is that you should use the same reference that was used for the alignment.
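
For example, a quick way to compare contig names and lengths on both sides (a rough sketch; file names are placeholders for your own data):

samtools view -H tumor.bam | grep '^@SQ'    # contig names (SN) and lengths (LN) recorded in the BAM
cut -f1,2 reference.fasta.fai               # contig names and lengths in the indexed reference

If the two listings differ in names, lengths, or order, scan_alignments will refuse to run.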

vendig commented 5 years ago

Thank you a lot. Unfortunately, scan_alignments still fails. The headers of the reference and BAM files match now. The error:

Part of the error:

ERROR 2018-09-28 11:18:52,602 run_scan_alignments (PoolWorker-4) Command '['../../neusomatic/bin/scan_alignments', '--ref', '/local_disk0/hg19%2FHomo_sapiens_assembly19_fromBroad.fasta', '-b', '/local_disk0/chr19.tumor.bam', '-L', 'work_standalone/work_tumor_without_q/region_119.bed', '--out_vcf_file', 'work_standalone/work_tumor_without_q/work.119/candidates.vcf', '--out_count_file', 'work_standalone/work_tumor_without_q/work.119/count.bed', '--window_size', '2000', '--min_af', '0.05', '--min_mapq', '10', '--num_thread', '1']' returned non-zero exit status -6
ERROR 2018-09-28 11:18:52,602 run_scan_alignments (PoolWorker-4) Please check error log at work_standalone/work_tumor_without_q/work.119/scan.err
ERROR 2018-09-28 11:18:52,602 main Traceback (most recent call last):
  File "../../neusomatic/python/preprocess.py", line 389, in <module>
    args.scan_alignments_binary)
  File "../../neusomatic/python/preprocess.py", line 234, in preprocess
    calc_qual=False, dbsnp_regions=[])
  File "../../neusomatic/python/preprocess.py", line 50, in process_split_region
    calc_qual=calc_qual)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/scan_alignments.py", line 136, in scan_alignments
    raise Exception("scan_alignments failed!")
Exception: scan_alignments failed!

ERROR 2018-09-28 11:18:52,602 main Aborting!
ERROR 2018-09-28 11:18:52,603 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/local_disk0/chr19.normal.bam', num_threads=32, reference='/local_disk0/hg19%2FHomo_sapiens_assembly19_fromBroad.fasta', region_bed='/dbfs/mnt/grobeve1/hg19%2FHomo_sapiens_assembly19_fromBroad.bed', restart=False, scan_alignments_binary='../../neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/local_disk0/chr19.tumor.bam', work='work_standalone')
Traceback (most recent call last):
  File "../../neusomatic/python/preprocess.py", line 395, in <module>
    raise e
Exception: scan_alignments failed!

Output of: work_standalone/work_tumor_without_q/work.242/scan.err

../../neusomatic/bin/scan_alignments --ref /local_disk0/hg19%2FHomo_sapiens_assembly19_fromBroad.fasta -b /local_disk0/chr19.tumor.bam -L work_standalone/work_tumor_without_q/region_242.bed --out_vcf_file work_standalone/work_tumor_without_q/work.242/candidates.vcf --out_count_file work_standalone/work_tumor_without_q/work.242/count.bed --window_size 2000 --min_af 0.05 --min_mapq 10 --num_thread 1

On region 8:? [13066132,13068132)
Aligned read number: 0
On region 8:? [13266131,13268132)
Aligned read number: 0
On region 8:? [13466131,13468132)
Aligned read number: 0
On region 8:? [13666131,13668132)
Aligned read number: 0
On region 8:? [13866131,13868132)
Aligned read number: 0
On region 8:? [14066131,14068132)
Aligned read number: 0
On region 8:? [14266131,14268132)
Aligned read number: 0
On region 8:? [14466131,14468132)
Aligned read number: 0
On region 8:? [14666131,14668132)
Aligned read number: 0

Another scan.err: work_standalone/work_tumor_without_q/work.119/scan.err

../../neusomatic/bin/scan_alignments --ref /local_disk0/hg19%2FHomo_sapiens_assembly19_fromBroad.fasta -b /local_disk0/chr19.tumor.bam -L work_standalone/work_tumor_without_q/region_119.bed --out_vcf_file work_standalone/work_tumor_without_q/work.119/candidates.vcf --out_count_file work_standalone/work_tumor_without_q/work.119/count.bed --window_size 2000 --min_af 0.05 --min_mapq 10 --num_thread 1

On region 19:? [53397679,53399679)
Aligned read number: 623

Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag! Warning: no MD tag!
terminate called after throwing an instance of 'std::out_of_range'
  what(): basic_string::substr: __pos (which is 105) > this->size() (which is 101)

Does it mean that the .bai or .fai indexes do not match between the reference and the BAM file? I also see that MD tags are needed, but I don't think that is the reason it fails? Thank you. Best

msahraeian commented 5 years ago

@vendig I guess there are still some discrepancies in your BAM file. Would you please upload the header of your BAM (using samtools view -H chr19.tumor.bam) and your reference .fai file? I would also suggest checking your BAM file with Picard ValidateSamFile, explained here. For instance, you can run java -jar picard.jar ValidateSamFile I=chr19.bam R=ref.fa MODE=SUMMARY and see if it reports any problems.

We also need the MD tag in the BAM file. As explained in the README, you can make sure your BAM has MD tags by:

samtools calmd -@ num_threads -b alignment.bam reference.fasta  > alignment.md.bam 
samtools index alignment.md.bam
roger-liu-bina commented 5 years ago

@vendig the MD tag is needed to proceed. Please run @msahraeian's command first and see if that helps.

vendig commented 5 years ago

Hi. I checked both BAM files (tumor and normal) with the Picard tool. No errors were detected (snapshots attached). After adding MD tags and running again, a new error appeared:

ERROR 2018-09-28 17:33:46,177 prep_data_single_tabix (PoolWorker-26820) could not create iterator for region 'X:104318492-104318505'
ERROR 2018-09-28 17:33:46,177 prep_data_single_tabix (PoolWorker-26821) Traceback (most recent call last):
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 574, in prep_data_single_tabix
    min_cov=min_cov)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 300, in prepare_info_matrices_tabix
    ref_file, normal_count_bed, record, matrix_base_pad)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 49, in get_variant_matrix_tabix
    chrom, pos - matrix_base_pad, pos + matrix_base_pad)
  File "pysam/libctabix.pyx", line 509, in pysam.libctabix.TabixFile.fetch
ValueError: could not create iterator for region 'X:104318512-104318525'

ERROR 2018-09-28 17:33:46,177 prep_data_single_tabix (PoolWorker-26821) could not create iterator for region 'X:104318512-104318525'

ERROR 2018-09-28 17:33:46,180 main Aborting!
ERROR 2018-09-28 17:33:46,180 main preprocess.py failure on arguments: Namespace(dbsnp_to_filter=None, del_merge_min_af=0, del_min_af=0.05, ensemble_tsv=None, good_ao=10, ins_merge_min_af=0, ins_min_af=0.05, long_read=False, matrix_base_pad=7, matrix_width=32, merge_r=0.5, min_ao=1, min_dp=5, min_ev_frac_per_col=0.06, min_mapq=10, mode='call', normal_bam='/local_disk0/chr19.normal.md.bam', num_threads=32, reference='/local_disk0/hg19%2FHomo_sapiens_assembly19_fromBroad.fasta', region_bed='/dbfs/mnt/grobeve1/hg19%2FHomo_sapiens_assembly19_fromBroad.bed', restart=False, scan_alignments_binary='../../neusomatic/bin/scan_alignments', scan_maf=0.05, scan_window_size=2000, skip_without_qual=False, snp_min_af=0.05, snp_min_ao=10.0, snp_min_bq=20.0, truth_vcf=None, tsv_batch_size=50000, tumor_bam='/local_disk0/chr19.tumor.md.bam', work='work_standalone')
Traceback (most recent call last):
  File "../../neusomatic/python/preprocess.py", line 395, in <module>
    raise e
Exception: prep_data_single_tabix failed!

Please also find the headers of the files (normal, tumor, and reference): header_chr19_normal.txt header_chr19_tumor.txt chromosomes.txt

Checking with the Picard tool:

(Screenshots attached: screen shot 2018-10-01 at 00 24 48, screen shot 2018-10-01 at 00 22 06)
msahraeian commented 5 years ago

@vendig It seems the previous error happened due to missing MD tags. So, we are almost there :) For the new error, I tried to run your test case (DREAM Challenge stage 3 on chr19), but I was not able to reproduce your error, and my run finished successfully. Would you please make sure you have the right pysam version (0.14.1)? Also, please make sure you delete the work folder (work_standalone) after fixing the MD tag problem, and do a fresh run. Another suggestion is to restrict the region_bed file to your target region: if you only care about chromosome 19, restrict it to chr19, as in the sketch below.
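
For example, assuming the contigs in your BED file are named without a "chr" prefix (adjust the pattern to your naming otherwise), something like:

awk -F '\t' '$1 == "19"' region.bed > region_chr19.bed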

vendig commented 5 years ago

Hi @msahraeian. This time I restricted the region_bed file to only chromosome 19. It works!! During preprocessing a warning still popped up:

WARNING 2018-10-03 10:34:31,142 prepare_info_matrices_tabix Skip ['19', 58938658, 'T', 'TCTTTC', '723'] for low cov 2<5
WARNING 2018-10-03 10:34:32,590 prepare_info_matrices_tabix Skip ['19', 58700025, 'A', 'ATG', '813'] for low cov 2<5

Might this be due to a different pysam version? I have pysam 0.15.0 installed. Thank you for all the support.

vendig commented 5 years ago

Please don't close the issue yet.

msahraeian commented 5 years ago

@vendig Sounds good. I still suggest using pysam 0.14.1 to avoid compatibility issues. BTW, those warnings are normal and you can ignore them.

vendig commented 5 years ago

I also tried restricting the BED file to chr19 and chrX, and I changed the pysam version to 0.14.1. The same error appears as before, when the BED file contained all chromosomes:

ERROR 2018-10-04 09:13:18,241 prep_data_single_tabix (PoolWorker-724) Traceback (most recent call last):
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 574, in prep_data_single_tabix
    min_cov=min_cov)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 300, in prepare_info_matrices_tabix
    ref_file, normal_count_bed, record, matrix_base_pad)
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/generate_dataset.py", line 49, in get_variant_matrix_tabix
    chrom, pos - matrix_base_pad, pos + matrix_base_pad)
  File "pysam/libctabix.pyx", line 499, in pysam.libctabix.TabixFile.fetch
ValueError: could not create iterator for region 'X:104318492-104318505'

Do you have an explanation why this happens? I would like to know before starting a run on a whole-genome dataset.

msahraeian commented 5 years ago

@vendig can you restrict the tumor and normal bams you use to this region using samtools view -h tumor.bam X:104318492-104318505|samtools view -bS > tumor_x.bam and samtools view -h normal.bam X:104318492-104318505|samtools view -bS > normal_x.bam and send me the outputs tumor_x.bam and normal_x.bam

vendig commented 5 years ago

Hi. The above command line does not work with my samtools version (0.1.19). Unfortunately, I am not allowed to change the version at the moment. If I understand correctly, the header of the .bam file should be changed to X:104318492-104318505?

msahraeian commented 5 years ago

@vendig Sorry, that was my mistake. The right command is samtools view -h tumor.bam X:104318492-104318505|samtools view -bS - > tumor_x.bam and samtools view -h normal.bam X:104318492-104318505|samtools view -bS - > normal_x.bam, and it should work with samtools version 0.1.19 too. You don't need to change the header. Please only send me the output BAMs.

In general, you need the packages listed in the README at their stated versions to make sure everything works. With anaconda/miniconda you can easily install all the packages locally without changing the system-installed libraries, as sketched below.
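
For example, a minimal sketch of such an environment (the environment name is arbitrary, and the exact package list and versions should follow the README):

conda create -n neusomatic -c bioconda python=2.7 pysam=0.14.1 samtools
source activate neusomatic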

vendig commented 5 years ago

@msahraeian please find the link with the output bam files. Link: https://drive.google.com/open?id=179_3VqqE27nQnOvbIMlkQDXE9AgWMUQK

msahraeian commented 5 years ago

@vendig Thanks. I think I fixed this bug in #15. Please pull the master branch and run again.

vendig commented 5 years ago

It works perfectly! Thank you a lot.

vendig commented 5 years ago

Hi. Now that I am finally running a whole-genome dataset, an error appears. I am wondering whether this is a memory issue. I tried a few cluster configurations and always ended up with scan_alignments failing:

From scan.err:

Error in `../../neusomatic/bin/scan_alignments': munmap_chunk(): invalid pointer: 0x0000000003192a70
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f706e36e7e5]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x1a8)[0x7f706e37b698]
../../neusomatic/bin/scan_alignments(_ZNK10neusomatic10MSABuilderIN6SeqLib9BamRecordENS_3bio7VariantINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEiEEE14GetMSAwithQualEv+0x13b7)[0x4cea37]
../../neusomatic/bin/scan_alignments(main+0xace)[0x4b6d7e]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f706e317830]
../../neusomatic/bin/scan_alignments(_start+0x29)[0x4bf4f9]
======= Memory map: ========
00400000-0065d000 r-xp 00000000 fc:00 7084986 /local_disk0/neu2/neusomatic-master/neusomatic/bin/scan_alignments
0085d000-00865000 r--p 0025d000 fc:00 7084986 /local_disk0/neu2/neusomatic-master/neusomatic/bin/scan_alignments
00865000-00866000 rw-p 00265000 fc:00 7084986 /local_disk0/neu2/neusomatic-master/neusomatic/bin/scan_alignments
00866000-00869000 rw-p 00000000 00:00 0
015bd000-031b3000 rw-p 00000000 00:00 0 [heap]
7f706d6dc000-7f706d6f2000 r-xp 00000000 103:00 6042720 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f706d6f2000-7f706d8f1000 ---p 00016000 103:00 6042720 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f706d8f1000-7f706d8f2000 rw-p 00015000 103:00 6042720 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f706d8f2000-7f706db73000 rw-p 00000000 00:00 0
7f706dcb4000-7f706dcb7000 r-xp 00000000 103:00 6042714 /lib/x86_64-linux-gnu/libdl-2.23.so
7f706dcb7000-7f706deb6000 ---p 00003000 103:00 6042714 /lib/x86_64-linux-gnu/libdl-2.23.so
7f706deb6000-7f706deb7000 r--p 00002000 103:00 6042714 /lib/x86_64-linux-gnu/libdl-2.23.so
7f706deb7000-7f706deb8000 rw-p 00003000 103:00 6042714 /lib/x86_64-linux-gnu/libdl-2.23.so
7f706deb8000-7f706ded0000 r-xp 00000000 103:00 6042757 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f706ded0000-7f706e0cf000 ---p 00018000 103:00 6042757 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f706e0cf000-7f706e0d0000 r--p 00017000 103:00 6042757 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f706e0d0000-7f706e0d1000 rw-p 00018000 103:00 6042757 /lib/x86_64-linux-gnu/libpthread-2.23.so
7f706e0d1000-7f706e0d5000 rw-p 00000000 00:00 0
7f706e0d5000-7f706e0f6000 r-xp 00000000 103:00 5248162 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f706e0f6000-7f706e2f5000 ---p 00021000 103:00 5248162 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f706e2f5000-7f706e2f6000 r--p 00020000 103:00 5248162 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f706e2f6000-7f706e2f7000 rw-p 00021000 103:00 5248162 /usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
7f706e2f7000-7f706e4b7000 r-xp 00000000 103:00 6042701 /lib/x86_64-linux-gnu/libc-2.23.so
7f706e4b7000-7f706e6b7000 ---p 001c0000 103:00 6042701 /lib/x86_64-linux-gnu/libc-2.23.so
7f706e6b7000-7f706e6bb000 r--p 001c0000 103:00 6042701 /lib/x86_64-linux-gnu/libc-2.23.so
7f706e6bb000-7f706e6bd000 rw-p 001c4000 103:00 6042701 /lib/x86_64-linux-gnu/libc-2.23.so
7f706e6bd000-7f706e6c1000 rw-p 00000000 00:00 0
7f706e6c1000-7f706e7c9000 r-xp 00000000 103:00 6042729 /lib/x86_64-linux-gnu/libm-2.23.so
7f706e7c9000-7f706e9c8000 ---p 00108000 103:00 6042729 /lib/x86_64-linux-gnu/libm-2.23.so
7f706e9c8000-7f706e9c9000 r--p 00107000 103:00 6042729 /lib/x86_64-linux-gnu/libm-2.23.so
7f706e9c9000-7f706e9ca000 rw-p 00108000 103:00 6042729 /lib/x86_64-linux-gnu/libm-2.23.so
7f706e9ca000-7f706e9e3000 r-xp 00000000 103:00 6042780 /lib/x86_64-linux-gnu/libz.so.1.2.8
7f706e9e3000-7f706ebe2000 ---p 00019000 103:00 6042780 /lib/x86_64-linux-gnu/libz.so.1.2.8
7f706ebe2000-7f706ebe3000 r--p 00018000 103:00 6042780 /lib/x86_64-linux-gnu/libz.so.1.2.8
7f706ebe3000-7f706ebe4000 rw-p 00019000 103:00 6042780 /lib/x86_64-linux-gnu/libz.so.1.2.8
7f706ebe4000-7f706ebf3000 r-xp 00000000 103:00 6042700 /lib/x86_64-linux-gnu/libbz2.so.1.0.4
7f706ebf3000-7f706edf2000 ---p 0000f000 103:00 6042700 /lib/x86_64-linux-gnu/libbz2.so.1.0.4
7f706edf2000-7f706edf3000 r--p 0000e000 103:00 6042700 /lib/x86_64-linux-gnu/libbz2.so.1.0.4
7f706edf3000-7f706edf4000 rw-p 0000f000 103:00 6042700 /lib/x86_64-linux-gnu/libbz2.so.1.0.4
7f706edf4000-7f706ee1a000 r-xp 00000000 103:00 6042687 /lib/x86_64-linux-gnu/ld-2.23.so
7f706f006000-7f706f00c000 rw-p 00000000 00:00 0
7f706f018000-7f706f019000 rw-p 00000000 00:00 0
7f706f019000-7f706f01a000 r--p 00025000 103:00 6042687 /lib/x86_64-linux-gnu/ld-2.23.so
7f706f01a000-7f706f01b000 rw-p 00026000 103:00 6042687 /lib/x86_64-linux-gnu/ld-2.23.so
7f706f01b000-7f706f01c000 rw-p 00000000 00:00 0
7ffc1c4ae000-7ffc1c4e7000 rw-p 00000000 00:00 0 [stack]
7ffc1c5d7000-7ffc1c5da000 r--p 00000000 00:00 0 [vvar]
7ffc1c5da000-7ffc1c5dc000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

roger-liu-bina commented 5 years ago

@vendig I wonder which version you are running? We made some memory fixes recently. Can you show us the commit ID of your build?

msahraeian commented 5 years ago

@vendig please also let us know how many threads you are using and what per-thread memory you are requesting in your cluster.

vendig commented 5 years ago

Hi. I am using the latest version. Every time the cluster is switched on, it automatically updates to the latest version (Docker is already included, as well as the fix for chromosomes 19 and X).

Instance specifications (500 GB of SSD was allocated for all instances):

1. r4.4xlarge - 122 GB memory, 16 cores, 4 DBU (Databricks Unit, a unit of processing capability per hour). I used 32 threads. For this instance I did not get the error message shown above, but the command ran for more than 12 hours and was stuck on scan_alignments.

2. r4.8xlarge - 244 GB memory, 32 cores, 8 DBU. I used 32 threads here. Error shown above.

3. r5.12xlarge - 384 GB memory, 48 cores, 10.8 DBU. I used 80 threads here. Error shown above.

msahraeian commented 5 years ago

@vendig Would you please make sure your BAM files have MD tags (they can be produced with the command I mentioned earlier)? And when you say it is stuck, can you check the content of the output folder and the latest sub-folder/file created inside it?

The total run time depends on the alignment coverage, the parameters used, and the number of threads. For me, whole-genome 50x data on 28 threads (with a max memory of 256 GB) can take ~6-9 hours.

Also, can you check which region gives this error? You can find the region.bed associated with each scan_alignments run in the same folder where you find scan.err; see the sketch below.
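
For instance, a quick way to spot which sub-runs crashed (a rough sketch; adjust the path to your actual work directory):

grep -l 'Error\|terminate' work_standalone/work_tumor/work.*/scan.err

Each matching work.N folder then points you at the corresponding region file passed to scan_alignments via -L (region_N.bed in the logs above).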

vendig commented 5 years ago

Hi. I will have to rerun it, since I did not save the output folder. I was also searching a bit about the error, and it might not even be connected with a memory issue?

It is interesting that in my case alignment scanning always fails only at the last batch of files still to be scanned. If the error were a memory issue, would it always fail at the same step?

Yes, the BAM files have MD tags and are indexed. I will run again with 256 GB memory and 28 threads.

roger-liu-bina commented 5 years ago

@vendig Right, it might not be a memory issue. You may actually have found a bug. Is it public data? Would it be possible for you to share your BAM?

vendig commented 5 years ago

Yes, these are the DREAM stage 3 datasets. Each BAM file is 170 GB in size. I have both BAM files in an Amazon bucket. Let me know what would be the easiest way to share the files.

vendig commented 5 years ago

It failed again at the same step. Unfortunately, there was no "region.bed" inside the failed "work" folder. I can also share the failed "work" folder; it seems to me that only this particular one failed. One interesting detail: the alignment scan for "work_tumor_without_q" was successful, but "work_tumor" failed.

I downloaded the BAM files of DREAM dataset 3 from this page: https://console.cloud.google.com/storage/browser/public-dream-data/ (it seems this page does not work anymore).

Data can be found as well here: https://www.synapse.org/#!Synapse:syn2280639

I was wondering whether this error is specific to this particular dataset, or whether I can expect it on all big datasets. And would it be worth trying to run this dataset without the failed region (excluding the failed region from the BED file)?

msahraeian commented 5 years ago

@vendig I have the DREAM dataset and I am trying to rerun it to reproduce your error. One thing I suspect is that you have included all the decoy chromosomes in your region.bed, and that may cause the problem (although it should not). Can you try to run by including only primary chromosomes? You can get that BED file using:

cat reference.fasta.fai | awk -F "\t" '{print $1 "\t0\t" $2}' | awk -F "\t" '$1 ~ /^(chr)?[0-9XYMT]+$/' > region.bed

Meanwhile it would be great if you could share with us the failed "work" folder. Please also share the command line arguments you use.

Also, are you using the current master branch or the online Docker image? The Docker image is not the latest.

vendig commented 5 years ago

Hi. I reran it with a BED file restricted to only the primary chromosomes. It failed again, but this time already during the "work_tumor_without_q" preprocessing. Please find the link to the "work" folder:

https://drive.google.com/drive/folders/1nbNWGxlHI1YbCre3TDMZS7oM7SMu75Cl?usp=sharing

Command line used:

/databricks/python/bin/python ../../neusomatic/python/preprocess.py \
  --mode call \
  --reference /local_disk0/Homo_sapiens_assembly19.fasta \
  --region_bed /local_disk0/region_only_chr.bed \
  --tumor_bam /local_disk0/synthetic.challenge.set3.tumor.md.bam \
  --normal_bam /local_disk0/synthetic.challenge.set3.normal.md.bam \
  --work work_standalone \
  --scan_maf 0.05 \
  --min_mapq 10 \
  --snp_min_af 0.05 \
  --snp_min_bq 20 \
  --snp_min_ao 10 \
  --ins_min_af 0.05 \
  --del_min_af 0.05 \
  --num_threads 50 \
  --scan_alignments_binary ../../neusomatic/bin/scan_alignments

msahraeian commented 5 years ago

@vendig thanks. Would you please also share with me the failed work directory when you restrict to main chromosomes?

vendig commented 5 years ago

@msahraeian unfortunately, the folder was not saved. If it is really needed, I can rerun.

msahraeian commented 5 years ago

@vendig So, let me try running the dataset first.

msahraeian commented 5 years ago

@vendig Thanks for catching this bug. I think we have fixed it in #18. It was happening because some regions exceeded the chromosome length by one base. Please pull the master branch and run again.

msahraeian commented 5 years ago

@vendig and please build the binaries again.

vendig commented 5 years ago

Hi. scan_alignments goes through now, but unfortunately call.py fails.

Input:

CUDA_VISIBLE_DEVICES= /databricks/python/bin/python ../../neusomatic/python/call.py \
  --candidates_tsv work_standalone/dataset/*/candidates*.tsv \
  --reference /local_disk0/Homo_sapiens_assembly19.fasta \
  --out work_standalone \
  --checkpoint ../../neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth \
  --num_threads 16 \
  --batch_size 100

Output:

INFO 2018-10-19 04:47:29,346 main use_cuda: False
INFO 2018-10-19 04:47:29,346 call_neusomatic -----------------Call Somatic Mutations--------------------
INFO 2018-10-19 04:47:41,390 call_neusomatic Load pretrained model from checkpoint ../../neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth
INFO 2018-10-19 04:47:41,394 call_neusomatic tag: NeuSomatic_v0.1.0_standalone_Dream3_70purity
INFO 2018-10-19 04:47:58,053 call_neusomatic Run for candidate files: ['work_standalone/dataset/work.12/candidates_0.tsv', 'work_standalone/dataset/work.298/candidates_0.tsv', 'work_standalone/dataset/work.64/candidates_0.tsv', 'work_standalone/dataset/work.75/candidates_0.tsv', 'work_standalone/dataset/work.85/candidates_0.tsv', 'work_standalone/dataset/work.296/candidates_0.tsv', 'work_standalone/dataset/work.297/candidates_0.tsv', 'work_standalone/dataset/work.270/candidates_0.tsv', 'work_standalone/dataset/work.299/candidates_0.tsv', 'work_standalone/dataset/work.159/candidates_0.tsv', 'work_standalone/dataset/work.294/candidates_0.tsv', 'work_standalone/dataset/work.271/candidates_0.tsv', 'work_standalone/dataset/work.86/candidates_0.tsv', 'work_standalone/dataset/work.63/candidates_0.tsv', 'work_standalone/dataset/work.154/candidates_0.tsv', 'work_standalone/dataset/work.13/candidates_0.tsv', 'work_standalone/dataset/work.289/candidates_0.tsv']
INFO 2018-10-19 04:47:58,253 dataloader Len's of tsv files in this batch: [0, 0, 0, 0, 0, 194, 980, 1583, 4166, 4895, 6805, 9392, 13613, 15123, 16266, 16292, 17776]
INFO 2018-10-19 04:47:58,343 extract_info_tsv (PoolWorker-4) Loaded 0 candidates for work_standalone/dataset/work.75/candidates_0.tsv
INFO 2018-10-19 04:47:58,344 extract_info_tsv (PoolWorker-50) Loaded 0 candidates for work_standalone/dataset/work.85/candidates_0.tsv
INFO 2018-10-19 04:47:58,345 extract_info_tsv (PoolWorker-1) Loaded 0 candidates for work_standalone/dataset/work.12/candidates_0.tsv
INFO 2018-10-19 04:47:58,345 extract_info_tsv (PoolWorker-2) Loaded 0 candidates for work_standalone/dataset/work.298/candidates_0.tsv
INFO 2018-10-19 04:47:58,345 extract_info_tsv (PoolWorker-3) Loaded 0 candidates for work_standalone/dataset/work.64/candidates_0.tsv
INFO 2018-10-19 04:47:58,350 extract_info_tsv (PoolWorker-6) Loaded 194 candidates for work_standalone/dataset/work.296/candidates_0.tsv
INFO 2018-10-19 04:47:58,407 extract_info_tsv (PoolWorker-8) Loaded 980 candidates for work_standalone/dataset/work.297/candidates_0.tsv
INFO 2018-10-19 04:47:58,423 extract_info_tsv (PoolWorker-7) Loaded 1583 candidates for work_standalone/dataset/work.270/candidates_0.tsv
INFO 2018-10-19 04:47:58,502 extract_info_tsv (PoolWorker-10) Loaded 4895 candidates for work_standalone/dataset/work.159/candidates_0.tsv
INFO 2018-10-19 04:47:58,522 extract_info_tsv (PoolWorker-9) Loaded 4166 candidates for work_standalone/dataset/work.299/candidates_0.tsv
INFO 2018-10-19 04:47:58,567 extract_info_tsv (PoolWorker-11) Loaded 6805 candidates for work_standalone/dataset/work.294/candidates_0.tsv
INFO 2018-10-19 04:47:58,634 extract_info_tsv (PoolWorker-12) Loaded 9392 candidates for work_standalone/dataset/work.271/candidates_0.tsv
INFO 2018-10-19 04:47:58,758 extract_info_tsv (PoolWorker-13) Loaded 13613 candidates for work_standalone/dataset/work.86/candidates_0.tsv
INFO 2018-10-19 04:47:58,811 extract_info_tsv (PoolWorker-14) Loaded 15123 candidates for work_standalone/dataset/work.63/candidates_0.tsv
INFO 2018-10-19 04:47:58,816 extract_info_tsv (PoolWorker-15) Loaded 16266 candidates for work_standalone/dataset/work.154/candidates_0.tsv
INFO 2018-10-19 04:47:58,816 extract_info_tsv (PoolWorker-16) Loaded 16292 candidates for work_standalone/dataset/work.13/candidates_0.tsv
INFO 2018-10-19 04:47:58,835 extract_info_tsv (PoolWorker-17) Loaded 17776 candidates for work_standalone/dataset/work.289/candidates_0.tsv
INFO 2018-10-19 04:48:00,809 call_neusomatic N_dataset: 107085

It stays like this until it is terminated.

No error, but with no further progress it stays like that until the process is terminated. The problem is in the function "call_variants", at the line "outputs, _ = net(matrices)": it gets stuck at this step. Please let me know which data I should share with you - maybe the output of "scan_alignments"?

msahraeian commented 5 years ago

@vendig Please pull the latest commit and run call.py again. We have made some fixes and added more logging in #20. Please also try to reduce the number of threads if it still fails; I guess you may be running out of memory.
If it fails again, send me the log and one of your larger tsv files.

vendig commented 5 years ago

@msahraeian it now fails already at the dataloader.py step:

ERROR 2018-10-19 21:29:35,280 extract_info_tsv (PoolWorker-2) Traceback (most recent call last):
  File "/local_disk0/neu2/neusomatic-master/neusomatic/python/dataloader.py", line 101, in extract_info_tsv
    assert i + 1 == L
UnboundLocalError: local variable 'i' referenced before assignment

msahraeian commented 5 years ago

@vendig yes, sorry this is the error I just introduced. It should be good now. Please pull again.

vendig commented 5 years ago

Hi. It is still not working, and no log files can be found either. I don't think it is a memory issue, since in the first run only 10,000 candidates are loaded; in the chromosome 19 case the same amount was loaded and it went through. Here the first run freezes: it is not terminated and no errors are shown. The command is running but not doing anything. I did some analysis and the problem is here: function "call_variants", line "outputs, _ = net(matrices)".

Please find .tsv file of scan alignment here: https://drive.google.com/drive/folders/1nbNWGxlHI1YbCre3TDMZS7oM7SMu75Cl

msahraeian commented 5 years ago

@vendig The line you mentioned runs the CNN network. Please pull the master branch again, and run with --max_load_candidates 10000 --num_threads 4. If it fails, please send me the printed logs. I just tested your .tsv file and it went through smoothly.

vendig commented 5 years ago

Hi. I ran the code with the two additional options and it did not go through to the end to produce a VCF file. At least this time the process stopped. I could not find any log file either.

Output:

INFO 2018-10-21 17:33:23,561 main use_cuda: False
INFO 2018-10-21 17:33:23,561 call_neusomatic -----------------Call Somatic Mutations--------------------
INFO 2018-10-21 17:33:23,576 call_neusomatic Load pretrained model from checkpoint ../../neusomatic/models/NeuSomatic_v0.1.0_standalone_Dream3_70purity.pth
INFO 2018-10-21 17:33:23,579 call_neusomatic tag: NeuSomatic_v0.1.0_standalone_Dream3_70purity
INFO 2018-10-21 17:33:40,294 call_neusomatic Run for candidate files: ['work_standalone/dataset/work.12/candidates_0.tsv', 'work_standalone/dataset/work.298/candidates_0.tsv', 'work_standalone/dataset/work.64/candidates_0.tsv', 'work_standalone/dataset/work.75/candidates_0.tsv', 'work_standalone/dataset/work.85/candidates_0.tsv', 'work_standalone/dataset/work.296/candidates_0.tsv', 'work_standalone/dataset/work.297/candidates_0.tsv']
INFO 2018-10-21 17:33:40,295 dataloader (1000000, 1000000)
INFO 2018-10-21 17:33:40,295 dataloader (1000000, 1000000)
INFO 2018-10-21 17:33:40,297 dataloader Len's of tsv files in this batch: [0, 0, 0, 0, 0]
INFO 2018-10-21 17:33:40,304 extract_info_tsv (PoolWorker-1) Loaded 0 candidates for work_standalone/dataset/work.12/candidates_0.tsv
INFO 2018-10-21 17:33:40,304 extract_info_tsv (PoolWorker-2) Loaded 0 candidates for work_standalone/dataset/work.298/candidates_0.tsv
INFO 2018-10-21 17:33:40,304 extract_info_tsv (PoolWorker-3) Loaded 0 candidates for work_standalone/dataset/work.64/candidates_0.tsv
INFO 2018-10-21 17:33:40,304 extract_info_tsv (PoolWorker-4) Loaded 0 candidates for work_standalone/dataset/work.75/candidates_0.tsv
INFO 2018-10-21 17:33:40,304 extract_info_tsv (PoolWorker-1) Loaded 0 candidates for work_standalone/dataset/work.85/candidates_0.tsv
INFO 2018-10-21 17:33:40,305 dataloader Len's of tsv files in this batch: [194, 980]
INFO 2018-10-21 17:33:40,316 extract_info_tsv (PoolWorker-5) Loaded 194 candidates for work_standalone/dataset/work.296/candidates_0.tsv
INFO 2018-10-21 17:33:40,339 extract_info_tsv (PoolWorker-6) Loaded 980 candidates for work_standalone/dataset/work.297/candidates_0.tsv
INFO 2018-10-21 17:33:40,363 call_neusomatic N_dataset: 1174

vendig commented 5 years ago

Interestingly, I am bumping into the same problem when running only chromosome 19, which worked just two weeks ago. The test dataset does not go through anymore either. If it works in your case, then I guess it must be a problem with our settings.

msahraeian commented 5 years ago

@vendig yes, the test script works fine for me. If it fails for you, it means some of your settings may have problems (I suspect your PyTorch library; you may try reinstalling it). You may also use the NeuSomatic Docker image for v0.1.2 for CPU-only runs. You can test Docker as shown here. You can follow the example in https://github.com/bioinform/neusomatic/blob/master/test/docker_test.sh on how to invoke Docker to run the preprocess/call/postprocess steps for your dataset.
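
For example, to check which library versions your runs actually pick up (a quick sketch):

which python                                    # make sure this is the interpreter you expect
python -c "import torch; print(torch.__version__)"
python -c "import pysam; print(pysam.__version__)"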

vendig commented 5 years ago

@msahraeian it was our problem. After installing conda, a different version of the library was being picked up. The issue is finally resolved; I am sorry about that. Unfortunately, Docker cannot be used on Databricks, so this is most probably the only way I can proceed. Thank you a lot.

vendig commented 5 years ago

Hi. It might be a silly question, but I am wondering why the BAM file header must completely match the reference? Suppose I am interested only in the real chromosomes, without any decoy contigs, and I have a reference that matches the real chromosomes in the BAM but does not include the decoys: can I somehow run the files through? Unfortunately, only the BAM files are provided, without a reference, and I have problems finding the original reference. Thank you a lot.

At the moment when I run reference with only real chromosome and bam files with both types the error appears:

the bam file and the reference do not match. exit.. please check the bam header and the reference file

roger-liu-bina commented 5 years ago

@vendig Thanks for raising the point. The major reason is fool-proofing. Say a user aligned against b37 but provided the hg38 reference: this check would immediately tell the user. This is a good practice that we learned from GATK. Usually you want to use the same reference your FASTQs were aligned to, wouldn't you?

vendig commented 5 years ago

It is a problem when BAM files from public or restricted datasets are used and the complete matching reference is not provided. At the moment I am struggling with a dataset for which a completely matching reference cannot be found. I can find a matching reference sequence for the real chromosomes, such as:

@SQ SN:chr1 LN:247249719
@SQ SN:chr2 LN:242951149

but unfortunately some additional contigs are included, whose source is a mystery to me. Probably they took unplaced fragments of the real chromosomes.

@SQ SN:chr1_random LN:1663265
@SQ SN:chr2_random LN:185571

But since I am interested only in chr1, chr2, etc., this should not be a problem? Thank you.

msahraeian commented 5 years ago

@vendig As @roger-liu-bina mentioned, for a confident process we check for an exact match between contigs by default. We will try to accommodate optional handling of partially overlapping contigs across the reference and BAM in the near future. Meanwhile, can you share with me the header of your BAM (using samtools view -H alignment.bam), if it is not private? I may be able to help you find the matching reference. Another option that may help you for the time being is to remove those contigs from the BAM header and fix the BAM using Picard ReplaceSamHeader.
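
A rough sketch of that workaround (untested here; the grep pattern and file names are placeholders, and any reads mapped to the removed contigs would also need to be dropped for the result to stay valid):

samtools view -H alignment.bam | grep -v '_random' > new_header.sam
java -jar picard.jar ReplaceSamHeader I=alignment.bam HEADER=new_header.sam O=alignment.fixed.bam
samtools index alignment.fixed.bam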

vendig commented 5 years ago

@msahraeian this is the header of the dataset: https://drive.google.com/file/d/18qVYhX3qAisBhxbNTiFIKmQT7vaO1QxL/view

And the reference partially matching: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.12/

msahraeian commented 5 years ago

@vendig It is the hg18 reference. You can download the .fasta file from ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/working/20101201_cg_NA12878/Homo_sapiens_assembly18.fasta and index it with samtools faidx Homo_sapiens_assembly18.fasta. It has the same contigs in the same order as in your .bam.
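
As a quick sanity check after downloading (a sketch; substitute your own BAM file name), the following should print no differences if the contig names, lengths, and order all match:

samtools faidx Homo_sapiens_assembly18.fasta
diff <(cut -f1,2 Homo_sapiens_assembly18.fasta.fai) \
     <(samtools view -H your.bam | awk -F '\t' '$1=="@SQ"{sub("SN:","",$2); sub("LN:","",$3); print $2"\t"$3}')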