NVlabs / nvbio

NVBIO is a library of reusable components designed to accelerate bioinformatics applications using CUDA.
BSD 3-Clause "New" or "Revised" License

Issues with Paired-end datasets #28

Open ps-account opened 5 years ago

ps-account commented 5 years ago

Running nvBowtie (including vmiheer's latest version) on a paired-end dataset leads to a crash. gdb seems to indicate that the "opposite alignment kernel" is where things go wrong:

info    : [0] aligning reads [168820736, 169869311]
verbose : [0]   1048576 reads
verbose : [0]   209.715 M bps (300.0 MB)
verbose : [0]   100.0 bps/read (min: 100, max: 100)
verbose : [0]   26.7 K reads/s
info    : [0] aligning reads [169869312, 170758330]
verbose : [0]   889019 reads
verbose : [0]   177.764 M bps (254.3 MB)
verbose : [0]   100.0 bps/read (min: 100, max: 100)
error   : opposite alignment kernel: an illegal memory access was encountered
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  device free failed: an illegal memory access was encountered

Thread 17 "nvBowtie" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fff80d61700 (LWP 25577)]
0x00007ffff693c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.

ps-account commented 5 years ago

Here is the backtrace. It might be just a paired-end issue.

#0  0x00007ffff693c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff693e02a in __GI_abort () at abort.c:89
#2  0x00007ffff74ae8f7 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff74b4a46 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff74b3aa9 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff74b4458 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff6ce1573 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7  0x00007ffff6ce1ad1 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8  0x00007ffff74b4ca7 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x0000000000695e7b in thrust::cuda_cub::throw_on_error (status=cudaErrorIllegalAddress,
    msg=0x9f1f37 "device free failed") at /home/bla/local/cuda/cuda-10.0/include/thrust/system/cuda/detail/util.h:194
#10 0x00000000006a2196 in thrust::cuda_cub::free<thrust::cuda_cub::tag, thrust::device_ptr<void> > (ptr=...)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/system/cuda/detail/malloc_and_free.h:87
#11 0x00000000006a02c1 in thrust::free<thrust::cuda_cub::tag, thrust::device_ptr<void> > (exec=..., ptr=...)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/malloc_and_free.h:78
#12 0x000000000069f3d2 in thrust::device_free (ptr=...)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/device_free.inl:40
#13 0x000000000072b940 in thrust::device_malloc_allocator<unsigned int>::deallocate (this=0x7fff80d60ab0, p=...,
    cnt=1572862) at /home/bla/local/cuda/cuda-10.0/include/thrust/device_malloc_allocator.h:148
#14 0x0000000000728eda in thrust::detail::allocator_traits<thrust::device_malloc_allocator<unsigned int> >::deallocate(thrust::device_malloc_allocator<unsigned int>&, thrust::device_ptr<unsigned int>, unsigned long)::workaround_warnings::deallocate(thrust::device_malloc_allocator<unsigned int>&, thrust::device_ptr<unsigned int>, unsigned long) (a=..., p=..., n=1572862)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/allocator/allocator_traits.inl:257
#15 0x0000000000728f07 in thrust::detail::allocator_traits<thrust::device_malloc_allocator<unsigned int> >::deallocate (
    a=..., p=..., n=1572862) at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/allocator/allocator_traits.inl:261
#16 0x000000000072628c in thrust::detail::contiguous_storage<unsigned int, thrust::device_malloc_allocator<unsigned int> >::deallocate (this=0x7fff80d60ab0) at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/contiguous_storage.inl:190
#17 0x0000000000725ee8 in thrust::detail::contiguous_storage<unsigned int, thrust::device_malloc_allocator<unsigned int> >::~contiguous_storage (this=0x7fff80d60ab0, __in_chrg=<optimized out>)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/contiguous_storage.inl:64
#18 0x0000000000770fe8 in thrust::detail::vector_base<unsigned int, thrust::device_malloc_allocator<unsigned int> >::~vector_base (this=0x7fff80d60ab0, __in_chrg=<optimized out>)
    at /home/bla/local/cuda/cuda-10.0/include/thrust/detail/vector_base.inl:497
#19 0x00000000007701aa in thrust::device_vector<unsigned int, thrust::device_malloc_allocator<unsigned int> >::~device_vector (this=0x7fff80d60ab0, __in_chrg=<optimized out>) at /home/bla/local/cuda/cuda-10.0/include/thrust/device_vector.h:78
#20 0x0000000000770854 in nvbio::vector<nvbio::device_tag, unsigned int>::~vector (this=0x7fff80d60ab0,
    __in_chrg=<optimized out>) at /home/bla/local/nvBowtie-cuda10/nvbio/nvbio/basic/vector.h:113
#21 0x0000000000774d90 in nvbio::io::SequenceDataStorage<nvbio::device_tag>::~SequenceDataStorage (this=0x7fff80d60a00,
    __in_chrg=<optimized out>) at /home/bla/local/nvBowtie-cuda10/nvbio/nvbio/io/sequence/sequence.h:436
#22 0x0000000000768221 in nvbio::bowtie2::cuda::ComputeThreadPE::do_run (this=0x37aee80)
    at /home/bla/local/nvBowtie-cuda10/nvbio/nvBowtie/bowtie2/cuda/compute_thread.cu:597
#23 0x00000000007682b5 in nvbio::bowtie2::cuda::ComputeThreadPE::run (this=0x37aee80)
    at /home/bla/local/nvBowtie-cuda10/nvbio/nvBowtie/bowtie2/cuda/compute_thread.cu:693
#24 0x0000000000678930 in nvbio::Thread<nvbio::bowtie2::cuda::ComputeThreadPE>::execute (arg=0x37aee80)
    at /home/bla/local/nvBowtie-cuda10/nvbio/nvbio/basic/threads.h:116
#25 0x00007ffff7bc16ba in start_thread (arg=0x7fff80d61700) at pthread_create.c:333
#26 0x00007ffff6a0e41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

ps-account commented 5 years ago

Now with cuda-gdb. Another odd thing is that this issue happens on Pascal but not on Maxwell. The fault seems to be raised in finish_alignment_kernel:


info    : [0] aligning reads [168820736, 169869311]
verbose : [0]   1048576 reads
verbose : [0]   209.715 M bps (300.0 MB)
verbose : [0]   100.0 bps/read (min: 100, max: 100)
verbose : [0]   26.8 K reads/s
info    : [0] aligning reads [169869312, 170758330]
verbose : [0]   889019 reads
verbose : [0]   177.764 M bps (254.3 MB)
verbose : [0]   100.0 bps/read (min: 100, max: 100)

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x562ebd0

Thread 17 "nvBowtie" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 156, grid 433544, block (5731,0,0), thread (78,0,0), device 0, sm 18, warp 40, lane 14]
0x000000000562ebf0 in nvbio::bowtie2::cuda::detail::finish_alignment_kernel<nvbio::bowtie2::cuda::detail::BestTracebackStream<0u, nvbio::aln::GotohAligner<(nvbio::aln::AlignmentType)1, nvbio::bowtie2::cuda::SmithWatermanScoringScheme<nvbio::bowtie2::cuda::QualCost<int>, nvbio::bowtie2::cuda::ConstantCost<int> >, nvbio::aln::PatternBlockingTag>, nvbio::bowtie2::cuda::TracebackPipelineState<nvbio::bowtie2::cuda::SmithWatermanScoringScheme<nvbio::bowtie2::cuda::QualCost<int>, nvbio::bowtie2::cuda::ConstantCost<int> > > >, nvbio::bowtie2::cuda::SmithWatermanScoringScheme<nvbio::bowtie2::cuda::QualCost<int>, nvbio::bowtie2::cuda::ConstantCost<int> >, nvbio::bowtie2::cuda::TracebackPipelineState<nvbio::bowtie2::cuda::SmithWatermanScoringScheme<nvbio::bowtie2::cuda::QualCost<int>, nvbio::bowtie2::cuda::ConstantCost<int> > > ><<<(5734,1,1),(96,1,1)>>> ()
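As a quick sanity check on the cuda-gdb focus line above, the flat index of the faulting thread can be worked out from the launch configuration. This is only a sketch: `global_thread_index` is a hypothetical helper, and it assumes the kernel uses the common one-dimensional `blockIdx.x * blockDim.x + threadIdx.x` indexing, which the log does not confirm.

```shell
# Hypothetical helper: flat index of a thread in a 1-D launch,
# assuming the usual blockIdx.x * blockDim.x + threadIdx.x scheme.
global_thread_index() {
    # $1 = block index, $2 = threads per block, $3 = thread index within the block
    echo $(( $1 * $2 + $3 ))
}

# Values copied from the cuda-gdb output: block (5731,0,0), thread (78,0,0),
# launch configuration <<<(5734,1,1),(96,1,1)>>>.
faulting=$(global_thread_index 5731 96 78)
launched=$(( 5734 * 96 ))
echo "faulting thread: $faulting of $launched launched"
```

With the values from the log this prints `faulting thread: 550254 of 550464 launched`. The faulting thread sits in one of the last few blocks of the grid, which may hint at a problem near the end of the work queue, although the log alone does not prove that.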

ps-account commented 5 years ago

How to reproduce: the steps below produce a truncated SAM file from a small paired-end dataset, assuming you have nvbio installed.

# get arabidopsis from e.g. illumina igenome
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Arabidopsis_thaliana/Ensembl/TAIR10/Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz
# unpack
tar -zxvf Arabidopsis_thaliana_Ensembl_TAIR10.tar.gz
cd Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/
# create index
nvBWT -d 1 genome.fa genome-index

cd 
# if you don't have it, download sra toolkit from https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/
~/sratoolkit.2.9.6-1-ubuntu64/bin/prefetch -v ERX3219973
~/sratoolkit.2.9.6-1-ubuntu64/bin/fastq-dump --outdir . --split-files $HOME/ncbi/public/sra/ERX3219973.sra

# now run nvBowtie
nvBowtie -x $HOME/Arabidopsis_thaliana/Ensembl/TAIR10/Sequence/WholeGenomeFasta/genome-index  -1 ERX3219973_1.fastq -2 ERX3219973_2.fastq -S ERX3219973.bam

# make sure you have samtools installed, then run 
samtools view ERX3219973.bam | tail -n1
[main_samview] truncated file.
ERX3219973.91 ST-J00101:86:HMYKLBBXX:7:1103:9709:41950 length=150       4       *       0       0       *       *       0   0AACCGGTGAGACTTCCAATGATTGATTCAAATTAACTTCGAAGCTTCCATTTGTTCTTCACTTTGCTGACTGTGTTTATTGTTGGTTACAGGAAGGCAAGGACAATGTTAGAGTCATAGGTATTTTTCTTGACTTGTCTCAGATAAAGGG       AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
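To get a rough idea of how much output is missing, the number of records written can be compared with the number of reads in the input. The sketch below is an illustration, not part of the original report: `fastq_read_count` is a hypothetical helper that assumes the standard four-lines-per-record FASTQ layout, and the expected total assumes one output record per mate.

```shell
# Hypothetical helper: number of reads in a FASTQ file,
# assuming 4 lines per record (no wrapped sequence lines).
fastq_read_count() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Only runs when the files from the steps above are present.
if [ -f ERX3219973_1.fastq ] && [ -f ERX3219973.bam ]; then
    # Assumption: paired-end output holds one record per mate.
    expected=$(( 2 * $(fastq_read_count ERX3219973_1.fastq) ))
    # samtools view prints alignment records only (no header) by default.
    written=$(samtools view ERX3219973.bam 2>/dev/null | wc -l)
    echo "expected about $expected records, found $written"
fi
```

A large gap between the two numbers would be consistent with the run aborting partway through, as the `[main_samview] truncated file.` message suggests.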

ps-account commented 5 years ago

I also get an error at the start of nvBowtie. Could that be related?

info    : nvBowtie... started
verbose :   cuda devices : 1
verbose :   device 0 has compute capability 6.1
verbose :     SM count          : 20
verbose :     SM clock rate     : 1733 Mhz
verbose :     memory clock rate : 4.5 Ghz
verbose :   chosen device 0
verbose :     device name        : Quadro P5000
verbose :     compute capability : 6.1
visible : mapping reference index... started
info    :   file: "genome-index"
info    : SequenceDataMMAP: error mapping file "/nvbio.genome-index.seq_info" (2)!
visible : mapping reference index... failed
visible : loading reference index... started
info    :   file: "genome-index"
visible : loading reference index... done
visible : FMIndexData: loading... started
visible :   genome : genome-index
info    : reading bwt... started
info    : reading bwt... done
verbose :   length: 119667750
info    : building occurrence table... started

teepean commented 3 years ago

I have experienced the same problem. Running strace shows the following error:

openat(AT_FDCWD, "/dev/shm/nvbio.hs37d5-index.seq_info", O_RDONLY|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)

EDIT: This error occurs when the shared-memory server (nvFM-server) is not running. In my case I ran:

./nvFM-server hs37d5-index hs37d5 &
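Whether the index is visible in shared memory can be checked before starting nvBowtie. The following is a minimal sketch: `nvbio_index_mapped` is a hypothetical helper, the `/dev/shm/nvbio.<name>.seq_info` pattern is taken from the strace line above, and the name to pass (here `hs37d5-index`) is an assumption that depends on how the server and nvBowtie were invoked.

```shell
# Hypothetical helper: succeeds if a shared-memory index named $1 is visible.
# $2 optionally overrides the directory (defaults to /dev/shm).
nvbio_index_mapped() {
    [ -e "${2:-/dev/shm}/nvbio.$1.seq_info" ]
}

if nvbio_index_mapped hs37d5-index; then
    echo "index is mapped in shared memory"
else
    echo "index is not mapped in shared memory"
fi
```

Note that in the nvBowtie log earlier in the thread, the failed mapping is followed by "loading reference index... done", so the mapping error by itself does not appear to stop the run.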