dieterich-lab / DCC

DCC uses output from the STAR read mapper to systematically detect back-splice junctions in next-generation sequencing data. DCC applies a series of filters and integrates data across replicate sets to arrive at a precise list of circRNA candidates.
https://dieterichlab.org/software/
GNU General Public License v3.0

BAM error #36

Closed tjakobi closed 5 years ago

tjakobi commented 7 years ago

The problem occurs when both -G and -B are specified:

Traceback (most recent call last):
  File "/home/sstrohbuecker/.local/bin/DCC", line 9, in
    load_entry_point('DCC==0.4.4', 'console_scripts', 'DCC')()
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 408, in main
  File "build/bdist.linux-x86_64/egg/DCC/main.py", line 675, in checkBAMsorting
  File "pysam/libcalignmentfile.pyx", line 351, in pysam.libcalignmentfile.AlignmentFile.cinit (pysam/libcalignmentfile.c:5200)
  File "pysam/libcalignmentfile.pyx", line 584, in pysam.libcalignmentfile.AlignmentFile._open (pysam/libcalignmentfile.c:7797)
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

MaxHills commented 5 years ago

I am using DCC version 0.4.7 and I have this exact issue. I have single-end stranded data. My command line and error output:
python DCC/DCC/main.py dcc_fileLists/ADAR_samplesheet -B dcc_fileLists/ADAR_BAM_fileList -an mm10/mm10.all.gtf -T 14 -M -Nr 2 1 -G -A mm10/mm10.ucsc.fa -R mm10/mm10.allRepeats.gtf
DCC 0.4.7 started
32 CPU cores available, using 14
Traceback (most recent call last):
  File "DCC/DCC/main.py", line 826, in
    main()
  File "DCC/DCC/main.py", line 427, in main
    unsortedBAMS = checkBAMsorting(bamfiles)
  File "DCC/DCC/main.py", line 706, in checkBAMsorting
    bamfile = pysam.AlignmentFile(file, "rb")
  File "pysam/libcalignmentfile.pyx", line 734, in pysam.libcalignmentfile.AlignmentFile.cinit
  File "pysam/libcalignmentfile.pyx", line 983, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False

My BAM file list:
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/ACTTGA/ACTTGA.Aligned.out.bam
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/CAGATC/CAGATC.Aligned.out.bam
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/CCGTCC/CCGTCC.Aligned.out.bam
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/GCCAAT/GCCAAT.Aligned.out.bam

My sample sheet:
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/ACTTGA/ACTTGA.Chimeric.out.junction
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/CAGATC/CAGATC.Chimeric.out.junction
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/CCGTCC/CCGTCC.Chimeric.out.junction
/l/Yu/YuLab/Bioinformatics/projects/mhh_CIRCexplorer2/align/ADAR/GCCAAT/GCCAAT.Chimeric.out.junction

The head of my first BAM file in my BAM list (using samtools view | head):
D00575:258:H35GGBCXY:1:1105:19038:86055 16 chr8 129234652 255 101M 0 0 GATCCTGCACTCACCATGACCTCCTTCGTAGCTTGCTTGAACTTTCTTCACAGCACTTCCCCTTCTTGAAGGTATCTGATAGCCTGTTACTGAACTTGGAG HIIHIIIIIIIIIIIHHHHIIHIIIIIIIIIIIIIIIIIIIIIIIIIIHHIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIDDDDD NH:i:1 HI:i:1 AS:i:99 nM:i:0
D00575:258:H35GGBCXY:1:1105:19163:86071 16 chr8 83932886 255 77M463N24M 0 0 GGTGTTCCCCCAAGAGTATCCCAGTGAGAACTCCATTCAGCTCTCCGCCAACACCATCAAGCAGAACAGCCGCAACGGTGTGGTGAAAGTTGTCTTCATTC IIIHGIIIHIIIIIHIIIIHGHHHIIIHHGIHIIIIIIIHGIIHIIIIHHHHEIIIIGIIIIGIHHHIIIIIHFIHIIIIIHIIIIIIIIIIIIHFDDDDD NH:i:1 HI:i:1 AS:i:101 nM:i:0
D00575:258:H35GGBCXY:1:1105:19027:86147 16 chr17 39845997 255 99M2S 0 0 AACCCACCACCCTGTGCTCCGCGCCCGGTGCGGTCGACGTTCCGGCTCTCCCGATGCCGAGGGGTTCGGGATTTGTGCCGGGGACGGAGGGGAGAGCGGGT GFIHGDHHFIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDDDDD NH:i:1 HI:i:1 AS:i:95 nM:i:1
D00575:258:H35GGBCXY:1:1105:19141:86150 0 chr1 155278727 255 101M 0 0 CCGCCACTCAGCTCACTACCAGAGAAAGAAGCTGACAATTCACAGGGCTCTGGATACACAGTACCACTGATTTTATTTGTACAAGAAATGACTGGTCACTG DDDDDIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIIH NH:i:1 HI:i:1 AS:i:99 nM:i:0

I see that this issue is resolved, but I have been unable to overcome this error and am unsure of how to proceed. I glanced at the fix for this and do not think it applies in my case. Any insight or help would be greatly appreciated.

tjakobi commented 5 years ago

The issue may be related to the fact that you are using single-end data; the old fix may only have worked for the paired-end mode I normally use. I will look into it.

tjakobi commented 5 years ago

Hi @MaxHills,

are the .bai indices available for all of the BAM files? I am surprised that a ValueError is thrown, as that exception should be handled correctly. Do you have the DCC log file? Do you see something like "BAM file XX has no index (XX.bai is missing)"? If not, what does file XX.bam show?

Cheers, Tobias
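
The index question above can also be checked mechanically. The following is a minimal sketch, assuming the indices follow one of the two common naming conventions (XX.bam.bai or XX.bai next to the BAM file). The helper name missing_bam_indices is hypothetical and not part of DCC:

```python
import os


def missing_bam_indices(bam_paths):
    """Return the BAM paths that have neither XX.bam.bai nor XX.bai beside them."""
    missing = []
    for bam in bam_paths:
        # Two common index naming conventions: sample.bam.bai and sample.bai
        candidates = (bam + ".bai", os.path.splitext(bam)[0] + ".bai")
        if not any(os.path.isfile(candidate) for candidate in candidates):
            missing.append(bam)
    return missing
```

An empty return value means every listed BAM file has an index in one of the two assumed locations.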

MaxHills commented 5 years ago

Hi @tjakobi, yes, the .bai files are available for all BAM files, in the same directory as the BAM files and the Chimeric.out.junction files. The DCC log file contains only two lines: the first says that DCC 0.4.7 started and the second records the command given on the command line. The only error messages available are those shown in my original post above.

If I use samtools view XX.bam | head the file appears as a normal sorted file in SAM format, as seen below:

samtools view Aligned.sortedByCoord.out.bam | head
D00575:258:H35GGBCXY:2:2215:9582:96683 256 chr1 3001362 1 70M31S 0 0 CTGTCTTTTTCCCTGAGGTGGGTTTCCTGTAAGCAACAAAATGTTGGGTCCTGTTTGTGTAGCCAGTCTAGATCGGAAGAGCACACGTCTGAACTCCAGTC DDDDDHIIIGHHIIIIIIGHHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIHHIIIIHIIIIIIIIIIIIIGIIIIHIIIIIIIII NH:i:4 HI:i:2 AS:i:68 nM:i:0
D00575:258:H35GGBCXY:2:1114:12175:20089 0 chr1 3006185 255 101M 0 0 CACGCCTGCTCAAAATGCAGAGTTGTGAAGCCCAGTTACAACTGATATACCTATAACACAAATTCTACACCTAAACCTTGAGGACTATTGTGGAAGAAGGG DDDCDHHIIHIIIIGIHHIIGGHHIFFHHHHHHHHIIIIIIIGIIIIIIHIIHIIIIIIGIIHHHHEHGHEHIIIIGIEHGHHHGIIIGIIHI?FHFHHHH NH:i:1 HI:i:1 AS:i:99 nM:i:0
D00575:259:H52C5BCXY:1:1213:20285:64486 0 chr1 3006185 255 101M 0 0 CACGCCTGCTCAAAATGCAGAGTTGTGAAGCCCAGTTACAACTGATATACCTATAACACAAATTCTACACCTAAACCTTGAGGACTATTGTGGAAGAAGGG DDDA@HHHHHHHHIIIHGIHHHGHHIHHGHFHHIHHIHIIIIGIGEHH@HGHIIIIIIIHIIIIIIIIGHIIIIEHHHHHIIEEH@GHHIHIIFHHHGHII NH:i:1 HI:i:1 AS:i:99 nM:i:0
D00575:259:H52C5BCXY:1:2109:5490:18367 0 chr1 3006185 255 101M 0 0 CACGCCTGCTCAAAATGCAGAGTTGTGAAGCCCAGTTACAACTGATATACCTATAACACAAATTCTACACCTAAACCTTGAGGACTATTGTGGAAGAAGGG DDDDDIIHIICGHHIIIIIIIIIHIIIIIIIIHIIIHIIIIIIHIIIIIIIIIIIGHIIIIIIIIIIIIIIIIIIIIIHIHCHHHIIIHHIIIIGIIIIII NH:i:1 HI:i:1 AS:i:99 nM:i:0

I wish I could provide you with a key to understanding my issue, but I am also perplexed.

Best regards, Max

tjakobi commented 5 years ago

Dear @MaxHills,

would it be possible to upload one of the BAM files (the header plus the first few hundred lines are probably enough) for further debugging? I have a suspicion but would need to run more tests.

Cheers, Tobias

MaxHills commented 5 years ago

Dear @tjakobi, I have uploaded a sam file (with '.txt' extension, so GitHub will accept it) with the header lines and a few hundred reads. dcc.test.txt

tjakobi commented 5 years ago

Hi @MaxHills,

let's try something: instead of

-B dcc_fileLists/ADAR_BAM_fileList

use

-B @dcc_fileLists/ADAR_BAM_fileList

in the DCC call.

Cheers, Tobias

MaxHills commented 5 years ago

Okay, I am no longer receiving the BAM error. Thank you.
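
A possible explanation of why the @ prefix makes the difference, as a minimal sketch. It assumes that DCC's option parser is built on Python's argparse with fromfile_prefix_chars='@', which this thread does not confirm. With that setting, argparse replaces an argument of the form @file with the lines of that file, one argument per line. Without the prefix, the list file itself is handed on as if it were a BAM file, which would be consistent with the pysam error above. The parser below is a toy stand-in, not DCC's actual parser, and the sample file names are hypothetical:

```python
import argparse
import os
import tempfile


def parse_bam_option(argv):
    """Parse -B with a toy parser that expands @file arguments."""
    parser = argparse.ArgumentParser(fromfile_prefix_chars="@")
    parser.add_argument("-B", dest="bam", nargs="+")
    return parser.parse_args(argv).bam


# Write a small list file with one (hypothetical) BAM path per line.
list_dir = tempfile.mkdtemp()
list_path = os.path.join(list_dir, "bam_fileList")
with open(list_path, "w") as handle:
    handle.write("sample1.bam\nsample2.bam\n")

without_prefix = parse_bam_option(["-B", list_path])
with_prefix = parse_bam_option(["-B", "@" + list_path])

print(without_prefix)  # the list file itself, treated as a single "BAM" path
print(with_prefix)     # the individual paths read from the list file
```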

tjakobi commented 5 years ago

Added a CLI check to make sure the argument to -B is either a binary BAM file or an ASCII multi-line file list prefixed with @.
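
A minimal sketch of what a check like this could look like, assuming it tells the two cases apart by the gzip magic bytes at the start of every BGZF-compressed BAM file. The function names classify_bam_argument and check_bam_arguments are hypothetical; this is not the code that was added to DCC:

```python
GZIP_MAGIC = b"\x1f\x8b"  # BAM files are BGZF-compressed, which is gzip-compatible


def classify_bam_argument(path):
    """Return 'bam' if the file starts with the gzip magic bytes, else 'list'."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    return "bam" if magic == GZIP_MAGIC else "list"


def check_bam_arguments(paths):
    """Raise ValueError if a plain-text file is passed where a BAM file is expected."""
    for path in paths:
        if classify_bam_argument(path) == "list":
            raise ValueError(
                "%s looks like a text file list, not a BAM file; "
                "pass it as -B @%s instead" % (path, path)
            )
```

Checking only two bytes keeps the test cheap. It cannot tell a BAM file from any other gzip-compressed file, so a real check would probably let pysam validate the content afterwards.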