danielvo commented 10 years ago

Hello,

We have just installed bcbio-nextgen (0.7.8) and tried the RNAseq pipeline. Everything seems to have gone well, with two exceptions:

1) The STAR alignment step failed on the first try because there were no STAR-specific genome files available after the initial bcbio installation. Running STAR's "--runMode genomeGenerate" prior to restarting the alignment step solved the problem. Perhaps this is due to us forgetting something during the installation?

2) Count result files ("combined.counts", "annotated_combined.counts" and sample-wise "*-ready.counts") had all the counts equal to 0, even though the counts by Cufflinks and rnaseqc were meaningful. We tracked the problem down to featureCounts (1.4.4), which provided the following output in the debug log file (work/log/bcbio-nextgen-debug.log):

WARNING format issue in file '/scratch/SPH_RNA/bcbio/work/align/NET-27 ...				The required format is : SAM				The file format is unknown.				A wrong format may result in wrong results or crash the program.				Please refer to the manual for file format options.				If the file is in the correct format, please ignore this message.
Assign reads to features...
Total reads : 51757520
Successfully assigned reads : 0 (0.0%)
Running time : 0.61 minutes

Read assignment finished.

The command log (work/log/bcbio-nextgen-debug.log) shows that featureCounts was run on BAM file:

a) /scratch/SPH_RNA/bcbio/work/align/NET-27/1_070414_SPH_RNA_initial_run_star/NET-27.sorted.bam

In directory "/scratch/SPH_RNA/bcbio/work/align/NET-27/", there were 3 additional alignment files that looked useful for testing:

b) 1_070414_SPH_RNA_initial_runAligned.out.bam c) 1_070414_SPH_RNA_initial_runAligned.out.reheadered.bam d) 1_070414_SPH_RNA_initial_runAligned.out.sam

Interestingly, manually running featureCounts* on either of the BAM files leads to 0 successfully assigned reads, producing the same warning as we could find in the debug logs, while a test run on SAM file d) finished without any problems.

Comparing files b) and d) with diff\ produces no differences, therefore we concluded that the problem must be in the featureCounts program itself and its ability to take BAM format input.

I can provide more information if necessary. If this indeed is an issue with featureCounts, would there be any quick fix available?

With thanks and best regards, Daniel

Commands used: featureCounts*: /scratch/bcbio/usr/local/bin/featureCounts -a /mnt/bcbio/usr/local/share/bcbio-nextgen/genomes/Hsapiens/GRCh37/rnaseq/ref-transcripts.gtf -o /scratch/TESTS/out ${BAM_or_SAM}

diff**: diff <(samtools view -h ./1_070414_SPH_RNA_initial_runAligned.out.bam) <(samtools view -S -h ./1_070414_SPH_RNA_initial_runAligned.out.sam)

roryk commented 10 years ago

Hi Daniel,

Thank you for the awesome bug report and your work digging into the issue, it is really helpful. I'm sorry for the troubles. The STAR genome not being installed is my bad, we don't install the STAR genomes by default because they take forever to build and are too big to host for download. The goal was to have something that works out of the box and doesn't take too long to get going and let users add on bits as they need them-- unfortunately the example template for RNA-seq that we provide has STAR as the aligner by default!

After installation, if you want to use the STAR aligner you have to install the STAR indices with:

bcbio_nextgen.py upgrade --data --aligners STAR --genomes GRCh37

Could you blow away the STAR index you made and remake it with that?

I changed the default aligner in the illumina-rnaseq.yaml file to Tophat2 here: 88f3d3a. I like the STAR aligner better because it is fast but this way our goal of having something that works out of the box will be preserved.

For the second issue, I'm not sure why featureCounts is not working on those BAM files, that is strange. I can't seem to reproduce the issue-- I just tried it out on those two BAM files created by a test run and it seems to work fine for me with 1.4.4. I am at a loss.

Do you think you could install the STAR genome and give it another go and see if that somehow resolves the featureCounts issue magically as well?

Thanks again!

danielvo commented 10 years ago

Hello Rory,

Thank you for the quick reply! I did a quick check here, since your featureCounts 1.4.4. seems to work on BAM files. It seems that BAM files with a million or 10 million lines are OK, but 200 million lines or so lead to the strange warning phenomenon. I tried this on two different BAM files; featureCounts 1.4.4. worked fine on 1 million line files in both cases, while failing on complete files. Do you have a possibility to run a quick test on a big BAM? You will notice within a minute whether the fault occurs. Strangely, while trying a complete SAM file for one of the two big BAMs, things work fine.

danielvo commented 10 years ago

I suspect this is a BAM decompression issue. Maybe the problem is the size of our temp directory. I will investigate that and post my findings here.

roryk commented 10 years ago

Hi Daniel,

Good idea regarding the BAM file sizes. I tried featureCounts with a 300 million read BAM file with 1.4.4 and it seems to work okay. Thanks for your work figuring out what the issue is, let me know what you find out.

dbpzdbpz commented 10 years ago

Hi Danielvo,

FeatureCounts tests the file format by simply decompressing the first couple of bytes and looking for the BAM magic string ("BAM\1"). If it complains about the file format, it is very likely that the file is neither a BAM nor a SAM file.

I have just downloaded a BAM file generated by STAR, and featureCounts successfully assigned most of the reads in it.

Could you please use samtools to validate the file format like:

samtools view -h 1_070414_SPH_RNA_initial_runAligned.out.bam | head -n 1000

You should be able to see the first 1000 lines in the BAM file, and there should not be any warning messages reported by samtools.

Cheers,

Yang

danielvo commented 10 years ago

Hello Rory, hello Yang,

It seems we have solved the mystery. The problem is indeed related to large BAM files only (even well-formed ones) and will not occur with small ones. We ran "strace -e trace=file" on the featureCounts command and got an explanation. Here are vital excerpts from the combined featureCounts/strace ouput:

1) When run on a big SAM file: ... [ Process PID=4750 runs in 32 bit mode. ] ... open("/path/BIG.sam", O_RDONLY|O_LARGEFILE) = 3 ... open("/path/BIG.sam", O_RDONLY|O_LARGEFILE) = 3 ... open("/path/BIG.sam", O_RDONLY|O_LARGEFILE) = 3 ... || Successfully assigned reads : 139848773 (71.5%) ...

2) When run on a big BAM file: ... [ Process PID=4997 runs in 32 bit mode. ] ... open("/path/BIG.bam", O_RDONLY|O_LARGEFILE) = 3 ... open("/path/BIG.bam", O_RDONLY) = -1 EFBIG (File too large) ... || Successfully assigned reads : 0 (0.0%) ...

Apparently, a 32-bit version of featureCounts has been installed in our /path/bcbio/usr/local/bin directory (we are not clear over how, since all other binaries there are 64-bit). It seems to always open SAM files with the "O_LARGEFILE" flag and therefore everything is fine with them. When run on a BAM file, however, sometimes (in our case that was in 4 out of 12 open() instances during the course of a single BAM file analysis) the file is open without the flag, which leads to an error. This would also explain why small BAM files are not a problem. I assume this to be a bug.

We replaced the /path/bcbio/usr/local/bin/featureCounts binary with a 64-bit manually compiled file (version 1.4.4) and after a re-run of the whole RNAseq pipeline we have now proper values in the count files.

We wouldn't have dug this deep if you hadn't helped us out with testing - thank you for the effort!

Thank you also for the STAR indices installation command*. We haven't found it prior to our manual solution. In addition, our production server is now not connected to the Internet (therefore, an own command-line solution was the preferred way). However, we have now rebuilt the indices with the same options as those that are listed in https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/genomes.py - so I think there should be no problem with them. By the way, the STAR aligner is not listed in /path/bcbio_results/analysis_directory/programs.txt, and there seems to be no way to get its version through command-line help (which doesn't exist?).

Many thanks again, Daniel (on behalf of the local investigation team)

"--aligners star" :D

chapmanb commented 10 years ago

Daniel; Thank you for all this help digging into the issue. Apologies, it looks like we're the root cause of the problem since we install an i386 binary:

https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/custom/bio_nextgen.py#L63

Rory, do we do this on purpose or could we use the 64 bit compiled version (http://sourceforge.net/projects/subread/files/subread-1.4.4/)? We could also support MacOSX installs here since they have pre-built binaries for them.

roryk commented 10 years ago

Hi everyone,

Thanks for your work investigating this-- my bad, there was no reason to not be installing the 64-bit version; I swapped over to using that and also to use support OSX installs here: chapmanb/cloudbiolinux@45c9302f5. Sorry for the troubles Daniel, awesome detective work.

dbpzdbpz commented 10 years ago

Dear Danielvo,

Thank you for figuring it out! Indeed, featureCounts always opens files in the large mode. However, when featureCounts resorts reads by their names, it uses zlib to open the BAM file (BAM files are essentially segmented gzip files), and the gzopen function seems to be architecture-specific.

It is strongly recommended to use the 64-bit subread packages because most computers are now 64-bit. The 32-bit package has a lot of problems (for example, it creates a lot of index blocks because of the very limited virtural memory space); it may be used only if the computer is 32-bit.

Cheers,

Yang

bcbio / bcbio-nextgen

Issues with featureCounts #394

samtools view -h 1_070414_SPH_RNA_initial_runAligned.out.bam | head -n 1000