biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0

Merging already merged BAM files #92

Closed: brettChapman closed this issue 7 years ago

brettChapman commented 10 years ago

Hi, I've been trying to merge over 100 BAM files.

First off, I aligned fastq files (both paired and singleton) to the genome using STAR, and generated my BAM files. The data came from a variety of different experiments from public repositories and collaborators. The genome I was working with is a very large hexaploid plant genome, so I needed to align to each sub-genome, identify matching contigs, and then index and align to a new target genome. This was because STAR could not index the entire genome, and alignment time would likely exceed the walltime.

The machine I've been running Sambamba on takes a very long time to merge that many BAM files and hits the walltime.

What I've done is break the problem down by merging the BAM files from the different experiments in a number of separate batches.

From around 100 BAM files, I now have about 11 BAM files.

I tried to merge these 11 BAM files and I now get this error: sambamba-merge: graph contains cycles

Is there a way around this? I tried indexing the BAM files with sambamba and then merging, but the problem remains. I'm about to try sorting the BAM files with sambamba and then merging them again. I had already sorted the BAM files by coordinate with samtools before the first merge.
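
A minimal sketch of that sort-then-merge route, assuming a per-batch directory layout (all names below are made up). If the cycle error comes from conflicting header orders across inputs, which isn't confirmed here, re-sorting everything against one ordering may help:

    # all paths here are hypothetical
    mkdir -p sorted
    for bam in batch1/*.bam; do
        # coordinate-sort each input; -t sets the thread count
        sambamba sort -t 8 -o "sorted/$(basename "$bam")" "$bam"
    done
    # sambamba merge takes the output file first, then the inputs
    sambamba merge -t 8 batch1_merged.bam sorted/*.bam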

Any help with this would be much appreciated.

Thanks.

Regards

Brett

brettChapman commented 10 years ago

That version you downloaded is outdated and was used for proof of concept; it has advanced a lot since then. I was given access to the commercial version of the proteogenomics pipeline (I was unofficially its beta-tester for a while and helped with debugging and adding features), which now includes the code to build the fasta splice graph. The tool now goes straight from a SAM file to a fasta splice graph, with an automated intermediate step that generates a splice info file.

I was considering, as you mentioned, converting at least those 11 BAM files, concatenating them, and then removing redundancy. I'm concerned, though, that there may be spliced regions which appear once in one BAM file but would appear twice after merging; those would be missed by such an approach. But it's better than nothing. If all else fails, I'll likely do that.

brettChapman commented 10 years ago

I've come up with a better idea. From the 11 BAM files, I can slice out each chromosome (chromosomes are denoted e.g. 1ALXXXX). Can slice accept wildcards, like 1AL*? I could then generate smaller BAM files from all the files and merge them, producing one BAM file per chromosome. I could then run my splice graph tool on each per-chromosome BAM file to generate splice regions for each chromosome. I imagine the @SQ headers would be much smaller chromosome by chromosome, potentially allowing me to run sambamba with GC enabled.
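
As far as I know, slice takes explicit regions rather than wildcards, but a wildcard can be emulated by pulling matching reference names out of the header first. A sketch, with the input name and prefix invented, and assuming slice writes BAM to stdout:

    in=merged_batch.bam   # hypothetical input
    prefix=1AL            # hypothetical chromosome-name prefix
    sambamba index "$in"  # slice requires an indexed BAM
    # list reference names from the @SQ header lines that match the prefix
    samtools view -H "$in" |
        awk -v p="$prefix" '$1 == "@SQ" { sub(/^SN:/, "", $2); if (index($2, p) == 1) print $2 }' |
        while read -r chrom; do
            # a bare reference name selects the whole chromosome
            sambamba slice "$in" "$chrom" > "${chrom}.bam"
        done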

lomereiter commented 10 years ago

Does the updated tool have any documentation? I'd like to see it so as to understand what other options are available.

brettChapman commented 10 years ago

This is the usage from the tool:

[bchapman@scratchy ~]$ java -jar ~/bin/ProteogenomicScripts/Enosi_1.0/BuildSpliceGraph.jar
BuildSpliceGraph.jar version 0.4.10
Usage: java -jar BuildSpliceGraph.jar
  -r [FILE]  SAM/tsv file name or file list
  -d [FILE]  Genome file in FASTA format
  -o [FILE]  Output FASTA file name
  -y [NUM]   Max peptide length in flattening
  (-s [NUM]  Stage to start at. (0: SAM to SpliceInfo, 1: Prepare Genome, 2: Splice Info to MS2DB))
  (-g [FILE] Splice Info file if starting from stage 1 or later)

ERROR:Missing command line argument: 'Must specify a SAM/tsv file, a genome file, an output file name, and a maximum peptide length'

brettChapman commented 10 years ago

Here is what the documentation says about generating a splice database:

Constructing the Splice Database

While previous versions of Enosi used a 2-step process, creating an intermediate structure called a splice graph before creating the FASTA format spliced database, the newer version uses only a single script.

java -Xmx10G -jar BuildSpliceGraph.jar -r [input_file_name or file_list] 
                                       -d [genome_seq_file] 
                                       -o [output_file_name] 
                                       -y [maximum_peptide_length]              

input_file_name - The SAM or VCF file containing the mapped RNA-Seq reads or variant calls. Alternatively, you can specify a list file containing a SAM or VCF file per line.
genome_seq_file - The FASTA format genome sequence file.
output_file_name - The output file name. We typically use the extension '.ms2db'.
maximum_peptide_length - The maximum expected length of peptides to be found by MS/MS (we recommend a value of 30). The database will be more compact with a smaller value.

The FASTA file produced has the splicing information encoded in the header.

   >Splice@SeqName@GeneNum@FileIndex@Strand$Type/Start1/End1$Type/Start2/End2

The indicator 'Splice' at the beginning of the entry name indicates that this entry is from the RNA-Seq data. SeqName is the name of the sequence that was translated in the genome sequence file (e.g. chr1). We reserve the characters '@' and ';' and sequence names should not include these characters. The gene number corresponds to the connected component index in the intermediate splice graph. The file index is the index of this entry in the file. The strand is either '1' for the forward strand or '0' for the reverse strand. Since the sequence in each entry may include splice junctions, several ranges are specified. Each range corresponds to a continuous nucleotide sequence. There are 3 types of ranges:

Standard - This is a 'normal' translated range. The start coordinate and the end coordinate both correspond to genomic coordinates (0-based number, left inclusive). The start coordinate is always lower than the end coordinate. The sequence is exactly the sequence for this range of coordinates in the genome.
Mutation - This indicates a mutation. These are always of length 1. The start coordinate and end coordinate correspond to genomic coordinates but the sequence for this region is not the same as in the reference genome.
Insertion - This indicates an insertion. The start coordinate is the first nucleotide that is displaced by the insertion. Since the inserted sequence has no analog on the genome, the end coordinate is -1 * the length of the insertion (e.g. -5 indicates an insertion of length 5).
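
For what it's worth, a small shell sketch of splitting such a header into its fields using the separators described above (the header value itself is invented):

    # invented example header, following the format above
    hdr='Splice@chr1@12@3@1$Standard/100/250$Standard/400/520'
    IFS='@' read -r tag seqname genenum fileindex rest <<< "$hdr"
    strand=${rest%%\$*}   # portion before the first '$'
    ranges=${rest#*\$}    # everything after it
    echo "seq=$seqname gene=$genenum index=$fileindex strand=$strand"
    IFS='$' read -ra parts <<< "$ranges"
    for r in "${parts[@]}"; do
        IFS='/' read -r type start end <<< "$r"
        echo "  range: type=$type start=$start end=$end"
    done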

lomereiter commented 10 years ago

So, the -r option allows providing a file list instead of a single file; why not use that?

brettChapman commented 10 years ago

Generally -y is 30; larger values generate larger databases. Since the proteomics search in the Enosi pipeline only allows peptides up to 30 aa, anything longer isn't necessary, and it would also inflate the database size, impacting the sensitivity of the search.

brettChapman commented 10 years ago

Ok, I'll give that a shot. The previous version didn't allow a list of BAM files; I completely overlooked that. I'll see how it goes, and whether it generates a single fasta file or multiple fasta files from a list of BAM files. Thanks.
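
One possible shape for that test, assuming the tool wants SAM input as its usage suggests (every file name below is invented):

    # convert two BAMs to SAM; sambamba view emits SAM, -h keeps the header
    for bam in batch1.bam batch2.bam; do
        sambamba view -h "$bam" > "${bam%.bam}.sam"
    done
    # -r accepts a list file with one input path per line
    ls "$PWD"/batch?.sam > sam_list.txt
    java -Xmx10G -jar BuildSpliceGraph.jar \
        -r sam_list.txt \
        -d wheat_genome.fasta \
        -o wheat_splices.ms2db \
        -y 30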

brettChapman commented 10 years ago

I'll start with 2 BAM files and see if it generates only 1 fasta file from the 2 BAM files.

brettChapman commented 9 years ago

I've decided to split my BAM files into batches by experiment. I now have 15 merged BAM files which I will provide to the splice graph tool. First, though, I've attempted to remove PCR duplicates, to remove anything that might give a false impression of the number of spliced reads. I tried the markdup function of sambamba and I get this error: sambamba-markdup: More than 16383 reference sequences are unsupported

I've also tried picard tools, and I get this error:

java -jar /opt/picard-tools/1.99/MarkDuplicates.jar INPUT=Wheat_RNAseq.sorted_batch1_grain_non-oriented.bam OUTPUT=Wheat_RNAseq.sorted_batch1_grain_non-oriented_rmDups.bam METRICS_FILE=Wheat_RNAseq.sorted_batch1_grain_non-oriented_metricDups.txt VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true TMP_DIR=picardTemp

[Fri Sep 12 09:41:50 WST 2014] net.sf.picard.sam.MarkDuplicates INPUT=[Wheat_RNAseq.sorted_batch1_grain_non-oriented.bam] OUTPUT=Wheat_RNAseq.sorted_batch1_grain_non-oriented_rmDups.bam METRICS_FILE=Wheat_RNAseq.sorted_batch1_grain_non-oriented_metricDups.txt REMOVE_DUPLICATES=true TMP_DIR=[picardTemp] VALIDATION_STRINGENCY=SILENT PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Fri Sep 12 09:41:50 WST 2014] Executing as bchapman@scratchy on Linux 2.6.32-431.11.2.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_45-b18; Picard version: 1.99(1563)
INFO 2014-09-12 09:41:50 MarkDuplicates Start of doWork freeMemory: 563917240; totalMemory: 567803904; maxMemory: 8422162432
INFO 2014-09-12 09:41:50 MarkDuplicates Reading input file and constructing read end information.
INFO 2014-09-12 09:41:50 MarkDuplicates Will retain up to 33421279 data points before spilling to disk.
INFO 2014-09-12 09:43:07 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:49s. Time for last 1,000,000: 49s. Last read position: 1AL3418167:3,972
INFO 2014-09-12 09:43:07 MarkDuplicates Tracking 829652 as yet unmatched pairs. 0 records in RAM.
INFO 2014-09-12 09:43:59 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:01:42s. Time for last 1,000,000: 52s. Last read position: 1AL3873924:5,943
INFO 2014-09-12 09:43:59 MarkDuplicates Tracking 1708637 as yet unmatched pairs. 0 records in RAM.
INFO 2014-09-12 09:44:49 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:02:31s. Time for last 1,000,000: 49s. Last read position: 1AL3883298:2,280
INFO 2014-09-12 09:44:49 MarkDuplicates Tracking 2580589 as yet unmatched pairs. 0 records in RAM.
[Fri Sep 12 09:44:59 WST 2014] net.sf.picard.sam.MarkDuplicates done. Elapsed time: 3.16 minutes.
Runtime.totalMemory()=4179099648
To get help, see http://picard.sourceforge.net/index.shtml#GettingHelp
Exception in thread "main" net.sf.picard.PicardException: /scratch/share/data/Wheat_IWGSC_RNA-seq_alignments/INRA_PE_alignments/SAM_files/complete/Wheat_BAM_files/picardTemp/CSPI.8709064633600382632.tmp/5526.tmp not found
    at net.sf.picard.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:63)
    at net.sf.picard.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:45)
    at net.sf.picard.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
    at net.sf.picard.sam.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:172)
    at net.sf.picard.sam.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:156)
    at net.sf.picard.sam.DiskReadEndsMap.put(DiskReadEndsMap.java:65)
    at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:423)
    at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:161)
    at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177)
    at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:145)
Caused by: java.io.FileNotFoundException: /scratch/share/data/Wheat_IWGSC_RNA-seq_alignments/INRA_PE_alignments/SAM_files/complete/Wheat_BAM_files/picardTemp/CSPI.8709064633600382632.tmp/5526.tmp (Too many open files)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(Unknown Source)
    at net.sf.picard.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:61)
    ... 9 more

Is there a workaround for this? If not, would going back to the 149 smaller BAM files and removing duplicates from each before merging work?
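
The 'Too many open files' at the bottom of the trace points at the per-process file-handle limit, so two knobs may be worth trying before falling back to the 149 files; this sketch uses the MAX_FILE_HANDLES_FOR_READ_ENDS_MAP setting already visible in the log (default 8000):

    ulimit -n         # show the current open-file limit for this shell
    ulimit -n 16384   # raise it, if the system allows
    # keep Picard's file-handle cache well below the shell limit
    java -jar /opt/picard-tools/1.99/MarkDuplicates.jar \
        INPUT=Wheat_RNAseq.sorted_batch1_grain_non-oriented.bam \
        OUTPUT=Wheat_RNAseq.sorted_batch1_grain_non-oriented_rmDups.bam \
        METRICS_FILE=Wheat_RNAseq.sorted_batch1_grain_non-oriented_metricDups.txt \
        VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true TMP_DIR=picardTemp \
        MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000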

brettChapman commented 9 years ago

I'm removing duplicates using picard tools on each of the original 149 BAM files. It seems to be running ok. I assume it's equivalent to running it on each of the 15 merged BAM files: since it only identifies PCR and optical duplicates, these are unlikely to appear across BAM files, only within each original RNA-seq run?
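
A minimal sketch of that per-file pass (directory names are made up):

    mkdir -p dedup
    for bam in original_bams/*.bam; do
        out="dedup/$(basename "${bam%.bam}")_rmDups.bam"
        java -jar /opt/picard-tools/1.99/MarkDuplicates.jar \
            INPUT="$bam" OUTPUT="$out" \
            METRICS_FILE="${out%.bam}_metrics.txt" \
            VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true TMP_DIR=picardTemp
    done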

pjotrp commented 9 years ago

Hi Brett, that's more a question for Biostar. The rmdup routines in all tools are pretty brain dead and, in general, remove real information. I would not recommend rmdup across different samples. Within a sample it is often a good idea, but it will remove signal. I have ideas for a better approach, but no time to implement them. For Illumina, rmdup is mostly the default because the read depth is so high.

brettChapman commented 9 years ago

Thanks. I've looked on Biostar; there seems to be no real consensus on what to do. It's either leave the duplicates in, which may cause problems later, or remove them, potentially discarding real information.

For other RNA-seq datasets I've analysed against other genomes, I removed duplicates after merging the BAM files, and I'd rather not go back and redo all that work. As I'm only looking at spliced reads, the chance of a duplicate across a splice junction among all the reads in my data is reasonably slim. It's also probably more prudent to remove any false positives than to leave them in, as I'm generating new gene models from this data. And I'm using Illumina, so, as you say, there is a lot of read depth. I guess this could be something to cover in my discussion; it may well be brought up by the reviewers and my thesis examiners.

brettChapman commented 9 years ago

They recently improved the memory efficiency on Zythos, so I was allowed to run the sambamba_latest script (with GC disabled) again. There's been quite a long queue on Zythos, but eventually I ran it with all 149 BAM files. I first removed duplicates from each BAM file, because I knew I wouldn't be able to remove duplicates from the final merged file. Sambamba merge then ran with no problems, and I now have a 218 GB BAM file. I was generating 15 different merged BAM files to run with my splice graph tool, but now I might as well use the single BAM file. I'll likely test with the 15 afterwards as well, to see how it performs and whether I end up with exactly the same resulting splice peptide file. Thanks for all your help with this.
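
Assuming the deduplicated files all sit in one directory, that final step reduces to a single merge (paths invented; the output file comes first, then the inputs):

    sambamba merge -t 16 Wheat_RNAseq.merged_all.bam dedup/*.bam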

pjotrp commented 9 years ago

Thanks for the update Brett :). You are a power user of sambamba for sure.