alexdobin / STAR

RNA-seq aligner
MIT License
1.84k stars 505 forks source link

Some sequences are empty when using alignments in MATLAB #1767

Open craneab opened 1 year ago

craneab commented 1 year ago

Hi,

I'm having an issue where sometimes the aligned DNA sequences are empty.

Background

I'm using STAR version 2.7.10b running in the Windows Subsystem for Linux using Ubuntu 22.04.1 LTS, with 128GB of RAM and a 16-core i9-12900KF.

I had three drosophila genomes that I had sequenced. I am mapping the fastq files from sequencing using the following commands: ./STAR \ --runThreadN 16 \ --genomeDir genomes2 \ --readFilesIn 'Ib fastq'/221213Lit_D23-1501_1_sequence.fastq 'Ib fastq'/221213Lit_D23-1501_2_sequence.fastq \ --sjdbGTFfile genomes2/Drosophila_melanogaster.BDGP6.32.108.gtf --outSAMtype BAM SortedByCoordinate

The resulting BAM file I am renaming 'Ib.bam' and loading into MATLAB using the BioMap command BioMap('Ib.bam'). BioMap is a built-in class in the MATLAB bioinformatics toolbox which has useful functions I can use to retrieve sequence information. I want to extract the alignment for single genes at a time, which I am doing using alignment = getcompactalignment('Ib.bam', startposition, endposition, chromosome).

This gives me the DNA sequences aligned like this between the start and end positions in the given chromosome: ACTGTGACGT--- --TGTGACGTAA- ---GGGACGTAAG

My goal is to find the percent of nucleotides expressed at each position (in the example above, the third position is 100% Ts, the fourth is 100%G, the fifth is 66% T and 33% G).

The problem

Sometimes MATLAB throws an error when trying to extract alignments at certain positions in the genome. Confusingly, it only happens at some genomic locations but is fine at others within the same aligned genome. More confusingly, between the three genomes that I aligned, the problem does not necessarily happen at all three in the same place - ie, genome 1 will extract correctly but genome 2 and 3 will not for a given gene. I have done some troubleshooting and I found out that sometimes alignments seem to be empty, and this is causing the problem.

The command getcompactalignment() calls another built-in function, cigar2gappedsequence(), which uses the cigar strings to align the sequences for output. I can pause the program while inside the cigar2gappedsequence function and I see that there is a seqs variable containing all the sequence values (ACTG etc) and a cigars variable containing the corresponding cigar values (40M, etc). When I am trying to align a certain gene, both of these are [~30,000 x 1] cell arrays, and inside of each {cell} is a character array with the actual values: seqs: {'ACTG'} {'CTGT'} {'GGTA'} ... to ~30,000.

Here's the critical bit: when the program can successfully extract the alignments, all 30,000 rows are filled with sequence values. When MATLAB throws an error, I can see that only about the first 600 rows have sequences, after which the remaining rows are empty doubles. MATLAB is throwing an error because it is expecting characters, and instead part of the sequences are these empty doubles. I think that MATLAB is recognizing that there are 30,000 sequences (and thus creating the correctly sized array), but then when it loads for some reason they are empty. seqs: {'ACTG'} {'CTGT'} {'GGTA'} ... to ~600 {[ ]} {[ ]} {[ ]} ... to ~30,000.

Can you think of any reason this would occur? I've used getcompactalignment() quite a bit and it's always worked before, though previously I was using sequences aligned by someone else using bowtie.

alexdobin commented 1 year ago

Hi @craneab I do not have experience with the MATLAB package you are using, unfortunately.