Read IDs changed by STAR in fastq output are not compatible with some downstream tools

calizilla commented 5 months ago

When using the unmapped reads from STAR, some downstream tools throw an error due to the changes STAR makes to the read name.

Eg read names emit by STAR:

$ head -1 PHHST1_2_Unmapped.out.mate1 
@A00609:237:HWHLTDSXY:2:1110:26295:21762 0:N:  00
$ head -1 PHHST1_2_Unmapped.out.mate2 
@A00609:237:HWHLTDSXY:2:1110:26295:21762 1:N:  00

Compared to original read names:

$  zcat ../../Fastq_raw/PHHST1_1.fq.gz | head -1
@A00609:237:HWHLTDSXY:2:1101:1244:1016 1:N:0:GCTCGATT+GATCTACC
$ zcat ../../Fastq_raw/PHHST1_2.fq.gz | head -1
@A00609:237:HWHLTDSXY:2:1101:1244:1016 2:N:0:GCTCGATT+GATCTACC

I appreciate the 00/01/10 addition is a useful thing, but I don't understand why STAR has changed R1 reads from 1:N to 0:N and changed R2 reads from 2:N to 1:N. I have checked the STAR documentation for flags that allow me to retain the original Illumina read pair designation, but cannot find anything. Can another option be added to the existing flag outSAMreadID ?

In this case, the tool I am having an error with is funannotate:

[04/26/24 15:57:29]: R1 header: A00609:128:HHMHLDSXY:3:1102:26350:29340 0:N:  00 and R2 header: A00609:128:HHMHLDSXY:3:1102:26350:29340 1:N:  00 are not 1 and 2 as expected
[04/26/24 15:57:29]: ERROR: FASTQ headers are not properly paired, see logfile and reformat your FASTQ headers

I can circumvent this error by reverting the read IDs, but it is a needless step that could be prevented from enabling the read IDs to be emit correctly by STAR.

I am not the first to describe this issue: https://www.biostars.org/p/9536602/

Many thanks :-)

teryyoung commented 2 months ago

I got the issue like yours, I am using fgbio to process my RNA-seq data, which add /A /B suffix to reads' ID, but it seems STAR recognized it as identification of reads1 and reads2, so after STAR, my reads ID lost these /A or /B suffix. I am thinking use single thread run STAR to keep the same order of raw reads, then manually modifiy the read ID, it could take a long time.... Did you solve this problem in any other way?

Appreciation to you.

calizilla commented 2 months ago

@teryyoung I used bbtools filterbyname.sh to correct the reads - it was very fast and straightforward. The below script was executed in parallel over a list of samples on a cluster, with a PBS script that included module load bbtools/39.01:

#!/bin/bash

# Use bbtools to correct the changes STAR made to read names, that
# have caused funannotate train to fail - see https://github.com/alexdobin/STAR/issues/2129

sample=$1

star_dir=./STAR_align_trimmed
fastq_in=./Fastq_trimmed
fastq_out=./Fastq_fixed_unmapped_trimmed

# Make temp fastq ID list
#       STAR dir has lane in outfile names - dont need to account for this for this dataset, that 
#       is just one pair per sample, not super robust but just doing it this way for speed (fix if > 1 pair per sample) 
unmapped=$(ls ${star_dir}/${sample}_*Unmapped.out.mate1) 
temp_ids=${star_dir}/${sample}_unmapped_ids.tmp

awk '$1~/^\@/ {print $1}' ${unmapped} | sed 's/^@//' > ${temp_ids}

# Trimmed fastq, before STAR mapping: 
fq1_in=${fastq_in}/${sample}_1_trimmed.fq.gz
fq2_in=${fastq_in}/${sample}_2_trimmed.fq.gz

# Output - the unmapped against reference, but with read IDs uncorrupted by STAR: 
fq1_out=${fastq_out}/${sample}_1_trimmed.fq.gz
fq2_out=${fastq_out}/${sample}_2_trimmed.fq.gz

filterbyname.sh \
        names=${temp_ids} \
        in=${fq1_in} \
        in2=${fq2_in} \
        out=${fq1_out} \
        out2=${fq2_out} \
        include=t \
        overwrite=t

rm -rf ${temp_ids}

alexdobin / STAR

Read IDs changed by STAR in fastq output are not compatible with some downstream tools #2129