alexdobin / STAR

RNA-seq aligner
MIT License

Add index sequence to Unmapped read FASTQ headers #2223

Open andrewkennard opened 1 month ago

andrewkennard commented 1 month ago

ENHANCEMENT REQUEST:

I've noticed that when I run STAR with --outReadsUnmapped Fastx on Illumina reads, STAR does not simply append the mapping status of the read and its mate to the header. Instead, the final field of the original Illumina header is rewritten in a way that removes the index sequence. It would be very nice to retain the index sequence in the headers of the unmapped reads.

Example of a read in the raw data and in the Unmapped file:

raw data:
@NS500540:129:HKJG2BGX7:4:13402:14458:19861 1:N:0:CGTAAG
unmapped file:
@NS500540:129:HKJG2BGX7:4:13402:14458:19861 0:N:  00

I have only tested this with STAR 2.7.9a, but I didn't see anything about this issue in the changelogs of later releases or in the issue tracker.

Why this would be useful: I ran many samples through STAR together, and they can only be told apart by the index sequence. If the index sequence were kept in the header, I could easily identify which sample each unmapped read came from; right now I can only pinpoint which sequencing run it came from, based on the naming conventions in Illumina FASTQ headers.
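
Until something like this exists in STAR, one possible workaround is to look up each unmapped read name in the original FASTQ and copy the index back. Below is a minimal sketch of that idea, not anything STAR provides. It assumes plain 4-line FASTQ records, that the read name (the text before the first space) is unchanged in the Unmapped file as in the example above, and that the index is the last colon-separated field of the original header. The function names and the "index:" tag appended to the header are hypothetical choices made for illustration.

```python
import io


def read_name(header):
    """Return the read name: the text after '@' up to the first whitespace."""
    return header[1:].split()[0]


def unmapped_names(unmapped_fastq):
    """Collect the read names present in the Unmapped file."""
    # Count lines instead of testing for '@', since quality lines may start with '@'.
    return {read_name(line) for i, line in enumerate(unmapped_fastq) if i % 4 == 0}


def lookup_indexes(raw_fastq, wanted):
    """Stream the original FASTQ and map each wanted read name to its index sequence."""
    indexes = {}
    for i, line in enumerate(raw_fastq):
        if i % 4 != 0:
            continue  # only header lines carry the index
        parts = line.rstrip("\n").split(maxsplit=1)
        name = parts[0][1:]
        if name in wanted and len(parts) == 2:
            # Assumed comment layout: <read>:<filtered>:<control>:<index>
            indexes[name] = parts[1].rsplit(":", 1)[-1]
    return indexes


def annotate(unmapped_fastq, indexes):
    """Yield the Unmapped file's lines with the index appended to each header."""
    for i, line in enumerate(unmapped_fastq):
        line = line.rstrip("\n")
        if i % 4 == 0:
            idx = indexes.get(read_name(line))
            if idx is not None:
                line = f"{line} index:{idx}"  # hypothetical tag, not a STAR convention
        yield line


# Small in-memory demo using the headers from this issue (sequence lines are made up).
raw = "@NS500540:129:HKJG2BGX7:4:13402:14458:19861 1:N:0:CGTAAG\nACGT\n+\nIIII\n"
unmapped = "@NS500540:129:HKJG2BGX7:4:13402:14458:19861 0:N:  00\nACGT\n+\nIIII\n"

names = unmapped_names(io.StringIO(unmapped))
indexes = lookup_indexes(io.StringIO(raw), names)
for out_line in annotate(io.StringIO(unmapped), indexes):
    print(out_line)
```

Collecting the unmapped read names first keeps memory use proportional to the number of unmapped reads rather than to the whole run. The cost is a second pass over the raw data, which is exactly what having the index in the Unmapped headers would avoid.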