alexdobin / STAR

RNA-seq aligner
MIT License
1.83k stars 504 forks source link

Unmapped reads fastq - 0 and 1 in header #1394

Open MartaBenegas opened 2 years ago

MartaBenegas commented 2 years ago

Hi STAR team!

I was aligning some reads with the following command :

STAR --runThreadN 24 --genomeDir /data/shared/genome --readFilesIn /data/input/SRR937558_1.fastq.gz /data/input/SRR937558_2.fastq.gz --readFilesCommand gunzip -c --alignIntronMin 20 --alignIntronMax 1000000 --outFilterMismatchNmax 999 --outFilterMultimapNmax 20 --alignMatesGapMax 1000000 --outReadsUnmapped Fastx --outSAMattributes All --outSAMtype BAM Unsorted --outFileNamePrefix /tmp/results/SRR937558_ > /data/output/SRR937558_alignment_log.txt 2> /data/output/SRR937558_alignment_err.txt

And I've realized that the output FASTQ of the unmapped reds have 0 and 1 in the "pair member" part of the header instead of 1 and 2. Example header: image

My header:

marta@cyanobacteria:~/Downloads/CASES/Lorenzo_Federico/star_test$ zcat SRR937558_Unmapped_1.fastq.gz | head -n4
@SRR937558.913889 0:N:  00
CATGTATGTATGTATGTCTACACGCATGTATGTCTGCATGTCTGTCTGTCTGTCTGTCTGTCTGCATGTATTATGTATATATGTATGTGTACCACTTGCAT
+
CCCFDFFFHHHGDIJJIJIJJEHIJGIIHIJJIIIJJJJJJJJIIIIJGIIJJIJJGHIJIJJJJGJJIIIIJJGHIIJIJIGEECHHE>CE?EEDDEEF<
marta@cyanobacteria:~/Downloads/CASES/Lorenzo_Federico/star_test$ zcat SRR937558_Unmapped_2.fastq.gz | head -n4
@SRR937558.913889 1:N:  00
GTCCACAGCCTGAGGGCAGAGCGATGGCTCAGAGGTAAAGAGCCTCTCAGCTTCTCACTACTCCAGGTCCAGGGACGCTGATGCCTTCTCTTAACCTCTGC
+
@BCFFFFFHHHHHIJJJJIJJJJIIIFGJIIHIIIDHIJJJJIGHGIJJJGIJJJJJJJJIJCHHEHFFFFFEDDDDDDDDDDDDDDDDDDCDDCCCCCD>

I would like to assemble the reads that do not map with my genome using trinity, and this output is incompatible with it. This software runs seqtk previously and gives the following error:

Converting input files. (both directions in parallel)CMD: seqtk-trinity seq -A -R 1  <(gunzip -c /data/input/SRR937558_Unmapped_1.fastq.gz) >> left.fa
CMD: seqtk-trinity seq -A -R 2  <(gunzip -c /data/input/SRR937558_Unmapped_2.fastq.gz) >> right.fa
Error, found read_type 1 but expecting read_type 2

And this doesn't happen if I manually change the 0 for 1 and the 1 for 2 (for a subset of 6 sequences). My new fastq:

marta@cyanobacteria:~/Downloads/CASES/Lorenzo_Federico/star_test_single$ head -n4 6seqs_1.fastq 
@SRR937558.2 1:N: 
AGACACTTCTGGGAAACTAGGTGATATATCTGTTAAAAATGTTAACAGACAGCCTGACCTCTAAGTGTGATTTCTCAAGACTTGATTTATCACCACCAGGA
+
CCCFFFFFHHHHHJJJJJJJJEHIJJJJJJJJJJJJJGJJJIIJJJJJJJJJJJJJJJJJJJJIIGHHGFIIIIJHJGIHHHHGGFFFFFFFEEEDCDDD=
marta@cyanobacteria:~/Downloads/CASES/Lorenzo_Federico/star_test_single$ head -n4 6seqs_2.fastq 
@SRR937558.2 2:N: 
GTATTAAAATATCTCTGGAAAGTTATTCTAAGAAATATGTTTGGCTCAGAATCCTACATCCTGGTGGTGATAAATCAAGTCTTGAGAAATCACACTTAGAG
+
CBCFFFFFHHHHHJJJJJJJJJIIIJJJJJJJJJJJJJJIJJJJJJJJJJHIJJJJJIJJJJJIIIJGHIIJJJIJJJJEHHFHHHFFFFFDEEEEEDDDD

And now the error doesn't occur:

Converting input files. (both directions in parallel)CMD: seqtk-trinity seq -A -R 1  <(gunzip -c /data/input/6seqs_1.fastq.gz) >> left.fa
CMD: seqtk-trinity seq -A -R 2  <(gunzip -c /data/input/6seqs_2.fastq.gz) >> right.fa
CMD finished (0 seconds)
CMD finished (0 seconds)
CMD: touch left.fa.ok
CMD finished (0 seconds)
CMD: touch right.fa.ok
CMD finished (0 seconds)
Done converting input files.CMD: cat left.fa right.fa > both.fa

So my question is, is there a reason to use the 0 and 1 instead of 1 and 2? is there an option to change the output fastq format?

Kind regards, Marta.

alexdobin commented 2 years ago

Hi Marta,

There is no particular purpose in any characters after the space in the read ID line. You can safely delete them or rename them to match the expectations of the downstream tools.

Cheers Alex