alexdobin / STAR

RNA-seq aligner
MIT License
1.82k stars 502 forks source link

Wrong locations in reference genome #261

Closed Sofie04 closed 7 years ago

Sofie04 commented 7 years ago

Dear experts, We are using STAR 2.5.3a, but we encountered a problem we weren’t able to solve yet. When we map our sequences to the reference genome we use, the found locations are offset by a variable number of nucleotides compared to the correct location (typically a few 100s of bp), though the identified chromosome is correct. So when we visualize our results it shows much more mismatches than it should, although the mismatches are restricted to four. Note that this behavior is also clear from the sam files, as the match position is close to but different from what we find when looking at the reference genome manually. The error appears to be independent from the different options we selected, and was also present when we used an older version of STAR. Below you find the command we used and the output visualized with IGV. Command: /data/student_homes/public/Star/STAR --runMode genomeGenerate --genomeDir /data/student_homes/sofie.snoeck/Subtel/ --genomeFastaFiles /data/student_homes/sofie.snoeck/hTel_1-15K_1_10_12fasta.TXT --genomeSAindexNbases 7 /data/student_homes/public/Star/STAR --genomeDir /data/student_homes/sofie.snoeck/Subtel/ --readFilesIn /data/student_homes/sofie.snoeck/SRR1246791_1_trim_paired.fastq /data/student_homes/sofie.snoeck/SRR1246791_2_trim_paired.fastq --outFileNamePrefix /data/student_homes/sofie.snoeck/Subtel/ --outFilterMismatchNmax 4 --runThreadN 12 --alignEndsType EndToEnd

image

Hopefully you can provide us with a solution. Thanks in advance!

alexdobin commented 7 years ago

Hi @Sofie04

the most likely explanation is that the genome assembly used by STAR for mapping is slightly different from the one used by IGV, could you please check it? If this is not the case, I would need the genome (just one of the references) and a few reads to test it.

Cheers Alex

Sofie04 commented 7 years ago

Hi @alexdobin

We used the same genome assembly so this shouldn't cause any problem. Attached you can find the reference genome and also a specific read. This read is situated in the beginning of a chromosome and there's an offset of only one base. You can see this in the wordfile: in the reference genome the sequence starts at position 11, but the sam-file says position 12. Reads that are situated further away from the start shows larger offsets. Also attached is the fastq of the specific read sequence.

Best regards, Sofie

example.docx

read1.fastq.txt read2.fastq.txt hTel_1-15K_1_10_12fastaMetTelomeren.txt

alexdobin commented 7 years ago

Hi Sofie,

for this particular example, it seems like starting position of 12 is the correct one:

AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTCCTCAGCCTCTCAACCTGCTTGGGTTACAGGTATGAGCCCGGGTGCCTAGCCAAACATTCCAT$
           CTCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTTCCTCAGCCTCTCAACCTGCTTGGGTT

there are 3 mismatches at the beginning. Could you please send me another example with a larger difference?

Cheers Alex

Sofie04 commented 7 years ago

Hi @alexdobin,

You're answer has given me insight in another problem. You aligned our example with only three mismatches, when looking at the sam-file it gives four mismatches and one deletion. This looked peculiar to me and I've noticed that in every alignment there's exactly one deletion. In attachment, I send you two new examples, with both the wrong location and a wrong alignment as well.

Best regards, Sofie

Examples2.docx FirstReadsOfPair.txt SecondReadsOfPair.txt

alexdobin commented 7 years ago

Hi Sofie,

good point - I missed that deletion! Now I think I have the full explanation. There is an extra 015 symbol at the end of each line in your genome FASTA file (was it generated on Windows ?). It's invisible on screen (you can see it with hexdump), but STAR interprets it as "N" - hence you have extra N every 50 bases. This screws up the coordinates - more so as you go towards the end of the reference. Please delete this symbol - tr -d '\015' should work and let me know if it solves the problem.

Cheers Alex

Sofie04 commented 7 years ago

Hi @alexdobin ,

We are indeed working with Windows, so this is probably where it has gone wrong. But the solution you provided seems to work and our data is now correctly mapped, with the right locations.

Thank you very much for your quick replies! You have helped us a great deal!

Best regards, Sofie

alexdobin commented 7 years ago

Hi Sofie,

happy to hear the problem got resolved. Please let me know if you encounter any other problems.

Cheers Alex