RobinVanSchendel / SIQ

Sequence Interrogation and Qualification

UnmergedCorrectPositionFR #9

Closed Magz001 closed 8 months ago

Magz001 commented 1 year ago

Hi! I am trying to use SIQ to analyse my Illumina data, and the majority of my reads end up in the category UnmergedCorrectPositionFR (for example, in one of my runs, 81415 reads out of a total of 123759). The data come from Illumina sequencing of a target gene in potato (a tetraploid, highly heterozygous species). Could that be the reason why the majority of my reads are not properly merged by FLASH? Thanks!

RobinVanSchendel commented 1 year ago

That likely means that FLASH was unable to merge those reads because they have no overlap. How large is your amplicon? Illumina runs generally produce 2x150 bp reads (or 2x300 bp on a MiSeq). With 2x150 bp reads, an amplicon larger than, say, 290 bp leaves little or no overlap between R1 and R2, and FLASH cannot merge them. Since these reads are categorized as UnmergedCorrectPositionFR, they do appear to start and end with the primer sequences that you specified. The heterogeneity of the species should not be causing this problem.
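To make the overlap arithmetic concrete, here is a minimal Python sketch. It assumes both reads are full length and untrimmed, so the nominal overlap is simply twice the read length minus the amplicon length. The function name `expected_overlap` is made up for illustration and is not part of SIQ or FLASH.

```python
def expected_overlap(read_length: int, amplicon_length: int) -> int:
    """Return the nominal overlap (in bp) between R1 and R2.

    A positive value means the reads overlap by that many bases; zero or a
    negative value means there is nothing for a merger to work with.
    Assumption: both reads are full length and untrimmed.
    """
    return 2 * read_length - amplicon_length


# A few illustrative (read length, amplicon length) combinations.
for read_length, amplicon_length in [(150, 290), (150, 320), (300, 400)]:
    overlap = expected_overlap(read_length, amplicon_length)
    status = "overlap" if overlap > 0 else "no overlap"
    print(f"2x{read_length} bp, {amplicon_length} bp amplicon: {overlap} bp ({status})")
```

Quality trimming shortens reads in practice, so the real overlap can be smaller than this nominal value.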

Magz001 commented 1 year ago

Thanks a lot Robin, very useful info. Yes, indeed, my amplicon is 403 bp and I used a 2x300 MiSeq service, so the reads overlap only partially. Are there any alternatives for solving my problem? In addition to the R1 and R2 fastq files, I have an extra fastq file with the merged reads provided by the sequencing company. Can I use this file instead? Or can I perhaps change the parameters that FLASH uses to merge my reads? Thanks again!

RobinVanSchendel commented 1 year ago

403 bp should give you enough overlap between the 2x300 bp MiSeq reads for FLASH to merge them, so it is strange that so many of your reads are unmerged. I have set FLASH to be very strict with merging, though, as we found some situations where FLASH would merge incorrectly with looser settings. Those settings are not adjustable. However, merging is optional in SIQ: if you only provide a file in R1, SIQ does not merge and uses that file as is. I would suggest putting the already merged reads from the sequencing company there. The alternative is to merge the reads yourself first and then feed those files to SIQ.

Note that if you have various SNPs inside your amplicon, SIQ might discard those reads because it then finds two mutations in the same read, for example a SNP and a small deletion. If you run into these kinds of problems, please let me know. One of the ideas we have for SIQ is to add handling for multiple reference sequences (in case of heterogeneity) and then choose the best-matching one.
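As a rough illustration of the "choose the best-matching reference" idea, the sketch below scores a read against several candidate references and picks the most similar one. This is not how SIQ works or is planned to work. It uses Python's standard-library `difflib.SequenceMatcher` as a simple stand-in for a proper aligner, and the function name `best_reference`, the allele names, and the sequences are all invented for the example.

```python
from difflib import SequenceMatcher


def best_reference(read: str, references: dict[str, str]) -> tuple[str, float]:
    """Return (name, similarity) of the reference most similar to the read."""

    def similarity(ref_seq: str) -> float:
        # autojunk=False stops difflib from ignoring very frequent characters,
        # which would otherwise hurt on long sequences over a 4-letter alphabet.
        return SequenceMatcher(None, read, ref_seq, autojunk=False).ratio()

    name = max(references, key=lambda n: similarity(references[n]))
    return name, similarity(references[name])


# Hypothetical alleles that differ by one SNP (G vs C at the 7th base).
references = {
    "allele_A": "ACGTACGTACGTAAGGCCTT",
    "allele_B": "ACGTACCTACGTAAGGCCTT",
}
# Hypothetical read: allele_B carrying a 1 bp deletion near the 3' end.
read = "ACGTACCTACGTAAGGCTT"

name, score = best_reference(read, references)
print(name, round(score, 3))
```

Against the matching allele the read shows only the deletion, whereas against the other allele it would show a SNP plus a deletion, which is the two-mutation situation described above.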

Magz001 commented 1 year ago

I have various SNPs in my amplicon and also some other kinds of variation (insertions and deletions). Such variation is very common in highly heterozygous species such as potato, and even more so in my case, since I am working on a non-coding region of the potato genome. The variation is not located at the target site, but it is still within the amplicon.

I have now tried with my merged files and without specifying the primer sequences, to avoid the problem of incorrect positions. I obtained the following:

TotalReads: 113110
MergedCorrect: 63027
MergedButWrong: 50083
MergedCorrectPositionFR: 70644
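For a quick sanity check, the counts above can be turned into fractions of the total with a few lines of Python. This is only arithmetic on the reported numbers. It makes no assumption about how SIQ defines the categories, beyond observing that MergedCorrect and MergedButWrong happen to add up to TotalReads in this run.

```python
# Counts copied from the SIQ output reported above.
counts = {
    "TotalReads": 113110,
    "MergedCorrect": 63027,
    "MergedButWrong": 50083,
    "MergedCorrectPositionFR": 70644,
}

total = counts["TotalReads"]

# Fraction of all reads that falls into each reported category.
fractions = {
    name: value / total for name, value in counts.items() if name != "TotalReads"
}
for name, fraction in fractions.items():
    print(f"{name}: {counts[name]} ({fraction:.1%})")

# True when MergedCorrect and MergedButWrong together account for every read.
sums_to_total = counts["MergedCorrect"] + counts["MergedButWrong"] == total
print("MergedCorrect + MergedButWrong == TotalReads:", sums_to_total)
```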

In the Top100BadReads sheet of the output file I see that all of them are marked "false" for ExactReadFound. Maybe these are the ones discarded because of the various SNPs and other variations?

I'll take a look at the graphs by running SIQPlotteR to see if I can get a mutation profile for each of my samples. Next time I will also consider shorter amplicons to avoid (as much as possible) the problem of intrinsic variation.

Thanks a lot for your kind help!