Open david-dayan opened 3 years ago
I can't recall where we landed on this when we chatted. Is everyone okay with changing the multi-snp probes so we don't toss them out in downstream analyses?
I'm not sure I understand. The Campbell pipeline doesn't allow us to call more than one SNP per forward primer. If we want to call multiple SNPs per amplicon we'll need to align the reads and use another SNP caller.
Sandra Bohn Faculty Research Assistant, State Fisheries Genomics Lab Coastal Oregon Marine Experiment Station, Oregon State University 2030 SE Marine Science Dr, Newport, OR, 97365 Office: MSB 238, 541-867-0242
The probe seq files for Ots and Omy use "-" to represent the allele tagged by one of the probes if the probe sequence spans either an indel or multiple variants (examples from the Omy probes below). These loci are scored as a single variant, but the allele is written as "-" in the resulting genotype data, which breaks a lot of downstream analyses.
Indel: Omy_vamp5-303 | A | - | TGGCCGTAGTAGTTGGTCA | TGGCCGTAGTTGGTCA | CTGCTTCCCAATTCAGTATCGTCTT | 0 | 0
TGGCCGTAGTAGTTGGTCA
TGGCCGTAGTT---GGTCA #insertions are mine to emphasize indel
Multi-SNP probe: Omy_hsf2-146 | A | - | ATAATCTACTA | ATAATCTAACA | CCAACAATTGCAGCCTCATCTTAAT | 0 | 0
ATAATCTA-CT-A #insertions are mine to highlight SNPs
ATAATCTA-AC-A
I'm suggesting that we change the probe sequence files for multi-SNP probes so that we don't toss these SNPs. For example, we could change Omy_hsf2-146 like so:
Omy_hsf2-146 | A | C | ATAATCTACTA | ATAATCTAACA | CCAACAATTGCAGCCTCATCTTAAT | 0 | 0
I don't have a good suggestion for what to do with indels.
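To make the proposal concrete, here is a rough Python sketch of how the "-" rows could be sorted into indels and multi-SNP probes and how an agreed replacement allele could be applied. It assumes the pipe-delimited layout shown in the examples above (the real probe seq file may use a different delimiter), and the function names and the `OVERRIDES` table are hypothetical, not part of the Campbell pipeline. The replacement alleles are supplied by hand rather than derived from the probe sequences, since which base should label a multi-SNP probe is a decision for the lab.

```python
from typing import Dict, List, Tuple

# Example rows copied from this thread; the " | " delimiter is an assumption.
INDEL_ROW = ("Omy_vamp5-303 | A | - | TGGCCGTAGTAGTTGGTCA | TGGCCGTAGTTGGTCA"
             " | CTGCTTCCCAATTCAGTATCGTCTT | 0 | 0")
MULTI_ROW = ("Omy_hsf2-146 | A | - | ATAATCTACTA | ATAATCTAACA"
             " | CCAACAATTGCAGCCTCATCTTAAT | 0 | 0")

# Hypothetical hand-curated table: marker name -> (allele1, allele2).
OVERRIDES: Dict[str, Tuple[str, str]] = {"Omy_hsf2-146": ("A", "C")}


def parse_probe_row(line: str) -> List[str]:
    """Split a pipe-delimited probe row into stripped fields."""
    return [field.strip() for field in line.split("|")]


def classify_probe(fields: List[str]) -> str:
    """Label a row 'snp', 'indel', or 'multi_snp' from its alleles and probes."""
    _, allele1, allele2, probe1, probe2 = fields[:5]
    if "-" not in (allele1, allele2):
        return "snp"
    # Probes of different length imply an insertion/deletion between them.
    if len(probe1) != len(probe2):
        return "indel"
    return "multi_snp"


def differing_positions(probe1: str, probe2: str) -> List[int]:
    """Zero-based positions where two equal-length probes differ."""
    return [i for i, (a, b) in enumerate(zip(probe1, probe2)) if a != b]


def recode_row(line: str, overrides: Dict[str, Tuple[str, str]]) -> str:
    """Replace the alleles of a multi-SNP row if an override is listed."""
    fields = parse_probe_row(line)
    if classify_probe(fields) == "multi_snp" and fields[0] in overrides:
        fields[1], fields[2] = overrides[fields[0]]
    return " | ".join(fields)
```

Indel rows pass through `recode_row` unchanged, which matches the open question above about what to do with them.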
Oh, ok. I've just been recoding the - to be consistent with other labs. For example, many of these markers are used for PBT and/or GSI, so they need to be coded the same way as the shared datasets for those analyses.
However, there may be some analyses that would benefit from a different way of coding these alleles. In those cases we may need to use a different pipeline.
@david-dayan This prompted me to discover an issue with markers that use the "-" symbol to code multi-SNP probes or indels. At some point I thought these were codes for missing data, but looking at the probe sequence files I now see that this is wrong. This raises an issue, though: I'm pretty sure these all get tossed once you try to read them into most programs that expect A, T, C, G or even 1, 2, 3, 4 for alleles. That is fine enough for indels, but what about multi-SNP alleles? Do you think we should just edit the probe sequence files to set the allele to one of the SNPs? For example, Omy_hsf2-146 uses A and - for its alleles because the probes vary at multiple positions. Can we change this to A and C so we don't toss the SNP when we import it into other programs?
@sandrabohn We have been encoding this way: A = 1, C = 2, G = 3, T = 4
CRITFC uses a different 3-digit system. I can look theirs up if we want to be compatible with them. I think it's: A = 102, C = 104, G = 106, T = 108
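For illustration, the two coding schemes can be written as lookup tables. This is only a sketch in Python: the CRITFC values are the ones recalled from memory in this thread and have not been checked against CRITFC's documentation, and `encode_genotype` is a hypothetical helper that assumes each genotype is a string with one character per allele.

```python
from typing import Dict, Optional, Tuple

# Scheme described in this thread for our own data.
LAB_CODES: Dict[str, int] = {"A": 1, "C": 2, "G": 3, "T": 4}

# CRITFC 3-digit scheme as recalled in this thread (unverified assumption).
CRITFC_CODES: Dict[str, int] = {"A": 102, "C": 104, "G": 106, "T": 108}


def encode_genotype(genotype: str,
                    codes: Dict[str, int]) -> Optional[Tuple[int, ...]]:
    """Return one numeric code per allele, or None if any allele has no code.

    Alleles written as "-" have no entry in either table, so genotypes
    containing them come back as None instead of being silently converted.
    """
    try:
        return tuple(codes[allele] for allele in genotype.upper())
    except KeyError:
        return None
```

Returning `None` for a genotype such as "A-" makes the loci that still carry the "-" allele easy to find before they are dropped by a downstream program.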