Open christopher-schroeder opened 1 month ago
Now I recognized that the motifs are basically the same but only shifted.
I provide a more complicated example:
$ cat results/str_candidates/merged.bed | grep 8500613
1 8500613 8502586 GCCGCCCCGTCCGGGAGGGAGGGGCCCCTCTGCCCG
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCGTCTGGGAAGTGAGGAGC
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT
1 8500613 8502586 CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGGTCAG
1 8500613 8502586 CCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1 8500613 8502586 CCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGGAGGTGGGGGGTCAG
1 8500613 8502586 GCCGCCCCGTCCGGGAAGTGAGGGGCCCCTCTGCCCG
cat results/str/818-14.bed | grep 8500613
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC 2448.0 61.2 4 932.0 23.3 5
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1420.0 28.4 8 305.0 6.1 1
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1295.0 25.9 9 - - -
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1295.0 25.9 9 - - -
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1295.0 25.9 9 - - -
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC 2448.0 61.2 4 976.0 24.4 5
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC 2448.0 61.2 4 952.0 23.8 5
1 8500613 8502586 GCCTCTGCCCGGCCGCCACCCCGTCTGGGAAGTGAGGAGC 2448.0 61.2 4 952.0 23.8 5
Especially line 2 and 3 of the output confuse me.
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1420.0 28.4 8 305.0 6.1 1
1 8500613 8502586 GGGGGGGTCAGCCCCCCGCCCGGCCAGCCGCCCCGTCCGGGAGGTGAGGT 1295.0 25.9 9 - - -
The same motif on the same locus but with different genotypes for one sample.
when you specify the the loci in the bed file, there should one only be one locus (position) per line. If you expect multiple motifs, you should use some regex for STRs for the last (4th) column in the bed file. For example, if you expect CGG and CAG to show up in the same locus, you can use "CG". If you have no clue what to expect, you can use some very generic pattern, like "GC*", which may work for your VNTR example.
So you are only allowed to put in a a position once and if there might be different motifs at a single position I have to use regexp? Why would you implement it in such way? We are also searching for pathogenic motif changes in STRs so especially loci where samples show different motifs are of interest. So I know exactly what motif I am searching for and I also know the position. But sometimes I have several motifs and would like to know if this occurs in a sample and if so how much copies.
The usecase is that if you detect an STR, you would like to know if the detected copy number is very much outside the norm. For that you have to know the norm and genotype the other (control) samples to get this information.
Thanks, @christopher-schroeder for the info. The idea of the input bed file is that each line corresponds to a unique locus; however, flexibility is allowed at the motif (4th column). So you you basically put "a" if you only are sure there is an A in the repeat. This becomes silly so I would probably make a new release such that the motif will be optional - empty column allowed. There will be no screening against a pre-defined motif. I think by removing this screening against pre-defined motifs will allow everything (regardless how they different from the reference) that is detected from the sequences to be reported.
I have called a bunch of samples with straglr and merged the (chrom, start, end, motif) beds into one unified bed. I would like to use this merged bed to genotype each sample again. Sometimes different motifs occur at the same position, but straglr only calles one of them multiple Times. Example:
Here is a position of the merged bed file with multiple motifs at the same position.
I use this bed for genotyping. What I see:
What I am expecting: