bcgsc / tr_catalog

Tandem repeat catalog from public long-read sequence assemblies

Sizes are not always integer valued #1

Closed: lfearnley closed this issue 3 months ago

lfearnley commented 3 months ago

Great resource!

I was looking at this after reading the preprint and noticed that at the FGF14 locus one of the sizes (which should be in nt) is not integer-valued - it has a size of 391.1. Is this an error?

readmanchiu commented 3 months ago

Thanks for looking at this resource. Hopefully it will help your research. Yes, some sizes are floats because they represent edits from Straglr genotypes (which are just the average or median of the sizes determined from supporting reads). I intentionally left them as floats to differentiate them from assembly-based sizing. In your example, the assembly reconstructed two homozygous alleles, but the ONT reads suggest there is a bigger haplotype.
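As a rough illustration of the convention described above, a reader could tell the two kinds of sizing apart by how each size is written. This is only a sketch: the function name `size_source`, the labels `"reads"` and `"assembly"`, and the assumption that read-based sizes always carry a decimal point in the catalog text are hypothetical and not taken from the repository.

```python
def size_source(size_text: str) -> str:
    """Classify a size by how it is written, not by its numeric value.

    Assumption (hypothetical): read-based sizes are written with a decimal
    point (e.g. "391.1") and assembly-based sizes as plain integers.
    """
    text = size_text.strip()
    float(text)  # raises ValueError if the text is not a number at all
    return "reads" if "." in text else "assembly"


if __name__ == "__main__":
    for value in ["391.1", "150"]:
        print(value, size_source(value))
```

Checking the written form rather than the parsed value means a read-based average that happens to land on a whole number would still be recognised, provided it is stored with its decimal point.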

lfearnley commented 3 months ago

Thanks for getting back to me so quickly!

I should also check an additional definition on that locus - it lists the below motifs:

AAG(521);AGAAGA(16);AGAAGAAGAAGCAGA(3);AGAAGAAGAAGA(1);AGAAGC(1)

AGAAGA and AGAAGAAGAAGA when rotated and collapsed give AAG.

Would those two motif calls be treated as non-AAG throughout the preprint?

readmanchiu commented 3 months ago

No, those two would not be treated as non-AAG. As you said, one of the permutations of the longer motifs, when collapsed, is AAG. The screening is handled by the analysis script.
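The rotate-and-collapse check discussed in this thread can be sketched as follows. This is not the analysis script from the repository; the helper names (`parse_motifs`, `collapse`, `canonical`, `is_equivalent`) are hypothetical, the field format is inferred from the example string quoted above, and reverse complements are deliberately not considered.

```python
import re


def parse_motifs(field: str) -> list[tuple[str, int]]:
    """Split a field like 'AAG(521);AGAAGA(16)' into (motif, number) pairs."""
    return [(m, int(n)) for m, n in re.findall(r"([ACGTN]+)\((\d+)\)", field)]


def collapse(motif: str) -> str:
    """Return the shortest unit whose repetition spells out the motif."""
    n = len(motif)
    for k in range(1, n + 1):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return motif[:k]
    return motif


def canonical(motif: str) -> str:
    """Collapse the motif, then take its lexicographically smallest rotation."""
    unit = collapse(motif.upper())
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def is_equivalent(motif: str, target: str = "AAG") -> bool:
    """True if the motif is a rotated and/or repeated form of the target."""
    return canonical(motif) == canonical(target)


if __name__ == "__main__":
    field = "AAG(521);AGAAGA(16);AGAAGAAGAAGCAGA(3);AGAAGAAGAAGA(1);AGAAGC(1)"
    for motif, number in parse_motifs(field):
        print(motif, number, is_equivalent(motif))
```

Under these assumptions, AAG, AGAAGA and AGAAGAAGAAGA all reduce to AAG, while AGAAGAAGAAGCAGA and AGAAGC do not.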

lfearnley commented 3 months ago

Great, thanks for the clarification! It's much appreciated.