AdamaJava / adamajava

addition of hg38 telomere coordinates for qmotif #346

Open squigzzz opened 5 months ago

squigzzz commented 5 months ago

Hello-

I am interested in using qmotif to quantify telomeres, but my BAMs are aligned to hg38. Is it possible to simply lift over the coordinates you provided in the configuration file to hg38 using UCSC or a similar tool, or could you make the appropriate hg38 coordinates available for this purpose?

holmeso commented 5 months ago

Hi, thanks for your query. We used liftover to create a GRCh38 version of the coordinates.

[PARAMS]
stage1_motif_string=TTAGGGTTAGGGTTAGGG
;stage1_motif_string=TTAGGGTTAGGGTTAGGGTTAGGG
;stage2_motif_string=TTAGGG
stage2_motif_regex=(...GGG){2,}|(CCC...){2,}
;;stage2_motif_regex=((TTA|TCA|TTC|GTA|TGA|TTG|TAA|ATA|CTA|TTT|TTAA)GGG){2,}|(CCC(TAA|TGA|GAA|TAC|TCA|CAA|TTA|TAT|TAG|AAA|TTAA)){2,}
stage1_string_rev_comp=true
window_size=10000
includes_only=false

[INCLUDES]
;; name, regions (sequence:start-stop)
chr1p   chr1:10001-12464
chr1q   chr1:248943708-248946421
chr2p   chr2:10001-12592
chr2q   chr2:242146750-242148749
chr2xA  chr2:242181358-242183529
chr3p   chr3:18323-20322
chr3q   chr3:198233559-198235558
chr3xB  chr3:198170705-198176526
chr4p   chr4:10001-12193
chr4q   chr4:190120458-190123120
chr5p   chr5:10001-13806
chr5q   chr5:181476259-181478258
chr6p   chr6:60001-62000
chr6q   chr6:170743979-170745978
chr7p   chr7:10001-12238
chr7q   chr7:159333868-159335972
chr8p   chr8:60001-62000
chr8q   chr8:145076636-145078635
chr9p   chr9:10001-12359
chr9q   chr9:138260981-138262980
chr10p  chr10:14061-16061
chr10q  chr10:133785144-133787421
chr11p  chr11:60001-62000
chr11q  chr11:135074564-135076621
chr12p  chr12:43740-45739
chr12q  chr12:133262872-133265308
chr12xC chr12:10001-12582
chr13p  chr13:18445861-18447860
chr13q  chr13:114342403-114344402
chr14p  chr14:18243524-18245523
chr14q  chr14:106879333-106881349
chr15p  chr15:19794748-19796747
chr15q  chr15:101978766-101981188
chr16p  chr16:10001-12033
chr16q  chr16:90226345-90228344
chr17p  chr17:150208-152207
chr17q  chr17:83245442-83247441
chr18p  chr18:10001-12621
chr18q  chr18:80256343-80259271
chr19p  chr19:60001-62000
chr19q  chr19:58605455-58607615
chr20p  chr20:79360-81359
chr20q  chr20:64332167-64334166
chr21p  chr21:8522361-8524360
chr21q  chr21:46697876-46699982
chr22p  chr22:15926017-15927980
chr22q  chr22:50804138-50806137
chrXp   chrX:10001-12033
chrXq   chrX:156028068-156030894
chrYp   chrY:10001-12033
chrYq   chrY:57214588-57217414
;..

[EXCLUDES]
; regions (sequence:start-stop)
;chr1:143274114-143274336
;..
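
For readers unfamiliar with the [PARAMS] entries, here is a minimal Python sketch of how a two-stage motif check along these lines could behave. It is only one reading of the parameter names above (stage1_motif_string, stage1_string_rev_comp, stage2_motif_regex), not qmotif's actual implementation; the function names are hypothetical.

```python
import re

# Values copied from the [PARAMS] section above.
STAGE1_MOTIF = "TTAGGGTTAGGGTTAGGG"
STAGE2_REGEX = re.compile(r"(...GGG){2,}|(CCC...){2,}")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def passes_stage1(read, motif=STAGE1_MOTIF, rev_comp=True):
    """Assumed stage 1: cheap substring test for the motif, optionally
    also accepting its reverse complement (stage1_string_rev_comp=true)."""
    if motif in read:
        return True
    return rev_comp and reverse_complement(motif) in read


def stage2_matches(read):
    """Assumed stage 2: return every stretch of the read matched by the regex."""
    return [m.group(0) for m in STAGE2_REGEX.finditer(read)]


if __name__ == "__main__":
    read = "ACGT" + "TTAGGG" * 4 + "ACGT"
    if passes_stage1(read):
        print(stage2_matches(read))
```

The stage 2 regex is deliberately looser than the stage 1 string: any three bases followed by GGG (or CCC followed by any three bases), repeated at least twice, so variant repeats such as TCAGGG are also picked up.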

squigzzz commented 5 months ago

Great, thank you. It is quite unclear in both the publication and the documentation how you go from the XML output to telomere length in kb. How is this done?

holmeso commented 3 months ago

We don’t normalise directly to genome coverage. Rather, we simply scale to a nominal read count of 1B reads to allow for simple comparisons between BAMs with different numbers of reads. So if your BAM has 0.5B reads, all of the scaled scores will be double the raw counts, and if your BAM has 2B reads, the scaled scores will be half of the raw counts. We don’t take any account of unmapped reads, secondary alignments, etc. when scaling; we just count every read. We take this simple approach because, when you are talking about tumours, the correct approach is non-obvious. For example, if we have 3 chromosomes with whole-arm amplifications, how should we take account of that? Clever/correct scaling is left as an exercise for the user, as they know their data best. With all of those caveats, qMotif scaled scores correlate very well with wet-lab techniques, as we showed in the qMotif paper, so we think the simple scaling approach probably works well enough in the majority of cases.
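
The scaling described above is simple arithmetic, sketched below in Python. The function and parameter names are hypothetical and not part of qmotif; the only assumption taken from the comment is that raw counts are scaled linearly to a nominal 1B reads using the total number of reads in the BAM.

```python
NOMINAL_READS = 1_000_000_000  # the nominal 1B reads mentioned above


def scale_count(raw_count, total_reads, nominal_reads=NOMINAL_READS):
    """Scale a raw motif count to a nominal read count.

    total_reads is assumed to be every read in the BAM (mapped, unmapped,
    secondary and so on), as described in the comment above.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return raw_count * nominal_reads / total_reads


if __name__ == "__main__":
    # A BAM with 0.5B reads doubles the raw count; one with 2B reads halves it.
    print(scale_count(1000, 500_000_000))
    print(scale_count(1000, 2_000_000_000))
```

Any smarter correction, for example for tumour ploidy or amplified arms, would replace total_reads with whatever denominator suits the data.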