bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
61 stars 9 forks source link

STR count per strand #10

Closed sabiqali closed 2 years ago

sabiqali commented 2 years ago

Hi,

I was testing out Straglr on some of the reads that we have available to us and there are some issues with respect to the strand output.

I noticed that the bed file has an overall allele size and copy number output for the entire locus.

But when I look at the tsv for the per read analysis, I do see the column labelled genotype which contains the size or copy number. But this number does not seem to align with the copy number in the copy number column for that particular read. Instead it matched the genotype information in the bed file.

Would it be possible to get the strand data from straglr and determine definitely which strand the copy number in the copy number column is associated with?

I have attached some sample output from straglr: chr9 27573525 27573542 GCCCCG 3616.9(28);14.8(79) e156b402-4d03-4050-97dd-3c11faf43946 667.8 4007 869 3616.9 #with genotype_in_size flag set chr9 27573525 27573542 GCCCCG 602.8(28);2.5(79) e156b402-4d03-4050-97dd-3c11faf43946 667.8 4007 869 602.8 #without genotype_in_size flag

readmanchiu commented 2 years ago

The bed file gives the overall genotype of the locus, whereas the tsv file provides more detailed info of each supporting read. The copy number and size columns correspond to individual support reads, they are then used for clustering into specific alleles and the median of each cluster will be designed the size or copy number of the allele (cluster). Therefore the copy number or size in the tsv file will not be identical to the copy number or size reported in the genotype column; in the genotype column, the copy number or size values are the medians of the clusters that are made up a subset of the the support reads. Hope this makes sense. You want the strand info for each supporting read to be reported, is it for assessing the validity of the reported event? i.e. a mix of positive and negative strands will lend more confidence to the event? Right now the strand info is not reported but it can definitely be incorporated into the tsv output.

sabiqali commented 2 years ago

hi @readmanchiu ,

Yes, that makes sense. The output format makes sense from the perspective that it is reported.

I was asking if the strand information could be explicitly reported. That would be really helpful with verification of the counts before we move forward.

Thank you!

readmanchiu commented 2 years ago

I usually go to the BAM file, locate the support read, and use the read_start together with the size to locate the repeat sequence to verify the reported size (copy number is then calculated by dividing the repeat size with motif length) Would that be sufficient to verify the reported counts? or you think it's such a common artifact to have all support reads coming from the same strand that checking the strand info is critical for assessing the validity of the detected events?

sabiqali commented 2 years ago

Hi, that is a good way to check but some of our datasets have some other artifacts in them and more importantly for us, it would really help when comparing with other tools. That would help us automate the process as well rather than manually checking.

readmanchiu commented 2 years ago

@sabiqali, I have added a "strand" (+/-) column in the tsv output now

sabiqali commented 2 years ago

@readmanchiu, thank you for that. I shall test it out soon.

sabiqali commented 2 years ago

Hi, @readmanchiu the update fixed my issue thank you!