bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments

Not all loci in output bedfile #34

Open ljohansson opened 3 months ago

ljohansson commented 3 months ago

Dear @readmanchiu,

I am using straglr via the vip pipeline (https://github.com/molgenis/vip/), as described in an earlier thread. As input we provide a bed file with loci, and I expected all of these loci to appear in the output vcf. However, loci are often missing from the output, and which ones are missing differs per sample.

We have not yet found the cause of the missing loci. Could it be that straglr filters loci based on quality? If so, is there an option to force all loci into the vcf, using the QUAL and FILTER columns to flag low quality while keeping the locus in the output file?

Because we are using the @philres fork, it could be an issue related to that fork, but I believe this question is not related to the altered code. If you have any insights they would be very welcome.

readmanchiu commented 3 months ago

I haven't tried the vip pipeline. From what you wrote, you can specify the source of Straglr, and you are using the @philres fork. Straglr's only filtering is based on the number of supporting reads, and the number of events (number of loci, not the number of lines) should be the same between the tsv and bed files. I don't know whether the @philres fork does any filtering when it converts Straglr's output to VCF. In any case, I've been asked to produce a VCF output. I'm still at the investigation phase, but it's targeted for the next release.

readmanchiu commented 2 months ago

VCF output has been added in v1.5.0. Some loci may be missing, possibly because the provided target motif does not match the detected motif. Feel free to send me data for investigation if possible.

ljohansson commented 2 days ago

Dear @readmanchiu, Apologies for not reacting sooner. I had missed your replies. Thank you for adding vcf output to straglr. In the meantime MOLGENIS VIP has created their own Straglr fork (https://github.com/molgenis/straglr). We have learnt that in the philres fork, variants are filtered out when the number of repeat units (RU) matches the reference genome. In that case the repeat is considered not to be a variant.

readmanchiu commented 1 day ago

Running Straglr in genome scan mode will only report loci that are larger than the reference, whereas running it in genotype mode (with loci of interest provided) will return genotypes for all loci, regardless of whether they match the reference. I suggest checking whether any loci are still missing in the new vcf output.
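
For anyone who wants to run that check, here is a minimal sketch of one way to list input loci that have no record in the output vcf. It is not part of Straglr. The function names are made up for illustration, and it assumes that a vcf record belongs to a locus when its POS falls inside the bed interval (with one base of slack), which may not hold for every vcf writer.

```python
def read_bed_loci(handle):
    """Yield (chrom, start, end) per BED line; start is 0-based, end is exclusive."""
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


def read_vcf_positions(handle):
    """Collect the 1-based POS values of all VCF records, grouped by chromosome."""
    positions = {}
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t")[:2]
        positions.setdefault(chrom, []).append(int(pos))
    return positions


def missing_loci(bed_handle, vcf_handle, slop=1):
    """Return BED loci with no VCF record inside the interval.

    Assumption: a record matches a locus when its POS lies within
    [start + 1 - slop, end + slop] in 1-based coordinates.
    """
    positions = read_vcf_positions(vcf_handle)
    missing = []
    for chrom, start, end in read_bed_loci(bed_handle):
        hits = positions.get(chrom, [])
        if not any(start + 1 - slop <= pos <= end + slop for pos in hits):
            missing.append((chrom, start, end))
    return missing


def report(bed_path, vcf_path):
    """Print every input locus that is absent from an uncompressed VCF."""
    with open(bed_path) as bed, open(vcf_path) as vcf:
        for chrom, start, end in missing_loci(bed, vcf):
            print(f"{chrom}\t{start}\t{end}")
```

Calling `report("loci.bed", "sample.vcf")` (placeholder file names) prints the loci that are absent, which can then be compared across samples or across the different forks.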