WGLab / LongGF

A computational algorithm and software tool for fast and accurate detection of gene fusion by long-read transcriptome sequencing
GNU General Public License v3.0

Simulated fusion not detected #15

Open iskandr opened 2 years ago

iskandr commented 2 years ago

Hi,

We're working on benchmarking LongGF on simulated long-read data generated using badread. A minority of fusions appear to go undetected when the fusion partners originate on opposite strands of DNA. I guess the mutation in these cases would be a fusion with an inverted sequence. Can LongGF correctly call these kinds of compound mutations?

liuqianhn commented 2 years ago

@iskandr I am not sure what the problem is at this point. If you have a smaller dataset containing the gene fusion you described, I would like to test it to see why.

dvantwisk commented 2 years ago

Hello,

Here is a quick example. The following link contains two files: https://www.dropbox.com/sh/77ui9a3m5yrdte6/AACQFRukk_-9fBUIw1dRimpya?dl=0 . Both come from a (rather absurd) 5000x PacBio coverage simulation of a single fusion transcript (no background transcripts). The simulated fusion transcript that we are trying to find is HPS1:WHAMMP3, and it appears to have the inverted property described in the original post.
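To confirm that the two partners really are annotated on opposite strands, something like the following could be run against the annotation file. This is only a minimal sketch: `gene_strands` is a hypothetical helper name, and it assumes an Ensembl-style GTF in which `gene` records carry a `gene_name "..."` attribute in column 9 and the strand in column 7.

```shell
# Hypothetical helper: print name, contig and strand for the requested genes.
# Assumes a tab-separated GTF with 'gene' features and a gene_name attribute.
# Usage: gene_strands annotation.gtf GENE1 GENE2 ...
gene_strands() {
  gtf="$1"; shift
  awk -F '\t' -v genes="$*" '
    # Build a lookup table of the requested gene names.
    BEGIN { n = split(genes, g, " "); for (i = 1; i <= n; i++) want[g[i]] = 1 }
    # Only look at gene-level records that have a gene_name attribute.
    $3 == "gene" && match($9, /gene_name "[^"]+"/) {
      # Skip the 11 characters of: gene_name "   and drop the closing quote.
      name = substr($9, RSTART + 11, RLENGTH - 12)
      if (name in want) print name "\t" $1 "\t" $7
    }
  ' "$gtf"
}
```

For the case in this thread the call would be `gene_strands Homo_sapiens.GRCh38.105.gtf HPS1 WHAMMP3`. If the third column differs between the two lines, the partners are on opposite strands.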

The first file is a fastq.gz file containing the simulated reads of this single fusion transcript. The second is the BAM file produced by aligning those reads with minimap2 (splice mode) against the hg38 genome and then sorting with samtools sort -n. Running this BAM file through LongGF with the following command fails to find the HPS1:WHAMMP3 fusion:

  LongGF \
  test10_sorted.bam \
  Homo_sapiens.GRCh38.105.gtf \
  40 50 100 > test10_sorted.log

Please let us know whether you can replicate the issue and whether LongGF is designed to detect fusions like this.
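For anyone trying to reproduce this from the fastq.gz file, the preprocessing described above can be sketched roughly as follows. The exact options used for the uploaded BAM are not recorded in this thread, so treat the `-ax splice` preset, the intermediate SAM file, and the file names (`hg38.fa`, `test10.fastq.gz`) as assumptions. `run` and `prepare_bam` are hypothetical helper names; setting `DRY_RUN=1` only prints the commands.

```shell
set -eu

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: prepare_bam <reference.fa> <reads.fastq.gz> <output_prefix>
prepare_bam() {
  ref="$1"; reads="$2"; prefix="$3"
  # Spliced long-read alignment against the reference (assumed preset).
  run minimap2 -ax splice -o "${prefix}.sam" "$ref" "$reads"
  # Sort by read name (-n), as described in the comment above.
  run samtools sort -n -o "${prefix}_sorted.bam" "${prefix}.sam"
}
```

Calling `prepare_bam hg38.fa test10.fastq.gz test10` would then produce `test10_sorted.bam`, the input name used in the LongGF command above.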