Kingsford-Group / squid

SQUID detects both fusion-gene and non-fusion-gene structural variations from RNA-seq data
BSD 3-Clause "New" or "Revised" License
40 stars 22 forks source link

Annotation of output #16

Closed matq007 closed 5 years ago

matq007 commented 5 years ago

I have a question regarding annotation. I've been trying to convert the BEDPE format to bed file and then use AnnotSV tool to annotate the breakpoints. My issue is that every time I run the annotation I am not able to properly predict the positioning of the genes in the fusion. Do you have any tips on how to annotate the output? My goal is pretty simple: build a list of fusion genes. Thank you 😄

Congm12 commented 5 years ago

Hi.

I added a script to annotate the output of SQUID. It takes in the GTF file, squid output, and labels whether the TSV is fusion-gene (both breakpoints are inside genes and the TSV orientation agrees with the strands of the genes) or non-fusion-gene type. And if the TSV is fusion gene, it outputs the pair of gene names corresponding to the fusion event, which is separated by a colon.

I hope this script achieves your goal. Please clone the latest git repo (the script is located in the utils folder, and is not included in the binary relase), and check the new README to run the script.

And feel free to contact me if you have further questions.

matq007 commented 5 years ago

Hi again, I ran into a small issue.

Traceback (most recent call last):
  File "AnnotateSQUIDOutput.py", line 320, in <module>
    Transcripts = ReadGTF(GTFfile, key_gene_id, key_gene_symbol)
  File "AnnotateSQUIDOutput.py", line 101, in ReadGTF
    assert(e[0] in Transcripts)
AssertionError

Here is my gtf file, I'm using GRCh38 version.

head -n 10 genes.gtf
chr1    BestRefSeq      exon    11874   12227   .       +       .       gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1    BestRefSeq      exon    12613   12721   .       +       .       gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1    BestRefSeq      exon    13221   14409   .       +       .       gene_id "DDX11L1"; gene_name "DDX11L1"; transcript_id "rna0"; tss_id "TSS31672";
chr1    BestRefSeq      exon    14362   14829   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    14970   15038   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    15796   15947   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    16607   16765   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    16858   17055   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    17233   17368   .       -       .       gene_id "WASH7P"; gene_name "WASH7P"; transcript_id "rna1"; tss_id "TSS15911";
chr1    BestRefSeq      exon    17369   17391   .       -       .       gene_id "MIR6859-1"; gene_name "MIR6859-1"; transcript_id "rna3"; tss_id "TSS49116";

Commands I've ran

STAR \\
        --genomeDir ${index} \\
        --sjdbGTFfile ${gtf} \\
        --runThreadN ${task.cpus} \\
        --readFilesIn ${reads[0]} ${reads[1]} \\
        --twopassMode Basic \\
        --chimOutType SeparateSAMold --chimSegmentMin 20 --chimJunctionOverhangMin 12 --alignSJDBoverhangMin 10 --outReadsUnmapped Fastx --outSAMstrandField intronMotif \\
        --outSAMtype BAM Unsorted ${avail_mem} \\
        --readFilesCommand zcat
samtools sort Aligned.out.bam > Aligned.out.sorted.bam
samtools view -Shb Chimeric.out.sam > Chimeric.out.bam
squid -b Aligned.out.sorted.bam -c Chimeric.out.bam -o fusions
AnnotateSQUIDOutput.py ${gtf} fusions_sv.txt fusions_annotated.txt

Debug output I've extracted

e[0] = "rna0"
print(list(Transcripts.keys())) = []

Thank you very much for the help so far :)

Congm12 commented 5 years ago

I was expecting the GTF annotation contain transcript record denoted by the 3rd column. But I updated the script to include the case where GTF only contains exon record. Check the new script to see whether it works.

matq007 commented 5 years ago

Thank you very much! Worked like a charm 👍.