Oshlack / Clinker

Gene Fusion Visualiser
MIT License

problem running Clinker on STAR-Fusion output #11

Open denisovg opened 5 years ago

denisovg commented 5 years ago

Clinker works fine in our hands on the test data provided with the source code in the GitHub repository. However, we have difficulty producing meaningful output when STAR-Fusion results are used as input for Clinker. Clinker runs to completion without error messages, but no data is found in the "alignment" subfolder after the run.

More specifically, we run the following command:

bpipe -p out=SKBR3 -p caller=data/SKBR3_star-fusion.fusion_predictions.tsv -p del="t" -p col="6,8" -p print="true" -p competitive=true -p header=true -p align_mem=32000000000 -p genome_mem=32000000000 -p fusions="TATDN1:GSDMB" $CLINKERDIR/workflow/clinker.pipe data/SKBR3.Left.fq.gz data/SKBR3.Right.fq.gz

where the first two lines in the "caller" file, data/SKBR3_star-fusion.fusion_predictions.tsv, are:

FusionName JunctionReadCount SpanningFragCount SpliceType LeftGene LeftBreakpoint RightGene RightBreakpoint JunctionReads SpanningFrags LargeAnchorSupport FFPM LeftBreakDinuc LeftBreakEntropy RightBreakDinuc RightBreakEntropy annots

TATDN1--GSDMB 229 324 ONLY_REF_SPLICE TATDN1^ENSG00000147687.18 chr8:124539025:- GSDMB^ENSG00000073605.18 chr17:39909924:- &GRPundef@HWI-EAS418:7:39:1563:750,&GRPundef@HWI-EAS418:7:50:1361:162,&GRPundef@HWI-EAS418:7:1:1026:266,&GRPundef@HWI-EAS418:8:42:1649:890,&GRPundef@HWI-EAS418:7:12:1541:540,&GRPundef@HWI-EAS418:7:84:992:1927,&GRPundef@HWI-EAS418:7:26:1419:1336,&GRPundef@HWI-EAS418:7:6:474:982,&GRPundef@HWI-EAS418:8:18:1298:1792,&GRPundef@HWI-EAS418:7:39:1061:1882,&GRPundef@HWI-EAS418:8:74:964:1246,&GRPundef@HWI-EAS418:7:8:1109:1374,&GRPundef@HWI-EAS418:8:31:1613:1324,&GRPundef@HWI-EAS418:7:70:175:886,&GRPundef@HWI-EAS418:8:31:1072:604,&GRPundef@HWI-EAS418:7:12:1635:1745,&GRPundef@HWI-EAS418:7:63:885:309,&GRPundef@HWI-EAS418:8:100:343:1523,&GRPundef@HWI-EAS418:7:88:363:1275,&GRPundef@HWI-EAS418:7:18:820:1514,&GRPundef@HWI-EAS418:7:5:1593:1989,&GRPundef@HWI-EAS418:7:34:1171:461................&GRPundef@HWI-EAS418:7:12:355:1524,&GRPundef@HWI-EAS418:8:63:602:1590,&GRPundef@HWI-EAS418:8:68:1193:532,&GRPundef@HWI-EAS418:7:47:999:1488,&GRPundef@HWI-EAS418:8:81:848:1401,&GRPundef@HWI-EAS418:7:57:276:1261,&GRPundef@HWI-EAS418:8:35:947:752,&GRPundef@HWI-EAS418:7:86:1155:630 YES_LDAS 30.4759 GT 1.9219 AG 1.5628 ["CCLE","Klijn_CellLines","FA_CancerSupp","ChimerPub","INTERCHROMOSOMAL[chr8--chr17]"]

and the first few lines in the FASTQ file SKBR3.Left.fq are (SKBR3.Right.fq is similar):

@HWI-EAS418:8:1:3:1091/1
GGGAGCGGCTTCGGGTGCCCTCGNTNGNNTNNNNNNNNNNNNNNNNNNNN
+HWI-EAS418:8:1:3:1091
@AABBA9?B?>@1;;,#######!#!#!!#!!!!!!!!!!!!!!!!!!!!
@HWI-EAS418:8:1:3:674/1
CACGTGGTCCACAAATGGTTGGCTTNTNNGNNNNNNNNNNNNNNNNNNNN
+HWI-EAS418:8:1:3:674
BABA@@<BB=@797@?<?#####!#!!#!!!!!!!!!!!!!!!!!!!!
@HWI-EAS418:8:1:3:1772/1
CGAGCCGTGAGCTCGCGCGACCACTNTNNCNNNANNNNNNNNGNNNNNNN
+HWI-EAS418:8:1:3:1772
B?BABBA?>?>A;B>B=>:>#####!#!!#!!!#!!!!!!!!#!!!!!!!
@HWI-EAS418:8:1:3:1421/1
GGGGTCCACGGGCAACAGCAACCTGNGNNGNNNGNNNNNNNNGNNNNNNN
+HWI-EAS418:8:1:3:1421

In particular, I am wondering if the bpipe option -p col="6,8" is correct for the "caller" file I have been using. (The format of this file, produced by STAR-Fusion, is very different from the format of your test example).
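One quick way to sanity-check a column specification is to look up which header names it selects. The Python sketch below is only an illustration and is not part of Clinker: the helper name columns_at is hypothetical, and it assumes that col positions are 1-based (consistent with col=1,2,3,4 for the four-column test file) and that the STAR-Fusion header is tab-separated.

```python
# Header line copied from the STAR-Fusion file quoted in this issue,
# assuming the fields are tab-separated in the real file.
HEADER = (
    "FusionName\tJunctionReadCount\tSpanningFragCount\tSpliceType\t"
    "LeftGene\tLeftBreakpoint\tRightGene\tRightBreakpoint\tJunctionReads\t"
    "SpanningFrags\tLargeAnchorSupport\tFFPM\tLeftBreakDinuc\t"
    "LeftBreakEntropy\tRightBreakDinuc\tRightBreakEntropy\tannots"
)


def columns_at(header_line, col_spec, delimiter="\t"):
    """Return the header names at the 1-based positions in col_spec, e.g. "6,8".

    Hypothetical helper for illustration; assumes 1-based column numbering.
    """
    names = header_line.rstrip("\n").split(delimiter)
    positions = [int(p) for p in col_spec.split(",")]
    return [names[p - 1] for p in positions]


print(columns_at(HEADER, "6,8"))  # ['LeftBreakpoint', 'RightBreakpoint']
```

Under those assumptions, col="6,8" selects the two breakpoint columns, each of which holds a chromosome and a coordinate in one field.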

Please, let me know if you need more info.

Regards, Gennady

breons commented 5 years ago

Hi Gennady, thanks for taking the time to try Clinker, let's get it working for you.

Your column parameter looks like it should work (plus you would generally receive an error if it wasn't).

Which stage are you getting up to in particular? By the sounds of it you're completing the first stage and the alignment hasn't been performed via STAR yet? Do you have a fst_reference.fasta that has appeared in your SKBR3\reference\ directory?

Thanks! Breon.

denisovg commented 5 years ago

Hi Breon,

Thanks for getting back to me promptly.

Yes, the file SKBR3/reference/fst_reference.fasta was produced by my run

Regards, Gennady

breons commented 5 years ago

No problem at all!

Breaking this problem down:

The next stage is the STAR alignment. I just noticed your FASTQ input file naming convention. Could you please try renaming and gzipping the files so they are called SKBR3_R1.fastq.gz and SKBR3_R2.fastq.gz, respectively?

Within workflow\clinker.pipe this file naming structure is hard-coded (another user brought this up in a previous issue, https://github.com/Oshlack/Clinker/issues/9). I will endeavour to resolve this in the next version release with a catch-all that handles everyone's naming conventions. Of course, if this is the problem and you have a standardised format for your FASTQ naming, we can make this change to your clinker.pipe as the other user did. If not, and this is a key requirement for you, then I will escalate the issue and rustle something up.

If you could make this change and then rerun the pipeline, after either deleting the contents of the previous Clinker run (the SKBR3 folder) or simply changing your out parameter to something like SKBR3a, you will get fresh output.

Let me know how you go! Breon.

denisovg commented 5 years ago

Hi Breon,

Thanks for your additional input.

Following your suggestion, I renamed the input FASTQ files to be SKBR3.Left.fastq.gz and SKBR3.Right.fastq.gz, respectively.

Furthermore, I re-ran STAR-Fusion and then Clinker to produce fresh output. Both runs completed smoothly, without any error messages.

The fst_reference.fasta file, as well as a number of files in the "genome" subfolder were produced:

SKBR3/reference: fst_reference.fasta

SKBR3/genome: chrLength.txt chrName.txt Genome SA chrNameLength.txt chrStart.txt genomeParameters.txt SAindex

However, the SKBR3/alignment folder is still empty.

One possibility I have been thinking about is whether the Clinker option -p col=6,8 is correct/sufficient for parsing the "caller" file produced by STAR-Fusion.

In the test example provided in the Clinker GitHub repository, the "caller" CSV file, bcr_abl1.csv, is

"chrom1","base1","chrom2","base2"
"chr22",23524426,"chr9",133729451

and the option -p col=1,2,3,4 specifies the chromosome id and the coordinate inside the chromosome as separate columns.

However, the format of the "caller" TSV file produced by STAR-Fusion is different:

FusionName JunctionReadCount SpanningFragCount SpliceType LeftGene LeftBreakpoint RightGene RightBreakpoint JunctionReads SpanningFrags LargeAnchorSupport FFPM LeftBreakDinuc LeftBreakEntropy RightBreakDinuc RightBreakEntropy annots

TATDN1--GSDMB 235 315 ONLY_REF_SPLICE TATDN1^ENSG00000147687.18 chr8:124539025:- GSDMB^ENSG00000073605.18 chr17:39909924:- ..........

In each of the columns specified by the option -p col=6,8, the chromosome id is separated from the coordinate by a colon (:). I am not sure how Clinker handles this and whether it is "aware" that the colon is used here as a separator (since we don't tell Clinker explicitly that STAR-Fusion results are used as input).

Regards, Gennady

breons commented 5 years ago

Hi Gennady,

Apologies, I should have made it clear. Would you mind renaming the files to the names below and trying again: SKBR3_R1.fastq.gz and SKBR3_R2.fastq.gz

The _R1 and _R2 are important in this case :).

In terms of the STAR-Fusion input, I think that stage has been completed successfully and the col="6,8" is correct. Clinker has been developed to recognise that when two columns have been specified, the input format must be something like chrA:123456 or chrB;123456.
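To make the two-column format concrete, here is a minimal Python sketch of splitting such a field into its parts. It is not Clinker's actual parser: the function name split_breakpoint is hypothetical, and treating the optional third part as a strand is an assumption based on the STAR-Fusion values quoted in this thread (for example chr8:124539025:-).

```python
def split_breakpoint(field):
    """Split 'chr8:124539025:-' or 'chrB;123456' into (chrom, position, strand).

    Illustrative only; strand is None when the field has no third part.
    """
    separator = ":" if ":" in field else ";"
    parts = field.strip().split(separator)
    if len(parts) < 2:
        raise ValueError(f"cannot split breakpoint field: {field!r}")
    chrom = parts[0]
    position = int(parts[1])  # raises ValueError if the coordinate is not numeric
    strand = parts[2] if len(parts) > 2 else None
    return chrom, position, strand


# The two breakpoint values from the STAR-Fusion row quoted in this issue.
print(split_breakpoint("chr8:124539025:-"))
print(split_breakpoint("chr17:39909924:-"))
```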

If you look at the first two lines of the fst_reference.fasta file we can verify this. If the first line is a fusion gene (such as GENEA:GENEB) and the second line is basically a string of bases, then the STAR-Fusion input has been parsed successfully. If your first line is a single gene (such as GENEA), then there's been a problem during this stage.
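The check described above could be scripted roughly as follows. This is a hedged sketch rather than a Clinker utility: the function names are hypothetical, and it assumes the first line may carry a leading ">" as in ordinary FASTA and that the sequence line uses only the letters A, C, G, T and N.

```python
def looks_like_fusion_reference(first_line, second_line):
    """Return True if the header names two genes (GENEA:GENEB) and the next line is bases.

    Assumptions (not confirmed by the Clinker source): an optional leading '>'
    on the header, and a sequence alphabet of A, C, G, T, N.
    """
    header = first_line.strip().lstrip(">")
    sequence = second_line.strip().upper()
    has_two_genes = header.count(":") >= 1 and all(header.split(":")[:2])
    is_bases = bool(sequence) and set(sequence) <= set("ACGTN")
    return has_two_genes and is_bases


def check_reference(path):
    """Apply the check to the first two lines of a file such as fst_reference.fasta."""
    with open(path) as handle:
        return looks_like_fusion_reference(handle.readline(), handle.readline())


# Made-up lines for illustration only, not taken from a real Clinker run.
print(looks_like_fusion_reference(">GENEA:GENEB", "ACGTTGCANNACGT"))
print(looks_like_fusion_reference(">GENEA", "ACGTTGCA"))
```

Calling check_reference("SKBR3/reference/fst_reference.fasta") would apply the same test to the file from the run discussed here.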

When you do rerun, could you please post your command line output of the generate_fst stage? There will be a count of how many fusions were successfully parsed from the STAR-Fusion input.

Thanks! Let me know how you go :).


EDIT: Maybe just send me the whole bpipe output, not just the generate_fst stage.

denisovg commented 5 years ago

Thank you Breon!

Your suggestion works: the alignment folder is no longer empty.

However, I noticed that STAR-Fusion and Clinker actually have contradictory requirements for naming the input FASTQ files:

1) STAR-Fusion will produce an "empty" output file star-fusion.fusion_predictions.tsv if the input files named SKBR3_R1.fastq.gz and SKBR3_R2.fastq.gz (acceptable by Clinker) are used

2) on the other hand, Clinker's output subfolder "alignment" will be empty if the input files named SKBR3.Left.fq.gz and SKBR3.Right.fq.gz (acceptable by STAR-Fusion) are used.

Thus, one has to rename the files between the STAR-Fusion and Clinker runs!
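One way to avoid renaming files back and forth is to keep the originals and create symbolic links with the _R1/_R2 names. The Python sketch below is a workaround idea, not something provided by Clinker or STAR-Fusion: the function link_for_clinker and the clinker_in directory are hypothetical, and it assumes that Clinker only inspects the file names and will read through symbolic links.

```python
import os


def link_for_clinker(left_fastq, right_fastq, sample, out_dir="."):
    """Create <sample>_R1.fastq.gz and <sample>_R2.fastq.gz symlinks to the originals.

    Hypothetical helper: the original files keep their STAR-Fusion-friendly names.
    """
    os.makedirs(out_dir, exist_ok=True)
    links = []
    for source, tag in ((left_fastq, "R1"), (right_fastq, "R2")):
        link = os.path.join(out_dir, f"{sample}_{tag}.fastq.gz")
        if os.path.lexists(link):
            os.remove(link)  # replace a stale link left by an earlier run
        os.symlink(os.path.abspath(source), link)
        links.append(link)
    return links


# Example call (commented out because it needs the real files on disk):
# link_for_clinker("data/SKBR3.Left.fq.gz", "data/SKBR3.Right.fq.gz", "SKBR3", "clinker_in")
```

The returned paths can then be passed to bpipe in place of the original file names.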

Therefore, I believe it would be helpful in the future to relax Clinker's FASTQ file naming requirements. Why not just use the order of the files passed to bpipe to distinguish between the "left" and "right" FASTQs?

Regards, Gennady

breons commented 5 years ago

Hi Gennady,

Great to hear that you got it working! If you run into any other problems, please don't hesitate to ask. Happy to help.

That's a really interesting discovery! I will certainly be updating how FASTQs get read in, based on this thread and another. I will let you know when it's complete, but it will certainly be in the next version.

Cheers, Breon.