Oshlack / Clinker

Gene Fusion Visualiser
MIT License

EXITING because of INPUT ERROR: the file format of the genomeFastaFile #23

Open JAYRJPT opened 3 years ago

JAYRJPT commented 3 years ago

Hello, I am using Clinker to visualize a fusion reported by FusionCatcher. I made a CSV file with the coordinates of the fusion (DUX4:IGH@), modelled on the bcr_abl1.csv file in the test folder. Here is my command:

bpipe -p out=/home/deepak/output -p caller=$CLINKERDIR/test/caller/dux4_igh.csv -p col=1,2,3,4 -p genome=38 -p print=true -p competitive=true -p header=true -p align_mem=31025992405 -p genome_mem=31025992405 -p threads=30 -p fusions=DUX4:IGH@ $CLINKERDIR/workflow/clinker.pipe $CLINKERDIR/test/fastq/*.fastq.gz

However, the pipeline fails at the STAR genome generation stage (star_genome_gen):

====================================================================================================
|                              Starting Pipeline at 2021-07-02 23:25                               |
====================================================================================================

======================================== Stage generate_fst ========================================

==============================================================

    Fusion Super Transcript Generator

    A fusion visualiser.

==============================================================

==============================================================

Create fusion superTranscriptome:

WARNING: a gene (line 0 of fusion input) does not exist in annotation/hg19_ucscGenes.txt based upon breakpoint.
         Closest mapped gene name is 'RABL2B' (139512811 bp downstream)

--------------------------------------------------------------
Gene Symbols Mapped: 0 Not Mapped: 1 Total: 1
--------------------------------------------------------------

Note: Some superTranscripts were not generated. This could be because of:
    A: The breakpoint was not within a gene (this program only deals with these).
    B: The superTranscript reference file did not contain an entry for that gene symbol.
    C: You have identified the wrong columns, or they contain the wrong information, with the -pos argument.

==============================================================

Creating output directory at: /home/deepak/output
Creating fused superTranscriptome and annotation files

...Success!

Use the plot_fst bpipe workflow or IGV to visualise your results.

==============================================================

====================================== Stage star_genome_gen =======================================
Jul 02 23:25:31 ..... started STAR run
Jul 02 23:25:31 ... starting to generate Genome files

EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /home/deepak/output/reference/fst_reference.fasta is not fasta: the first character is '
' (10), not '>'.
 Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

Jul 02 23:25:31 ...... FATAL ERROR, exiting
ERROR: stage star_genome_gen failed: Command in stage star_genome_gen failed with exit status = 104 : 

STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5 

========================================= Pipeline Failed ==========================================

Command in stage star_genome_gen failed with exit status = 104 : 

STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5

Use 'bpipe errors' to see output from failed commands.

Here is the output of 'bpipe errors':

deepak@ngs:~/ClINKERDIR$ bpipe errors

============================== Found 1 failed commands from run 26797 ==============================

=================================== Command star_genome_gen (68) ===================================

Command    : STAR --runMode genomeGenerate --runThreadN 30 --genomeDir /home/deepak/output/genome --genomeFastaFiles /home/deepak/output/reference/fst_reference.fasta --limitGenomeGenerateRAM 31025992405 --genomeSAindexNbases 5
Started    : Fri Jul 02 23:25:31 IST 2021
Stopped    : Fri Jul 02 23:25:31 IST 2021
Exit Code  : 104
Config: 
                   Name           |  Value 
          ---------------------------------
          max_per_command_threads | 16     
          executor                | local  
          stats_update_interval   | 120000 
          outputScanConcurrency   | 5      
          maxFileNameLength       | 2048   
          name                    | stargen
          procs                   | 1      

Output    : 

    Jul 02 23:25:31 ..... started STAR run
    Jul 02 23:25:31 ... starting to generate Genome files

    EXITING because of INPUT ERROR: the file format of the genomeFastaFile: /home/deepak/output/reference/fst_reference.fasta is not fasta: the first character is '
    ' (10), not '>'.
     Solution: check formatting of the fasta file. Make sure the file is uncompressed (unzipped).

    Jul 02 23:25:31 ...... FATAL ERROR, exiting

Any suggestions for resolving this error?

Thanks and Regards,

Jay

breons commented 3 years ago

Hi Jay, thanks for trying Clinker!

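That error originates in the first stage (generate_fst): the superTranscripts could not be located in the reference files from the coordinates provided (note "Gene Symbols Mapped: 0" in your log), so the fst_reference.fasta handed to STAR does not start with a '>' header line.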
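The condition STAR complains about can be checked before the star_genome_gen stage runs. The following is a minimal sketch, not part of Clinker: the helper names check_fasta_text and check_fasta_file are hypothetical, and the only rule it applies is the one quoted in the STAR message (the first character must be '>'), plus a count of header lines.

```python
from pathlib import Path


def check_fasta_text(text):
    """Return (ok, first_char, n_records) for FASTA content given as a string.

    ok is True only when the very first character is '>' and at least one
    header line is present, mirroring the check in the STAR error message.
    """
    first_char = text[:1]
    # Count lines that begin a FASTA record.
    n_records = sum(1 for line in text.splitlines() if line.startswith(">"))
    ok = first_char == ">" and n_records > 0
    return ok, first_char, n_records


def check_fasta_file(path):
    """Read an uncompressed FASTA file and run the same check on it."""
    return check_fasta_text(Path(path).read_text())


# A file holding only a newline fails in the same way as in the log above.
print(check_fasta_text("\n"))
# A single well-formed record passes.
print(check_fasta_text(">example\nACGT\n"))
```

Calling check_fasta_file on /home/deepak/output/reference/fst_reference.fasta after generate_fst finishes would show whether any superTranscript records were actually written.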

I noticed hg19 has an IGH@ gene, but hg38 does not (at least in Clinker's reference). Did the fusion caller use hg19? If so, simply delete the current output and change your -p genome=38 to -p genome=19.

If you're sure it's hg38, then I'll have to look into why that is missing.

Cheers, Breon.

JAYRJPT commented 3 years ago

Hi Breon, I used FusionCatcher, which used hg38 as the reference genome. The gene coordinates I supplied are hg38 coordinates.

Thanks, Jay

breons commented 3 years ago

Hi Jay,

Sorry for the delay. I will need to rebuild the references to account for IGH@ in hg38; it seems Clinker currently doesn't have a superTranscript for it. The bad news is that this might take me some time, as I am currently finishing some other projects.

However, I'm a bit confused as to why RABL2B (chr22) is coming up as the closest gene, when DUX4 and IGH@ are on other chromosomes in the hg38 reference. Would you mind sharing the CSV with the coordinates? Otherwise, please double-check that the positions are accurate.
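One way to double-check the positions is to print exactly what the selected columns contain. This is only a sketch under stated assumptions: it assumes the four columns named by -p col=1,2,3,4 hold chromosome 1, breakpoint 1, chromosome 2 and breakpoint 2 (1-based column numbers), that -p header=true means the first row is skipped, and the function name read_breakpoints and the sample row are hypothetical placeholders rather than real coordinates.

```python
import csv
import io


def read_breakpoints(handle, cols=(1, 2, 3, 4), header=True):
    """Return a list of (line_no, chrom1, pos1, chrom2, pos2) tuples.

    cols are 1-based column numbers, assumed to be in the order
    chrom1, pos1, chrom2, pos2. A ValueError is raised by int() if a
    position column does not hold a whole number.
    """
    rows = []
    for index, row in enumerate(csv.reader(handle)):
        if header and index == 0:
            continue  # skip the header row
        if not row:
            continue  # skip blank lines
        chrom1, pos1, chrom2, pos2 = (row[c - 1].strip() for c in cols)
        rows.append((index + 1, chrom1, int(pos1), chrom2, int(pos2)))
    return rows


# Placeholder data only; substitute the real caller CSV.
sample = "chrom1,base1,chrom2,base2\nchrA,100,chrB,200\n"
for record in read_breakpoints(io.StringIO(sample)):
    print(record)
```

Running it on the real file, for example with open("dux4_igh.csv", newline="") as the handle, makes it easy to see whether a chromosome name or position has landed in the wrong column.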

Thanks! Breon.