cole-trapnell-lab / cufflinks

Boost Software License 1.0
310 stars 116 forks source link

Unusual behavior of cufflinks when a colon is present in the contig names #138

Open sarahfrail opened 8 months ago

sarahfrail commented 8 months ago

Hello, I recently encountered an issue where running cufflinks on a set of sam files that were aligned with hisat2 (command below). cufflinks would run on the files when run without a gtf file as reference, but failed with the following error when I provided the reference.

hisat2 command hisat2 -p 20 -S sample.sam -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz --summary-file samplesummary.txt -x genomeindex --rna-strandness RF --dta-cufflinks --no-unal

with reference command cufflinks -p 16 --library-type fr-firststrand --max-mle-iterations 10000 -g my.gtf -o ./mysample mysample.cuffdocssorted.sam

without reference command cufflinks -p 16 --library-type fr-firststrand --max-mle-iterations 10000 -o ./mysample mysample.cuffdocssorted.sam

error encountered when running with reference

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
File SM_01_WT_-N_night_rA_S1.wheader.docsorted.sam doesn't appear to be a valid BAM file, trying SAM...
[14:26:57] Loading reference annotation.
[14:26:57] Inspecting reads and determining fragment length distribution.
> Processing Locus ctg010490:2-223890:221501-2 [************             ]  49%
Error: this SAM file doesn't appear to be correctly sorted!
    current hit is at ctg000040:2-211194:1601, last one was at ctg000000:2-55635:50523
Cufflinks requires that if your file has SQ records in
the SAM header that they appear in the same order as the chromosomes names 
in the alignments.
If there are no SQ records in the header, or if the header is missing,
the alignments must be sorted lexicographically by chromsome
name and by position.

I am working with a recently assembled draft genome, and so the names of the contigs in the assembly happened to have some less typical characters in them, including a colon. I eventually fixed this issue by removing the colon in all the contig names in the reference genome and the .gtf annotation file, and re-aligning with hisat2.

Any ideas why this failed only when the .gtf reference was provided?