cole-trapnell-lab / cufflinks

Boost Software License 1.0

Cufflinks produces very strange output when a microRNA is annotated inside an intronic region of a gene #52

Open sajvanderzeeuw opened 8 years ago

sajvanderzeeuw commented 8 years ago

I first posted this problem on the Cufflinks mailing list, but there have been no replies yet ( https://groups.google.com/forum/#!topic/tuxedo-tools-users/_E94jkdvMak ).

The problem we are experiencing is that in the GRCh38 annotation from RefSeq, the gene HLA-B is annotated on chromosome 6, and between its exons 4 and 5 there is an annotated miRNA called MIR6891. The quantification with Cuffquant goes terribly wrong here, as the table below shows:


HLA-B 1340.16 994.534 923.688 1650.58 1266.27 2167.43 2692.21
MIR6891 329936 167527 113865 399491 282248 82857.6 114646

As you can see, I get counts for MIR6891 that go through the roof, while HLA-B appears to be expressed at a relatively low level by comparison. When checking in IGV or the UCSC Genome Browser, I see that not a single read aligns to the miRNA itself, but a lot of split reads span the region. Our current guess is that these split reads are all being assigned to the miRNA as well, even though their actual bases align to exons 4 and 5 of HLA-B. I know this is probably not easy to fix, but it might be a good idea to distribute a GTF file containing only mRNAs and lincRNAs, or something like that. I wonder whether other users have experienced the same issue and how they worked around it. Thanks already!
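A minimal sketch of the kind of annotation pre-filtering suggested above, assuming a standard tab-separated GTF with `transcript_id` attributes on exon records. The function name `filter_short_transcripts` and the 200 bp cutoff are hypothetical choices for illustration, not anything Cufflinks provides or recommends, and the sketch has not been validated against Cuffquant output.

```python
import re
from collections import defaultdict

# Matches the transcript_id attribute in the 9th GTF column.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def _transcript_id(line):
    """Return (fields, transcript_id) for a GTF record, or (None, None)."""
    if line.startswith("#"):
        return None, None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None, None
    match = TRANSCRIPT_ID.search(fields[8])
    return fields, (match.group(1) if match else None)


def filter_short_transcripts(lines, min_length=200):
    """Drop records of transcripts whose summed exon length is < min_length.

    min_length=200 is an arbitrary, assumed cutoff. Comment lines and
    records without a transcript_id are kept unchanged. A transcript
    with no exon records counts as length 0 and is dropped.
    """
    lines = list(lines)

    # Pass 1: total exon length per transcript (GTF coordinates are
    # 1-based and inclusive, hence the +1).
    exon_length = defaultdict(int)
    for line in lines:
        fields, tid = _transcript_id(line)
        if fields is not None and tid is not None and fields[2] == "exon":
            exon_length[tid] += int(fields[4]) - int(fields[3]) + 1

    # Pass 2: keep everything except records of short transcripts.
    kept = []
    for line in lines:
        _, tid = _transcript_id(line)
        if tid is not None and exon_length.get(tid, 0) < min_length:
            continue
        kept.append(line)
    return kept


# Hypothetical usage (file names are placeholders):
# with open("annotation.gtf") as src, open("filtered.gtf", "w") as dst:
#     dst.writelines(filter_short_transcripts(src))
```

Filtering on exon length rather than on a biotype attribute avoids depending on fields that not every GTF carries, but it would also drop any other short transcripts, so the cutoff needs checking against the annotation in use.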

Czh3 commented 8 years ago

Yes, I found this too. When genes are short (shorter than the read length), cuffnorm gives a VERY, VERY, VERY large FPKM. I hope this gets fixed.
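To illustrate why a short feature can end up with an enormous FPKM, here is a small sketch using the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). This is not Cufflinks' actual estimator, which works with an effective length and a statistical assignment of fragments to transcripts; the function name and all numbers below are made up for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase per million mapped fragments.

    Simplified definition for illustration only; Cufflinks itself uses
    an effective transcript length rather than the raw length.
    """
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Illustrative, made-up numbers: the same 1000 fragments assigned to a
# 2000 bp transcript versus a 50 bp feature, in a 20 million fragment library.
long_gene = fpkm(1000, 2000, 20_000_000)
short_feature = fpkm(1000, 50, 20_000_000)
print(long_gene, short_feature, short_feature / long_gene)
```

Because the length sits in the denominator, any fragments wrongly credited to a very short feature are inflated in proportion to how short it is, which would be consistent with the behaviour described in this thread.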