biostars / biostar-handbook

Issue tracker for the Biostar Handbook
Quantification with featureCounts in practice #320

Open BioinfGuru opened 8 months ago

BioinfGuru commented 8 months ago

Hi,

I am currently applying what I learned from RNA-seq by example to my own dataset and have run into what seems to be a common issue when passing an NCBI annotation file (GFF format) to featureCounts.

The feature.gff in the RNA-seq by example demo has this format: gene_name=AAA-750000-UP-4; gene_id=AAA-750000-UP-4; transcript_id=AAA-750000-UP-4-T; exon_number=1;

But NCBI datasets genome.gff has this format: ID=geneLOC100125545;Dbxref=GeneID:100125545;Name=LOC100125545;gbkey=Gene;gene=LOC100125545;gene_biotype=protein_coding
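To make the difference concrete, here is a minimal Python sketch that parses the two attribute strings quoted above. The helper name `parse_attributes` is made up for illustration. By default featureCounts groups features by the `gene_id` attribute (its `-g` option), and the NCBI line has no such key, which is a likely source of the error. Pointing `-g` at a key that does exist (for example `gene`) may be enough, though I have not verified that against this particular file.

```python
def parse_attributes(column9):
    """Split a key=value attribute column (9th GFF column) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        field = field.strip()
        if not field:
            continue  # skip the empty piece after a trailing ';'
        key, _, value = field.partition("=")  # split on the first '=' only
        attrs[key.strip()] = value.strip()
    return attrs


# The two attribute strings quoted in this post.
demo = ("gene_name=AAA-750000-UP-4; gene_id=AAA-750000-UP-4; "
        "transcript_id=AAA-750000-UP-4-T; exon_number=1;")
ncbi = ("ID=geneLOC100125545;Dbxref=GeneID:100125545;Name=LOC100125545;"
        "gbkey=Gene;gene=LOC100125545;gene_biotype=protein_coding")

demo_attrs = parse_attributes(demo)
ncbi_attrs = parse_attributes(ncbi)

print("gene_id" in demo_attrs)  # the demo file has the default key
print("gene_id" in ncbi_attrs)  # the NCBI file does not
print(sorted(ncbi_attrs))       # keys one could try passing to -g instead
```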

The resulting error is pretty common.

But the fixes are a real nuisance for someone who is new to programming. Many people try to parse the GFF files to create SAF (or other weird) formats.
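For reference, the kind of GFF-to-SAF conversion people end up writing looks roughly like the sketch below. It assumes a tab-separated GFF3 file with `key=value` attributes and a `gene` key on the exon lines. The function name `gff_to_saf`, the default arguments, and the example line (sequence name and coordinates) are all made up for illustration. The SAF layout is a header line followed by tab-separated GeneID, Chr, Start, End and Strand columns, and the resulting file is passed to featureCounts with `-F SAF`.

```python
def gff_to_saf(gff_lines, feature_type="exon", id_key="gene"):
    """Yield SAF lines (GeneID, Chr, Start, End, Strand) from GFF3 lines."""
    yield "GeneID\tChr\tStart\tEnd\tStrand"
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue  # keep only well-formed rows of the wanted feature type
        # Parse the 9th column; assumes key=value pairs separated by ';'.
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        gene = attrs.get(id_key)
        if gene is None:
            continue  # no usable identifier on this row
        yield "\t".join([gene, cols[0], cols[3], cols[4], cols[6]])


# Hypothetical example row: sequence name and coordinates are invented.
example = [
    "##gff-version 3\n",
    "chr1\tGnomon\texon\t100\t200\t.\t+\t.\tID=exon-1;gene=LOC100125545\n",
]
saf_rows = list(gff_to_saf(example))
print("\n".join(saf_rows))
```

It is not much code, but it is exactly the sort of detour that is hard on a beginner.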

Just a suggestion: it might be helpful to point out to the reader that a GTF file as provided by Ensembl will work without any complex editing of the reference annotation files.

Love the book. BIG help.

ialbert commented 8 months ago

This is a good point; it is a frustration we all feel.

In the Grouch Grinch section, I go into great detail on what kinds of counting troubles one might run into:

https://www.biostarhandbook.com/books/rnaseq/grinch-count.html

But I should probably make this point early on to prepare the reader for what is ahead.