McMahonLab / geodes

Diel transcriptomics of freshwater lakes

gff formatting #7

Closed alexlinz closed 7 years ago

alexlinz commented 7 years ago

I get about 40,000 warnings that say "##sequence-region line missing" when I run htseq-count on my gff file. Is this an issue?

sstevens2 commented 7 years ago

Looks like that is just a line that gives the boundaries of the annotated region. According to the GFF3 spec it is optional but 'recommended', apparently so that parsers can check that the annotations fall within the expected bounds.

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
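
For illustration, a minimal Python sketch of what such lines look like and how they could be generated. The function name `sequence_region_lines` is hypothetical, and taking the bounds from the smallest start and largest end of the features on each sequence is an assumption; the real sequence length from the assembly would be a better source if it is available.

```python
def sequence_region_lines(gff_lines):
    """Build one '##sequence-region seqid start end' line per sequence.

    Assumption: bounds are approximated from the feature coordinates
    (columns 4 and 5), not from the true sequence lengths.
    """
    bounds = {}
    for line in gff_lines:
        # Skip blank lines, comments and existing directives.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        lo, hi = bounds.get(seqid, (start, end))
        bounds[seqid] = (min(lo, start), max(hi, end))
    # Dicts keep insertion order, so sequences appear in file order.
    return [f"##sequence-region {seqid} {lo} {hi}"
            for seqid, (lo, hi) in bounds.items()]
```

The returned lines would go near the top of the file, after the `##gff-version` line.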

alexlinz commented 7 years ago

Good to know that warning is likely not changing my results. But I'm still going to look into using the GFF validator program from your link or something similar instead of changing things in the gff files with awk and sed. I'm concerned that I might be unintentionally changing things that don't necessarily pop up warnings.

sstevens2 commented 7 years ago

Probably a good idea. You could also check that what you think is changing is the only thing changing. Before editing, print the results of your command instead of actually writing them to the file. After editing, do a diff between the old version and the new. Or add a test/assertion to your code that will break it if more than what you expected changes.
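
A minimal sketch of the assertion idea in Python, assuming tab-separated GFF lines and an edit that should only touch one column. The helper name `changed_columns` is hypothetical.

```python
def changed_columns(old_lines, new_lines):
    """Return the set of 1-based column numbers that differ between files.

    Raises AssertionError if the number of lines, or the number of
    columns on any line, is different after the edit.
    """
    if len(old_lines) != len(new_lines):
        raise AssertionError("number of lines changed")
    changed = set()
    for lineno, (old, new) in enumerate(zip(old_lines, new_lines), start=1):
        old_cols = old.rstrip("\n").split("\t")
        new_cols = new.rstrip("\n").split("\t")
        if len(old_cols) != len(new_cols):
            raise AssertionError(f"column count changed on line {lineno}")
        for col, (a, b) in enumerate(zip(old_cols, new_cols), start=1):
            if a != b:
                changed.add(col)
    return changed
```

For an edit that is only meant to change the strand (column 7), a check like `assert changed_columns(old, new) <= {7}` would fail as soon as anything else is altered.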

alexlinz commented 7 years ago

I started using GenomeTools (http://genometools.org/index.html) for merging and tidying GFF files. I still need to do some bash lines because the GFF files from the metagenome assemblies a) don't start with a comment line stating which GFF format we're using and b) use 1 and -1 to indicate strand instead of the standard + and -, which crashes GenomeTools. I'll likely need to write a separate script for every type of reference genome I use, but GenomeTools seems to do a good job of catching and fixing most formatting issues.
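
The two fixes could be sketched in Python roughly as follows. This is not the script actually used here: the function name `tidy_gff` is hypothetical, and writing `##gff-version 3` assumes the files are otherwise GFF3.

```python
# Assumed mapping from the numeric strand codes to the standard symbols.
STRAND_MAP = {"1": "+", "-1": "-"}

def tidy_gff(lines):
    """Add a version line if missing and convert 1/-1 strands to +/-."""
    body = [line.rstrip("\n") for line in lines]
    out = []
    # a) Make sure the file starts with a line stating the GFF version.
    if not body or not body[0].startswith("##gff-version"):
        out.append("##gff-version 3")
    for line in body:
        # Leave comments, directives and blank lines untouched.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        cols = line.split("\t")
        # b) Only rewrite the strand (column 7) of well-formed 9-column rows.
        if len(cols) == 9:
            cols[6] = STRAND_MAP.get(cols[6], cols[6])
        out.append("\t".join(cols))
    return out
```

Rows that already use + or -, or that do not have nine columns, pass through unchanged, so the output can still be handed to a validator afterwards.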