arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0

BED/GFF headers #70

Open fgvieira opened 11 years ago

fgvieira commented 11 years ago

Right now, header lines have to start with '#' (comment) and are generally discarded from the output (e.g. by subtractBed).

It would be nice if headers could be properly handled and printed to the output. Perhaps an option could be added that skips parsing the first line and simply prints it to the output unchanged.
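Until a tool handles headers itself, the behaviour requested here can be approximated outside the tool. The sketch below is only an illustration: `with_header` is a hypothetical helper name, not part of bedtools, and it assumes header lines are exactly the lines that start with '#'.

```shell
# with_header FILE COMMAND [ARGS...]
# Hypothetical helper (not part of bedtools): print every line of FILE that
# starts with '#', then pipe the remaining lines through COMMAND.
with_header() {
    input=$1
    shift
    # Emit the '#' lines unchanged; a file without any is not an error.
    grep '^#' "$input" || true
    # Feed only the non-header records to the wrapped command via stdin.
    grep -v '^#' "$input" | "$@"
}

# Possible usage, assuming the wrapped tool can read its input from stdin:
#   with_header a.bed bedtools subtract -a stdin -b b.bed > out.bed
```

Note that the helper prints all '#' lines first, so a comment line that appears in the middle of the file would be moved to the top of the output.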

sjackman commented 10 years ago

:+1: In particular, the GFF version 3 header ##gff-version 3 must be maintained.
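A quick way to check whether a tool's output kept that line is to look at the first line of the file. This is a minimal sketch; `has_gff3_header` is a made-up name, and the pattern only checks that the version starts with 3.

```shell
# has_gff3_header FILE
# Hypothetical check: succeed only if the first line of FILE is a
# "##gff-version 3" directive (the version may carry further digits).
has_gff3_header() {
    head -n 1 "$1" | grep -q '^##gff-version[[:space:]][[:space:]]*3'
}

# Possible usage:
#   has_gff3_header out.gff || echo "GFF3 header was lost" >&2
```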

arq5x commented 10 years ago

Is the -header option not working for you? The example below is from the bedtools2 repository (please file issues there):

curl -s ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz | gzcat | head -20 > test.gtf

bedtools --version
bedtools v2.19.1

bedtools intersect -header -a test.gtf -b test.gtf | head
##description: evidence-based annotation of the human genome (GRCh37), version 19 (Ensembl 74)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-12-05
chr1    HAVANA  gene    11869   14412   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    11869   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    11869   12227   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    12613   12721   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  gene    13221   14409   .   +   .   gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

arq5x commented 10 years ago

Ah, I see now that you are referring to the fact that some of the tools don't support this functionality. In bedtools2, we are slowly working through standardizing the API for all of the tools. Once done, the result will be that all of the tools (when relevant) will support the -header option.

sjackman commented 10 years ago

bedtools sort -header works perfectly! Thanks, Aaron. I was reading this documentation which doesn't show the -header option.

I expected -header to be the default behaviour. Perhaps instead a -noheader option?

arq5x commented 10 years ago

I see your point, Shaun. The problem with this, however, is that such a change could impact many existing pipelines that are crafted around the assumption that headers will not be emitted by default. I think once we standardize the API this would be something worth revisiting with users on the mailing list to seek feedback about the impact.
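To illustrate the concern: a downstream step that does arithmetic on every input line would produce a bogus value for a header line if headers suddenly appeared. A pipeline step can be written so that it behaves the same either way. The sketch below is an assumption-laden example, not taken from any real pipeline; `interval_lengths` is a hypothetical name, and it treats only lines starting with '#' as headers.

```shell
# interval_lengths
# Hypothetical pipeline step: read tab-separated BED records on stdin and
# print end - start for each one, ignoring any line that starts with '#'.
interval_lengths() {
    awk -F '\t' '!/^#/ { print $3 - $2 }'
}

# Possible usage; the result is the same with or without -header upstream:
#   bedtools sort -header -i in.bed | interval_lengths
```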