NCBI-Hackathons / Master_gff3_parser

Convert sequence IDs between ucsc/refseq/genbank
MIT License

Additional thoughts on format guessing #3

Open childers opened 7 years ago

childers commented 7 years ago

Terence also had some thoughts on format guessing:

With respect to format guessing, another option is not to guess at all, and instead support changing seq-ids in column 1 of any tab-delimited file, perhaps with an option to convert seq-ids in a different column, as needed for a feature_table.txt.gz file such as this one: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/715/135/GCF_000715135.1_Ntab-TN90/GCF_000715135.1_Ntab-TN90_feature_table.txt.gz

And then if it happens to be 9 columns and has a Target= string in column 9, or a ##sequence-region pragma, then go ahead and change that, too.
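A minimal sketch of that "don't guess" approach, for illustration only. The function name `convert_line`, the `id_map` dictionary, and the `column` parameter are all hypothetical and are not taken from this repository. It assumes the ID mapping has already been loaded into a dict, and it leaves any ID it does not recognise unchanged.

```python
import re


def convert_line(line, id_map, column=0):
    """Rewrite seq-ids in one line (no trailing newline) of a tab-delimited file.

    id_map: dict of old seq-id -> new seq-id (hypothetical, built elsewhere).
    column: 0-based index of the column holding the seq-id.
    """
    # ##sequence-region pragma: the seq-id is the first token after the keyword.
    if line.startswith("##sequence-region"):
        m = re.match(r"(##sequence-region\s+)(\S+)(.*)", line)
        if m is None:
            return line
        return m.group(1) + id_map.get(m.group(2), m.group(2)) + m.group(3)

    # Other comments, pragmas, and blank lines pass through untouched.
    if line.startswith("#") or not line.strip():
        return line

    fields = line.split("\t")
    if column < len(fields):
        fields[column] = id_map.get(fields[column], fields[column])

    # If the line happens to have 9 columns, also rewrite a Target= attribute.
    if len(fields) == 9:
        fields[8] = re.sub(
            r"(^|;)Target=([^\s;]+)",
            lambda t: t.group(1) + "Target=" + id_map.get(t.group(2), t.group(2)),
            fields[8],
        )
    return "\t".join(fields)
```

Called with `column=0` this covers GFF3-like files; for a file that keeps the accession elsewhere, the caller would pass the appropriate 0-based column index.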

For guessing assemblies, we’re apparently good about including the assembly in the header in both of our main FTP areas with GFF3:

ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/GFF/ref_GRCh38.p7_top_level.gff3.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.36_GRCh38.p10/GCF_000001405.36_GRCh38.p10_genomic.gff.gz

    ##gff-version 3
    #!gff-spec-version 1.21
    #!processor NCBI annotwriter
    #!genome-build GRCh38.p10
    #!genome-build-accession NCBI_Assembly:GCF_000001405.36

We use the “#!” for non-official pragmas. All of those were supposed to get added into the GFF3 spec, but those efforts died. The “NCBI_Assembly” dbxref is an SO-defined dbxref, so it’s reasonably official.
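As a rough sketch of how the assembly could be guessed from those "#!" header lines: the function names `read_header_pragmas` and `assembly_accession` are hypothetical and not part of this repository. The code assumes each "#!" line is a key followed by a space and a value, and that the header ends at the first line that does not start with "#".

```python
def read_header_pragmas(lines):
    """Collect '#!key value' pragmas from the leading comment block of a GFF3 file."""
    pragmas = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # tolerate blank lines inside the header
        if not line.startswith("#"):
            break  # first feature line: the header is over
        if line.startswith("#!"):
            key, _, value = line[2:].strip().partition(" ")
            pragmas[key] = value.strip()
    return pragmas


def assembly_accession(pragmas):
    """Return the accession from 'genome-build-accession', or None if absent."""
    value = pragmas.get("genome-build-accession")
    if value is None:
        return None
    # The value is a dbxref such as 'NCBI_Assembly:GCF_000001405.36'.
    return value.split(":", 1)[-1]


header = [
    "##gff-version 3",
    "#!gff-spec-version 1.21",
    "#!processor NCBI annotwriter",
    "#!genome-build GRCh38.p10",
    "#!genome-build-accession NCBI_Assembly:GCF_000001405.36",
]
pragmas = read_header_pragmas(header)
print(pragmas["genome-build"], assembly_accession(pragmas))
```

If the pragma is missing, `assembly_accession` returns `None`, and a caller would need some other way to choose the assembly.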