Terence also had some thoughts on format guessing:
With respect to format guessing, another option is not to guess at all: support changing seq-ids in column 1 of any tab-delimited file, perhaps with an option to convert seq-ids in a different column, as would be needed for a feature_table.txt.gz file such as this one:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/715/135/GCF_000715135.1_Ntab-TN90/GCF_000715135.1_Ntab-TN90_feature_table.txt.gz
Then, if the file happens to have 9 columns and a Target= string in column 9, or a ##sequence-region pragma, go ahead and change those, too.
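To make the no-guessing idea concrete, here is a minimal Python sketch. It assumes the caller already has a dictionary mapping old seq-ids to new ones. The function name `rename_seqids`, the `column` parameter, and the seq-ids in the usage note are all hypothetical and only for illustration. It rewrites the chosen column of every tab-delimited line, and additionally rewrites `Target=` in column 9 of 9-column lines and the seq-id of any `##sequence-region` pragma.

```python
import re

# Matches a Target attribute at the start of column 9 or after a ';'.
# The target id ends at the first space (before start/end) or ';'.
TARGET_RE = re.compile(r"(^|;)(Target=)([^ ;]+)")


def rename_seqids(lines, mapping, column=0):
    """Yield lines with seq-ids replaced according to `mapping`.

    `column` is the 0-based column holding the seq-id (assumption: the
    caller knows which column to convert for non-GFF3 tables).
    Ids missing from `mapping` are left unchanged.
    """
    for line in lines:
        text = line.rstrip("\r\n")
        ending = line[len(text):]  # preserve the original line ending

        if text.startswith("##sequence-region"):
            parts = text.split()
            if len(parts) >= 2:
                parts[1] = mapping.get(parts[1], parts[1])
            yield " ".join(parts) + ending
            continue

        # Leave other comments, pragmas, and blank lines untouched.
        if not text or text.startswith("#"):
            yield line
            continue

        fields = text.split("\t")
        if column < len(fields):
            fields[column] = mapping.get(fields[column], fields[column])

        # Nine columns suggests GFF3: also fix Target= in the attributes.
        if len(fields) == 9:
            fields[8] = TARGET_RE.sub(
                lambda m: m.group(1) + m.group(2)
                + mapping.get(m.group(3), m.group(3)),
                fields[8],
            )
        yield "\t".join(fields) + ending
```

Usage would be along the lines of `rename_seqids(open(path), {"NW_1.1": "chr1"})` for column 1, or passing `column=` for a table that keeps the seq-id elsewhere. The sketch does not decompress `.gz` input; wrapping the file with `gzip.open(path, "rt")` is one way to handle that.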
For guessing assemblies, we’re apparently good about including the assembly in the header in both of our main FTP areas with GFF3:
We use the “#!” for non-official pragmas. All of those were supposed to get added into the GFF3 spec, but those efforts died. The “NCBI_Assembly” dbxref is an SO-defined dbxref, so it’s reasonably official.
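As a rough illustration of guessing the assembly from the header, the sketch below scans the leading comment lines for an `NCBI_Assembly` dbxref. It assumes the header carries a line of the form `#!genome-build-accession NCBI_Assembly:<accession>`; that pragma name is not quoted in the message above, so treat it as an assumption to check against the actual files. The function name `guess_assembly` is hypothetical.

```python
def guess_assembly(lines):
    """Return the assembly accession from the GFF3 header, or None.

    Assumption: the header has a "#!genome-build-accession" pragma whose
    value is a dbxref such as "NCBI_Assembly:GCF_000715135.1".
    """
    for line in lines:
        # The header ends at the first non-comment line.
        if not line.startswith("#"):
            break
        if line.startswith("#!genome-build-accession"):
            parts = line.split(None, 1)
            if len(parts) < 2:
                continue
            db, sep, accession = parts[1].strip().partition(":")
            if sep and db == "NCBI_Assembly":
                return accession
    return None
```

Returning `None` when no such line is found lets the caller fall back to an explicit user-supplied assembly rather than guessing further.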