bioperl / bioperl-live

Core BioPerl 1.x code
http://bioperl.org

Sequin table format #349

Open lskatz opened 3 years ago

lskatz commented 3 years ago

Hi, I was wondering if there was any way to parse the NCBI Sequin tbl format? It is defined here: https://www.ncbi.nlm.nih.gov/projects/Sequin/table.html

I don't think I see any parser for it, but I wanted to be sure before writing my own. Thank you!

And the example starts like this.

>Feature Sc_16
1   7000    REFERENCE
            PubMed      8849441
<1  1050    gene
            gene        ATH1
<1  1009    CDS
            product     acid trehalase
            product     Ath1p
            codon_start 2
<1  1050    mRNA
            product     acid trehalase
[offset=2000]
1253    420 gene
            gene    YPR027C
1253    420 CDS
            product     Ypr027cp
            note        hypothetical protein
1253    420 mRNA
            product     Ypr027cp
2626    2535    gene
            gene    trnF
2626    2590    tRNA
2570    2535
            product     tRNA-Phe

lskatz commented 3 years ago

This format is used by Sequin for submitting sequences to GenBank, and most recently it has also turned up in NCBI's VADR package.

cjfields commented 3 years ago

I think there is a Bio::FeatureIO::table but I'm not sure whether that was developed for this particular NCBI format.

cjfields commented 3 years ago

Sorry, I was mistaken. We do have a Bio::SeqIO::table, but that doesn't mention anything about NCBI's table format. That said, you could possibly look at the structure of that one and build from it.

hyphaltip commented 3 years ago

I would also look at funannotate (https://github.com/nextgenusfs/funannotate), the tools Jon Palmer (@nextgenusfs) has developed. It is Python-based, but it has some parsing of these tables to truncate and clean them up when we need to remove contigs or filter out regions that overlap contamination.
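
For anyone who ends up writing their own reader in the meantime, here is a minimal Python sketch of the line-by-line structure shown in the example at the top of this thread. It is not taken from BioPerl or funannotate; the names `parse_feature_table` and `parse_coord` are hypothetical. It assumes that qualifier lines begin with whitespace, that feature lines have three whitespace-separated fields (start, stop, key), and that a two-field line adds another interval to the current feature. The `[offset=N]` value is only recorded on each feature, not applied to the coordinates, so check the NCBI description before relying on it.

```python
import re


def parse_coord(token):
    """Return (position, is_partial) for a coordinate such as '<1' or '1050'."""
    is_partial = token[0] in "<>"
    return int(token.lstrip("<>")), is_partial


def parse_feature_table(text):
    """Parse feature-table text into a list of {'seq_id', 'features'} dicts."""
    tables = []
    table = None     # table currently being filled
    feature = None   # feature currently being filled
    offset = 0       # most recent [offset=N] value (recorded, not applied)

    for raw in text.splitlines():
        line = raw.rstrip()
        if not line.strip():
            continue

        # Header line: '>Feature SeqId', starts a new table.
        if line.startswith(">Feature"):
            fields = line.split()
            seq_id = fields[1] if len(fields) > 1 else None
            table = {"seq_id": seq_id, "features": []}
            tables.append(table)
            feature, offset = None, 0
            continue
        if table is None:
            raise ValueError("content found before a >Feature line")

        # Offset line: '[offset=2000]'.
        match = re.fullmatch(r"\[offset=(-?\d+)\]", line.strip())
        if match:
            offset = int(match.group(1))
            continue

        # Qualifier line: indented, 'key value' (the value may contain spaces).
        if line[0].isspace():
            if feature is None:
                raise ValueError("qualifier without a feature: " + line)
            parts = line.split(None, 1)
            value = parts[1] if len(parts) > 1 else ""
            feature["qualifiers"].append((parts[0], value))
            continue

        # Feature line (start, stop, key) or extra interval line (start, stop).
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("unrecognized line: " + line)
        start, start_partial = parse_coord(fields[0])
        stop, stop_partial = parse_coord(fields[1])
        interval = {"start": start, "stop": stop,
                    "start_partial": start_partial,
                    "stop_partial": stop_partial}
        if len(fields) >= 3:
            feature = {"key": fields[2], "intervals": [interval],
                       "qualifiers": [], "offset": offset}
            table["features"].append(feature)
        elif feature is None:
            raise ValueError("interval without a feature: " + line)
        else:
            feature["intervals"].append(interval)
    return tables


# A few lines from the example above, written with explicit tabs.
sample = "\n".join([
    ">Feature Sc_16",
    "<1\t1050\tgene",
    "\t\t\tgene\tATH1",
    "[offset=2000]",
    "2626\t2590\ttRNA",
    "2570\t2535",
    "\t\t\tproduct\ttRNA-Phe",
])
tables = parse_feature_table(sample)
```

Qualifiers are kept as a list of (key, value) pairs rather than a dict because a feature can repeat a key, as the two `product` lines on the CDS in the example do. A start greater than the stop (for example 2626 to 2590) is left as written, so strand can be inferred by the caller.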