Sequin table format #349

lskatz opened 3 years ago

lskatz commented 3 years ago

Hi, I was wondering if there was any way to parse the NCBI Sequin tbl format? It is defined here:

I don't think I see any parser for it, but I wanted to be sure before writing my own. Thank you!

And the example starts like this.

>Feature Sc_16
1   7000    REFERENCE
            PubMed      8849441
<1  1050    gene
            gene        ATH1
<1  1009    CDS
            product     acid trehalase
            product     Ath1p
            codon_start 2
<1  1050    mRNA
            product     acid trehalase
1253    420 gene
            gene    YPR027C
1253    420 CDS
            product     Ypr027cp
            note        hypothetical protein
1253    420 mRNA
            product     Ypr027cp
2626    2535    gene
            gene    trnF
2626    2590    tRNA
2570    2535
            product     tRNA-Phe

lskatz commented 3 years ago

This format is used by Sequin for submitting sequences to GenBank, but most recently it has also turned up in the VADR package from NCBI.

cjfields commented 3 years ago

I think there is a Bio::FeatureIO::table but I'm not sure whether that was developed for this particular NCBI format.

cjfields commented 3 years ago

Sorry, I was mistaken. We do have a Bio::SeqIO::table, but it doesn't mention anything about NCBI's table format. That said, you could possibly look at the structure of that one to build from.

hyphaltip commented 3 years ago

I would also look at the tools Jon Palmer (@nextgenusfs) has developed. They are Python based, but they include some parsing of these tables, which we use to truncate and clean them up when we need to remove contigs or filter out regions that overlap contamination.
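
If a starting point helps before writing a full parser, here is a minimal Python sketch of a reader for the layout in the example above. It is not taken from BioPerl, VADR, or any of the tools mentioned in this thread. The function name `parse_tbl` and the dictionary layout are made up for illustration. It assumes that location lines start in the first column, that qualifier lines are indented, and that `<` or `>` in front of a coordinate marks a partial feature. Other parts of the format (offsets, for example) are not handled.

```python
def parse_tbl(lines):
    """Parse feature-table text into {seq_id: [feature, ...]}.

    Each feature is a dict with the keys 'key', 'intervals',
    'partial' and 'qualifiers' (hypothetical layout, for illustration).
    """
    tables = {}
    features = None  # feature list of the table currently being read
    current = None   # feature that interval/qualifier lines attach to

    for lineno, raw in enumerate(lines, 1):
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines

        if line.startswith(">Feature"):
            # Table header: ">Feature <seq_id> [optional table name]"
            fields = line.split()
            if len(fields) < 2:
                raise ValueError(f"line {lineno}: header has no sequence id")
            features = tables.setdefault(fields[1], [])
            current = None

        elif line[0].isspace():
            # Qualifier line: indented "name value" (the value may be absent).
            if current is None:
                raise ValueError(f"line {lineno}: qualifier before any feature")
            parts = line.split(None, 1)
            value = parts[1] if len(parts) > 1 else ""
            current["qualifiers"].append((parts[0], value))

        else:
            # Location line: "start stop key" opens a new feature,
            # "start stop" adds another interval to the current one.
            fields = line.split()
            if features is None:
                raise ValueError(f"line {lineno}: location before >Feature")
            if len(fields) not in (2, 3):
                raise ValueError(f"line {lineno}: unexpected location line")
            start_tok, stop_tok = fields[0], fields[1]
            interval = (int(start_tok.lstrip("<>")), int(stop_tok.lstrip("<>")))
            partial = start_tok[0] in "<>" or stop_tok[0] in "<>"
            if len(fields) == 3:
                current = {
                    "key": fields[2],
                    "intervals": [interval],
                    "partial": partial,
                    "qualifiers": [],
                }
                features.append(current)
            else:
                if current is None:
                    raise ValueError(f"line {lineno}: interval before any feature")
                current["intervals"].append(interval)
                current["partial"] = current["partial"] or partial

    return tables


if __name__ == "__main__":
    # Small excerpt of the example from this thread.
    sample = """>Feature Sc_16
<1  1050    gene
            gene        ATH1
2626    2590    tRNA
2570    2535
            product     tRNA-Phe
"""
    for seq_id, feats in parse_tbl(sample.splitlines()).items():
        for feat in feats:
            print(seq_id, feat["key"], feat["intervals"], feat["qualifiers"])
```

Coordinates are stored exactly as written, so a feature on the minus strand keeps start greater than stop, as in the `1253 420` lines of the example. Qualifiers are kept as an ordered list of pairs rather than a dict because a name such as `product` can appear more than once on a single feature.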