ifiddes commented 6 years ago

Here is what NCBI complained about:

I looked over your .tbl file and used it to annotate 200 of the sequences in your submission (the first 100, and then 301-400th). I had to split the fasta file like that because tbl2asn was choking on the fasta file that was >2Gb. Here’s the command line for the first 100 (I didn’t include the discrepancy report because the errors were enough of an issue to deal with):

tbl2asn -i clint_ctg_200bp_exclude0TooBig00000001.fsa -f Clint_Chimp.tbl -M n -j "[organism=Pan troglodytes]" -t SubmissionTemplate.sbt -o clint_ctg_200bp_exclude0TooBig00000001.annot.1003.sqn &

fyi, it spews a warning for each identifier in the .tbl file that isn’t in the .fsa file. But that’s expected with this set-up since there’s only 100 sequences in the fasta file.
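The splitting Karen describes (feeding tbl2asn a subset of records because the full FASTA was over 2 Gb) can be sketched roughly as below. This is a minimal illustration, not the procedure NCBI actually used; `read_fasta_records` and `chunk_records` are hypothetical helper names, and the chunk size is an arbitrary parameter.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each FASTA record as a list of its lines (header first)."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():
            record.append(line.rstrip("\n"))
    if record:
        yield record


def chunk_records(lines: Iterable[str], per_chunk: int) -> Iterator[List[List[str]]]:
    """Group FASTA records into consecutive chunks of at most per_chunk records."""
    chunk: List[List[str]] = []
    for record in read_fasta_records(lines):
        chunk.append(record)
        if len(chunk) == per_chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```

Each chunk can then be written to its own .fsa file and annotated with the same .tbl file; identifiers in the .tbl that are absent from the chunk only produce the warnings mentioned above.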

Please let us know of any questions.

Thanks, Karen

Here are the errors from those first 100 sequences:

=================================================================
28 REJECT-level messages exist

SEQ_FEAT.Range 28

=================================================================
61568 ERROR-level messages exist

SEQ_FEAT.MissingGeneXref 15
SEQ_FEAT.InternalStop 15041
SEQ_FEAT.MissingCDSproduct 64
SEQ_FEAT.NoStop 3386
SEQ_INST.StopInProtein 15086
SEQ_FEAT.UnindexedFeature 16
SEQ_FEAT.NoProtein 19
SEQ_FEAT.WrongQualOnFeature 2240 = ncRNA_class on mRNAs
SEQ_FEAT.CDSwithMultipleMRNAs 231
SEQ_FEAT.AbuttingIntervals 114
SEQ_FEAT.StartCodon 12299
SEQ_FEAT.MissingQualOnFeature 391 = missing ncRNA_class on ncRNA
SEQ_FEAT.ShortIntron 77
SEQ_FEAT.BadInternalCharacter 76 = gnl|CK280|T..... as the product name
SEQ_FEAT.PartialProblem 20
SEQ_FEAT.FeatureProductInconsistency 30
SEQ_FEAT.TransLen 25 = extends past Stop codon (maybe an internalstop codon?)
SEQ_FEAT.PseudoCdsViaGeneHasProduct 101
SEQ_INST.BadProteinStart 12336
SEQ_INST.ShortSeq 1

=================================================================
485337 WARNING-level messages exist

SEQ_FEAT.FeatContentDup 771
SEQ_FEAT.ShortExon 213
SEQ_FEAT.NotSpliceConsensusDonor 222540
SEQ_FEAT.InconsistentPseudogeneValue 177
SEQ_FEAT.MultipleGeneOverlap 141
SEQ_FEAT.PartialProblem 8842
SEQ_FEAT.NotSpliceConsensusAcceptor 231978
SEQ_FEAT.ProductLength 3
SEQ_FEAT.CDSmRNAmismatch 17201
SEQ_FEAT.CDSmRNArange 1286
SEQ_FEAT.CDSwithNoMRNA 271
SEQ_FEAT.DuplicateFeat 1914

=================================================================
14702 NOTE-level messages exist

SEQ_FEAT.RareSpliceConsensusDonor 14682
SEQ_FEAT.PseudoCDSmRNArange 20

=================================================================

These are the kinds of errors that are definitely problems:

SEQ_FEAT.Range = the location of the feature extends beyond the end of the sequence
SEQ_FEAT.InternalStop/SEQ_INST.StopInProtein
SEQ_FEAT.TransLen = extends past Stop codon (maybe an internalstop codon?)
SEQ_FEAT.NoStop = needs a stop codon OR to be partial
SEQ_FEAT.StartCodon/SEQ_INST.BadProteinStart = needs a start codon OR to be partial
SEQ_FEAT.AbuttingIntervals = needs the “low-quality sequence region” exception
SEQ_FEAT.ShortIntron = needs the “low-quality sequence region” exception
SEQ_INST.ShortSeq = protein <10 amino acids. The preference is for proteins to be at least 30aa.

SEQ_FEAT.BadInternalCharacter = gnl|CK280|T..... as the product name
SEQ_FEAT.WrongQualOnFeature = ncRNA_class on mRNAs
SEQ_FEAT.MissingQualOnFeature = missing ncRNA_class on ncRNA
SEQ_FEAT.PseudoCdsViaGeneHasProduct = /pseudo not set up correctly

These may get sorted out:

SEQ_FEAT.MissingCDSproduct
SEQ_FEAT.NoProtein
SEQ_FEAT.CDSwithMultipleMRNAs
SEQ_FEAT.FeatureProductInconsistency
SEQ_FEAT.CDSmRNArange
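For tracking these categories between submission rounds, the per-category counts can be tallied from the validator output. The sketch below assumes one message per line in the form `ERROR: valid [SEQ_FEAT.Xyz] ...`, which is how individual messages are quoted elsewhere in this thread; `tally_messages` is a hypothetical helper, not part of tbl2asn.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable

# Assumed message prefix, e.g. "ERROR: valid [SEQ_FEAT.MissingTrnaAA] ..."
MESSAGE_RE = re.compile(r"\b(REJECT|ERROR|WARNING|NOTE): valid \[([\w.]+)\]")


def tally_messages(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count validator messages per severity level and per error code."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        # finditer also copes with several messages run together on one line.
        for match in MESSAGE_RE.finditer(line):
            severity, code = match.groups()
            counts[severity][code] += 1
    return dict(counts)
```

Comparing the resulting counters before and after a fix gives a quick check that a category such as SEQ_FEAT.WrongQualOnFeature has actually gone to zero.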

And here is a closer look at the actual annotation:

In 000400F_1_903666_quiver_pilon, Focus on CK280_G0054588:

.tbl file:

519401	454972	gene
			locus_tag	CK280_G0054588
518622	518370	ncRNA
518271	518095
457283	454972
			transcript_id	gnl|CK280|T0170363
			ncRNA_class	other
			product	MKRN3
			note	GENCODE_biotype|processed_transcript
519362	518992	mRNA
474410	473483
			transcript_id	gnl|CK280|T0170364
			product	MKRN3
			note	GENCODE_biotype|protein_coding
519296	518992	CDS
474410	474398
			product	MKRN3
			protein_id	gnl|CK280|T0170364_prot
			transcript_id	gnl|CK280|T0170364
519401	519246	mRNA
518271	518095
509719	509583
			transcript_id	gnl|CK280|T0170365
			ncRNA_class	other
			product	MKRN3
			note	GENCODE_biotype|nonsense_mediated_decay
519296	519246	CDS
518271	518236
			product	MKRN3
			protein_id	gnl|CK280|T0170365_prot
			transcript_id	gnl|CK280|T0170365
519306	518992	mRNA
518271	518095
509719	509583
			transcript_id	gnl|CK280|T0170366
			product	MKRN3
			note	GENCODE_biotype|protein_coding
519296	518992	CDS
518271	518095
509719	509653
			product	MKRN3
			protein_id	gnl|CK280|T0170366_prot
			transcript_id	gnl|CK280|T0170366
519395	517058	mRNA
			transcript_id	gnl|CK280|T0170368
			product	MKRN3
			note	GENCODE_biotype|protein_coding
519296	517773	CDS
			product	MKRN3
			protein_id	gnl|CK280|T0170368_prot
			transcript_id	gnl|CK280|T0170368

flatfile view from the .sqn file:

 gene            complement(454972..519401)
                 /locus_tag="CK280_G0054588"
 ncRNA           complement(join(454972..457283,518095..518271,
                 518370..518622))
                 /ncRNA_class="other"
                 /locus_tag="CK280_G0054588"
                 /product="MKRN3"
                 /note="GENCODE_biotype|processed_transcript"
 mRNA            complement(join(473483..474410,518992..519362))
                 /locus_tag="CK280_G0054588"
                 /product="MKRN3"
                 /note="GENCODE_biotype|protein_coding"
 CDS             complement(join(474398..474410,518992..519296))
                 /locus_tag="CK280_G0054588"
                 /codon_start=1
                 /product="MKRN3"
                 /protein_id="CK280:T0170364_prot"
                 /translation="-
                 SRGEGKRDAHFP*RSLRARPPFRASSP*RKNTGEVLAPFRGAKAAMEEPAAPSEAHEA
                 AGAQAGAEAAREGVSGPDLPVCEPSGESAAPDSALPHAARGWASFGS"
 mRNA            complement(join(509583..509719,518095..518271,
                 519246..519401))
                 /ncRNA_class="other"
                 /locus_tag="CK280_G0054588"
                 /product="MKRN3"
                 /note="GENCODE_biotype|nonsense_mediated_decay"
 mRNA            complement(join(509583..509719,518095..518271,
                 518992..519306))
                 /locus_tag="CK280_G0054588"
                 /product="MKRN3"
                 /note="GENCODE_biotype|protein_coding"
 CDS             complement(join(509653..509719,518095..518271,
                 518992..519296))
                 /locus_tag="CK280_G0054588"
                 /codon_start=1
                 /product="MKRN3"
                 /protein_id="CK280:T0170366_prot"
                 /translation="-
                 SRGEGKRDAHFP*RSLRARPPFRASSP*RKNTGEVLAPFRGAKAAMEEPAAPSEAHEA
                 AGAQAGAEAAREGVSGPDLPVCEPSGESAAPDSALPHAARGWAMELSFAVQRGMDKVC
                 GICMEVVYEKANPNDRRFGILSNCNHSFCIRCIRRWRSARQFENLSSRLNVGSITSFF
                 RLISPFWN"
 gene            515425..517815
                 /locus_tag="CK280_G0054589"
 ncRNA           join(515425..515785,517611..517815)
                 /ncRNA_class="antisense_RNA"
                 /locus_tag="CK280_G0054589"
                 /product="MKRN3-AS1"
                 /note="GENCODE_biotype|antisense"
 mRNA            complement(517058..519395)
                 /locus_tag="CK280_G0054588"
                 /product="MKRN3"
                 /note="GENCODE_biotype|protein_coding"
 CDS             complement(517773..519296)
                 /locus_tag="CK280_G0054588"
                 /codon_start=1
                 /product="MKRN3"
                 /protein_id="CK280:T0170368_prot"
                 /translation="-
                 SRGEGKRDAHFP*RSLRARPPFRASSP*RKNTGEVLAPFRGAKAAMEEPAAPSEAHEA
                 AGAQAGAEAAREGVSGPDLPVCEPSGESAAPDSALPHAARGWAPFPVAPVPAHLRRGG
                 LRPAPASGGGAWPSPLPSRSSGIWTKQIICRYYIHGQCKEGENCRYSHDLSGRKMATE
                 GGVLPPGASAGGGPSTAAHIEPPTQEVAEAPPAASSLSLPVIGSAAERGFFEAERDNA
                 DRGAAGGAGVESWADAIEFVPGQPYRGRWVASAPEAPLQSSETERKQMAVGSGLRFCY
                 YASRGVCFRGESCMYLHGDICDMCGLQTLHPMDAAQREEHMRACIEAHEKDMELSFAV
                 QRGMDKVCGICMEVVYEKANPNDRRFGILSNCNHSFCIRCIRRWRSARQFENRIVKSC
                 PQCRVTSELVIPSEFWVEEEEEKQKLIQQYKEAMSNKACRYFAEGRGNCPFGDTCFYK
                 HEYPEGWGDEPPGPGGGSFSAYWHQLVEPVRMGEGNMLYKSIK"
 CDS             complement(join(518236..518271,519246..519296))
                 /locus_tag="CK280_G0054588"
                 /codon_start=1
                 /product="MKRN3"
                 /protein_id="CK280:T0170365_prot"
                 /translation="-SRGEGKRDAHFP*RSLYGTLVCCAAWYG"

Questions & comments:

[1] What's the ncRNA gnl|CK280|T0170363 doing there?

[2] You need to include the protein_id on the corresponding mRNA in the .tbl file OR use GFF and rely on parent:child for linkages (will use the locus_tag as the base). If this came from the .sqn that came from the .gff, then apologies. We are having problems with that GFF converter & hoped to have it finally working next week.

[3] don't include ncRNA processed transcripts

[4] remove the GENCODE_biotype|xxx notes, eg GENCODE_biotype|processed_transcript. Why are they included?

[5] don't include ncRNA_class qualifiers on mRNAs. They belong just on ncRNAs. And they cause WrongQualOnFeature errors.

[6] why does one mRNA have "GENCODE_biotype|nonsense_mediated_decay"? Is CDS gnl|CK280|T0170365_prot not actually made? If not, then don’t include that CDS/mRNA.

[7] why do they all have the same product name but have different translations? Include 'isoform' or something on the product name, yes?

[8] in the .tbl format, you have to provide the codon_start, meaning the first base of the first full codon of the CDS. Eg this doesn't have any internal stop codons if it begins with codon_start=2 (so bp519295 is the first base of the first full codon, since this is on the minus strand):

 CDS             complement(join(474398..474410,518992..519296))
                 /locus_tag="CK280_G0054588"
                 /codon_start=1
                 /product="MKRN3"
                 /protein_id="CK280:T0170364_prot"
                 /translation="-
                 SRGEGKRDAHFP*RSLRARPPFRASSP*RKNTGEVLAPFRGAKAAMEEPAAPSEAHEA
                 AGAQAGAEAAREGVSGPDLPVCEPSGESAAPDSALPHAARGWASFGS"

->

 CDS             complement(join(474398..474410,518992..519296))
                 /locus_tag="CK280_G0054588"
                 /codon_start=2
                 /product="MKRN3"
                 /protein_id="CK280:T0170364_prot"
                 /translation="KAGGKEKEMHTSPREASERGRHSGPQAHKEKIPERFWHHFGVPK
                 QPWKSLQLPQKPTRQPGPRQVLRQQGRVCLGRTFPSVSPPGNLLLQIQPCHMRQGAGR
                 VLEA"

Although that might not be the problem; it could be indels in the sequence, which would have a different solution (below).
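One way to check which codon_start values avoid internal stops, as in the codon_start=1 versus codon_start=2 comparison above, is to translate the spliced CDS in all three frames. The sketch below is an illustration only: it assumes the standard genetic code and a CDS sequence that has already been joined across exons and reverse-complemented for minus-strand features. The function names are hypothetical.

```python
from typing import List

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, needed for minus-strand features."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(cds: str, codon_start: int = 1) -> str:
    """Translate a spliced CDS starting at the 1-based codon_start offset."""
    seq = cds.upper()[codon_start - 1:]
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def frames_without_internal_stop(cds: str) -> List[int]:
    """List the codon_start values (1-3) whose translation has no internal stop."""
    clean = []
    for codon_start in (1, 2, 3):
        protein = translate(cds, codon_start)
        # A stop is only acceptable as the final residue.
        if "*" not in protein[:-1]:
            clean.append(codon_start)
    return clean
```

If no frame is clean, the problem is more likely an indel in the assembly than a wrong codon_start, which is the case the "low-quality sequence region" exception is meant for.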

[8] some product names are actually the _ids:

 gene            518713..519222
                 /locus_tag="CK280_G0054590"
 mRNA            join(518713..518741,519156..519222)
                 /locus_tag="CK280_G0054590"
                 /product="gnl|CK280|T0170369"
                 /note="GENCODE_biotype|protein_coding"
 CDS             join(<518713..518741,519156..>519222)
                 /locus_tag="CK280_G0054590"
                 /codon_start=1
                 /product="gnl|CK280|T0170369"
                 /protein_id="CK280:T0170369_prot"
                 /translation="MCAAVLGPPPMAALAPRNGARTSPVFFLYGLE"

518713	519222	gene
			locus_tag	CK280_G0054590
518713	518741	mRNA
519156	519222
			transcript_id	gnl|CK280|T0170369
			product	gnl|CK280|T0170369
			note	GENCODE_biotype|protein_coding
<518713	518741	CDS
519156	>519222
			product	gnl|CK280|T0170369
			protein_id	gnl|CK280|T0170369_prot
			transcript_id	gnl|CK280|T0170369

[9] in the .tbl format, we need the partial symbols added. (The conversion program will add them when GFF is the input)

So the .tbl file should be like this, with the desired product name, 5' and 3' partial symbols on the features since the CDS doesn't have a start or stop codon, protein_id & transcript_id on both the mRNA & CDS:

<518713	>519222	gene
			locus_tag	CK280_G0054590
<518713	518741	mRNA
519156	>519222
			transcript_id	gnl|CK280|T0170369
			protein_id	gnl|CK280|T0170369_prot
			product	hypothetical protein
<518713	518741	CDS
519156	>519222
			product	hypothetical protein
			protein_id	gnl|CK280|T0170369_prot
			transcript_id	gnl|CK280|T0170369

[10] internal stop codons are illegal, so you need to do one of these things for those CDS (which is in my earlier email below your reply, below):

* add /pseudo to the gene if you think that the CDS is 'broken', meaning that it doesn't translate as you expect. (in a .tbl file, you'll change 'product' to 'note'; in a .gff file you don't need to make that change)
* add /pseudogene (and the right qualifier) if you think the gene is actually a pseudogene (in a .tbl file, you'll change 'product' to 'note'; in a .gff file you don't need to make that change)
* replace the CDS and gene with a misc_feature if you think that this is just a remnant of similarity that was found, but isn't a real gene
* if you think that the gene really does encode a protein but the nucleotide sequence is poor, then change the CDS location so that it produces the desired translation AND include the "low-quality sequence region" exception on the CDS. That will allow the CDS to have a translation & will quiet the validation errors. The protein definition line will be prepended with "LOW QUALITY:" as a warning to database users.

B. Sequence 000100F_1_9097766_quiver_pilon comments

[11] Not sure what this is supposed to do. However, if bp31124 is an insert relative to the human genome, then you'd adjust the CDS location to jump over it & add the "low-quality sequence region" exception.
 CDS             join(30043..30064,31121..31348,32921..32954,36740..36817,
                 40104..40284,40924..41078,41244..41384,42592..42685,
                 44775..44855,50299..50472,51704..51793,51950..52118,
                 54688..54740)
                 /locus_tag="CK280_G0032352"
                 /codon_start=1
                 /product="GSPT1"
                 /protein_id="CK280:T0101284_prot"
                 /translation="-
                 TFRTYW*KWRDRNVSRRIMGAQRRNK*SRARGWFLGRWKAARGKCP*NDGGGRGNPKT
                 *VCGCTARCS*ERACKCSIHWARRCWQVNHWRTNNVYLTGMVDKRTLEKYEREAKEKN
                 RETWYLSWALDTNQEERDKGKTVEVGRAYFETEKKHFTILDAPGHKSFVPNMIGGASQ
                 ADLAVLVISARKGEFETGFEKGGQTREHAMLAKTAGVKHLIVLINKMDDPTVNWSNER
                 YEECKEKLVPFLKKVGFNPKKDIHFMPCSGLTGANLKEQSDFCPWYIGLPFIPYLDNL
                 PNFNRSVDGPIRLPIVDKYKIWALWSWESWNQDLFVKASSL**CQTRNVEVLGILSDD
                 VETDTVAPGENLKIRLKGIEEEEILPGFILCDPNNLCHSGRTFDAQVVIIEHKSIICP
                 GYNAVLHIHTCIEEVEITVLICLVDKKSGEKSKTRPRFVKQDQVCIARLRTAGTICLE
                 TFKDFPQMGRFTLRDEGKTIAIGKVLKLVPEKD*A"
 misc_feature    31124
                 /locus_tag="CK280_G0032352"
                 /note="gap added in CDS to maintain frame, possibly due to
                 error in genome"

[12] Here's another one... it has abutting intervals to adjust for an indel BUT it still has internal stop codons:

 CDS             complement(join(7338291..7338508,7339244..7339324,
                 7343790..7343889,7344083..7344242,7345277..7345369,
                 7346224..7346286,7346807..7346925,7350256..7350420,
                 7351494..7351568,7358473..7358583,7359561..7359646,
                 7361952..7362511,7362524..7362973,7362974..7363346,
                 7363462..7363548,7363731..7363811,7366863..7366967,
                 7381325..7381744))
                 /locus_tag="CK280_G0032515"
                 /note="gaps were added to CDS to maintain frame"
                 /codon_start=1
                 /product="C16orf96"
                 /protein_id="CK280:T0101780_prot"
                 /translation="-
                 PLHSCLGNRERLCHTHYTHTHTHTHTHTRKWNTEGGLWMRPEMNVVLQILFFFLRRSL
                 ALSPRLECSGTILAH*NLHLPDLSNSLPQPPE*LGLQARATTPG*FLYF**RRGFTVL
                 ARLVLNS*PRGPPAFASQSAGITQPTCFSVTLSVLSVFPSGIL*RALLSACMFIIIIS
                 KKKVGSEGWRGGSRL*SQHFGRLRWADCTPAWATE*DSVSKKKKKKKEKKNNALLASA
                 SLVQVILLPQPPE*LGLQVQATTPS*FLCF*WRWGLHHIGQAGPELLNS*STRLSIPK
                 CWDYRREPLCPALLEILNVLLLIQGSGKSCSTETC*NFIYLFIYLLIFETKSHSVAQA
                 GVQWRHLSSLQPLPPGFK*FSCLSFLIAGITGACHQAQLIFVFLVGTVFHHVGQAAVE
                 LLTSSYPTTLTSQSAGVTGMRHCARPDMLSSLAFPKLVCPWNPFPLLGKHWIWNPFFL
                 LKLLVT*SHRKRQGWDRTWTPSPQTQFL*QVHPERMDIFAAQNWKMVALQREVVRATI
                 PVFPHSPWDPPTSRVLGAEGREVQGGEHSLERFLCRCGDTAPSRLTALIFYPHLAGFS
                 PE*V*NHPQNRGHGALEWPP*CHVHLSE*GRKGRLAGSWGAGGRLTESGRRGWKPAEP
                 LCPLFCRKLVDHRWTCGSL*SSSQRQPWPRPPSTLKLLVPSRSPSPSKTPSYCCSGSR

diekhans commented 6 years ago

From: Mark Diekhans markd@soe.ucsc.edu
To: NCBI Genomes genomes@ncbi.nlm.nih.gov
Cc: Ian Fiddes ian.t.fiddes@gmail.com
Subject: RE: GFF input for ape genome (NBAG00000000)
Date: Tue, 3 Oct 2017 17:15:57 -0700

Hi Karen,

The file you looked at did have a bug, which we have just fixed, where our code to insert gaps to avoid frameshifts ended up creating downstream frameshifts. Apologies.

We shifted from GFF3 to tbl format because table2asn didn't honor the phase and didn't output GBFF files. It is just easier to control things with the tbl format. GBFF makes the results much easier to understand than the ASN.1.

Answers (and questions) to your questions below.

NCBI Genomes genomes@ncbi.nlm.nih.gov writes:

[1] What's the ncRNA gnl|CK280|T0170363 doing there?

GENCODE annotates non-coding transcripts in coding genes. These are considered part of the same gene, even though they don't code for proteins.

[2] You need to include the protein_id on the corresponding mRNA in the .tbl file OR use GFF and rely on parent:child for linkages (will use the locus_tag as the base). If this came from the .sqn that came from the .gff, then apologies. We are having problems with that GFF converter & hoped to have it finally working next week.

We will add this.

[3] don't include ncRNA processed transcripts

While I have mixed feelings about the `processed transcripts' category, they are part of the gene set. One that is successfully mapped to another species has value.

[4] remove the GENCODE_biotype|xxx notes, eg GENCODE_biotype| processed_transcript. Why are they included?

Why do they need to be removed? Aren't they comments? The biotype of the source transcript is very useful in understanding the annotation, especially since the mapping from GENCODE biotype to feature is not one-to-one. I will change the note to make it a bit clearer: CAT source GENCODE transcript biotype: XXX

[5] don't include ncRNA_class qualifiers on mRNAs. They belong just on ncRNAs. And they cause WrongQualOnFeature errors.

That is a mistake, we will fix it.

[6] why does one mRNA have "GENCODE_biotype|nonsense_mediated_decay"? Is CDS gnl|CK280|T0170365_prot not actually made? If not, then don’t include that CDS/ mRNA.

There is good evidence that NMD is an imperfect mechanism (e.g. doi:10.1038/sj.ejhg.5201649) and thus GENCODE annotates the CDS. It's also really useful for understanding why the transcript is considered NMD if you know where the CDS is and don't have to recompute it.

[7] why do they all have the same product name but have different translations? Include 'isoform' or something on the product name, yes?

We just used the gene name due to the lack of a standardized way to name protein isoforms (stable isoform names are something GENCODE wants to implement).

We can tack on an `_isoformN' suffix, but this will only be for the NCBI submission, not a stable identifier.

[8] in the .tbl format, you have to provide the codon_start, meaning the first base of the first full codon of the CDS. Eg this doesn't have any internal stop codons if it begins with codon_start=2 (so bp519295 is the first base of the first full codon, since this is on the minus strand):

We will add this.

[8] some product names are actually the _ids:

This is because these genes have no HUGO symbol. I will generate better names that don't include the `gnl|'.

[9] in the .tbl format, we need the partial symbols added. (The conversion program will add them when GFF is the input)

We can get this in, at least for the CDS.
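The partial-symbol convention requested in [9] can be sketched as follows. This is a minimal illustration assuming the intervals are already listed in 5' to 3' order (so minus-strand pairs have start greater than stop, as in the .tbl excerpts above): the `<` goes on the first coordinate and the `>` on the last. `tbl_location_lines` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple


def tbl_location_lines(
    intervals: Sequence[Tuple[int, int]],
    feature_key: str,
    five_prime_partial: bool = False,
    three_prime_partial: bool = False,
) -> List[str]:
    """Format the tab-separated location lines of one .tbl feature.

    intervals must be in 5'->3' order; the feature key goes on the first line.
    """
    lines = []
    last = len(intervals) - 1
    for index, (start, stop) in enumerate(intervals):
        # "<" marks a 5' partial on the very first coordinate.
        start_text = f"<{start}" if five_prime_partial and index == 0 else str(start)
        # ">" marks a 3' partial on the very last coordinate.
        stop_text = f">{stop}" if three_prime_partial and index == last else str(stop)
        columns = [start_text, stop_text]
        if index == 0:
            columns.append(feature_key)
        lines.append("\t".join(columns))
    return lines
```

Whether a CDS is partial would come from the start/stop completeness recorded for the transcript, since the sequence itself is not consulted here.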

[10] internal stop codons are illegal, so you need to do one of these things for those CDS (which is in my earlier email below your reply, below):

Our bug fix greatly reduces the number of in-frame stops.

if you think that the gene really does encode a protein but the nucleotide sequence is poor, then change the CDS location so that it produces the desired translation AND include the "low-quality sequence region" exception on the CDS. That will allow the CDS to have a translation & will quiet the validation errors. The protein definition line will be prepended with "LOW QUALITY:" as a warning to database users.

This is the approach we are taking.

Thank you!

Mark

diekhans commented 6 years ago

Checklist of things to do for next round:

[x] Q2: include the protein_id on the corresponding mRNA
[x] Q4: change biotype note to: CAT source GENCODE transcript biotype: XXX
[x] Q5: don't include ncRNA_class qualifiers on mRNAs
[x] Q6: what to do about NMD
[x] Q7: unique product and protein names
[x] Q8: /codon_start
[x] Q8b: some product names are actually the _ids
[x] Q9: partial symbols
[x] need to get start/end complete from gp_info or gp, since we don't have the sequence.
[x] what to do about real inframe stop insertions

ifiddes commented 6 years ago

See commit e15bb26. Here is the new email, some of which has been addressed:

[1] CDS product names

A. I see that you are using the format _1, _2, etc.

Please use "protein SYMBOL isoform N" (where SYMBOL is the gene symbol).

eg "protein ARTN isoform 6", not "ARTN_6"

The RefSeq pipeline projects the human SwissProt names, with an "isoform N" suffix. That's an excellent name source for vertebrates -- very few names that aren't transferrable verbatim. They don't fuss over using the same "isoform N" identifier for the equivalent isoform in different species.

Can you adopt similar rules?

The request to use "protein SYMBOL" rather than just "SYMBOL" comes from the UniProt protein naming guidelines.

B. some are nucleotide accession.version, eg AP000146.1. That's not allowed, so please use the product name that's on the other record if it conforms to the UniProt guidelines OR use 'hypothetical protein' or 'uncharacterized protein'.

You can also add an inference to point to that nucleotide record (see https://www.ncbi.nlm.nih.gov/genbank/evidence/), eg:

    inference       similar to DNA sequence|INSD|AP000146.1

C. some have the format "HGNC:ID_x". Instead, please use the Approved symbol (or Approved name if it conforms to the UniProt protein naming guidelines).

eg "protein NSG1" instead of "HGNC:18790_1"
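The naming rules in [1] A through C could be applied with something like the sketch below. It is an illustration under stated assumptions: `cds_product_name` is a hypothetical helper, the accession.version pattern is a rough heuristic, and identifiers that would need a lookup (HGNC IDs, gnl| ids) simply fall back to "hypothetical protein" rather than being resolved to an approved symbol.

```python
import re
from typing import Optional

# Heuristic for nucleotide accession.version strings such as AP000146.1.
ACCESSION_VERSION = re.compile(r"^[A-Z]{1,2}\d{5,8}\.\d+$")


def cds_product_name(symbol: Optional[str], isoform: Optional[int] = None) -> str:
    """Build a CDS product name of the form 'protein SYMBOL isoform N'."""
    not_a_symbol = (
        not symbol
        or symbol.startswith(("gnl|", "HGNC:"))
        or ACCESSION_VERSION.match(symbol) is not None
    )
    if not_a_symbol:
        # No usable gene symbol: fall back to a generic name.
        return "hypothetical protein"
    name = f"protein {symbol}"
    if isoform is not None:
        name += f" isoform {isoform}"
    return name
```

For example, `cds_product_name("ARTN", 6)` gives "protein ARTN isoform 6", matching the form requested above.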

[2] ncRNA product names

A. don't 'uniquify' the ncRNAs with _1, etc; just call them by their products.

examples:

count name

  1 LINC00309_1
  1 LINC00310_1
  1 LINC00310_2
  1 LINC00310_3
  6 U2_1
 49 U3_1

B. these are the only non-capitalized/non-gene symbol ones. Are they expected?

count name

  1 hsa-mir-1253_1
  1 hsa-mir-3119-1_1
  1 hsa-mir-3130-1_1
  1 hsa-mir-3158-1_1
  1 hsa-mir-3607_1
  1 hsa-mir-4536-1_1
  1 hsa-mir-4773-1_1
  1 hsa-mir-4776-1_1
  1 hsa-mir-548d-1_1
  1 hsa-mir-548d-2_1
  1 hsa-mir-550a-1_1
  1 hsa-mir-550a-2_1
  2 mascRNA-menRNA_1
  7 pRNA_1
  2 snoMBII-202_1
  3 snoMe28S-Am2634_1
  1 snoR1_1
  6 snoU109_1
 29 snoU13_1
  1 snoU18_1
  2 snoU2-30_1
  4 snoU2_19_1
  1 snoU83B_1
  1 snoZ196_1
  1 snoZ278_1
  1 snoZ40_1
  4 snoZ6_1
  1 snosnR66_1
 36 uc_338_1

C. isn't this actually "RNase_P_RNA" as the ncRNA_class & "RNase P RNA" without underscores as the product?

  7 pRNA_1

D. What's this? should the product be 'MIR338' with 'miRNA' as the ncRNA_class?

 36 uc_338_1

E. don't use systematic name for ncRNA product- T0109237_1. Especially when the class is 'other'. What is this?

2127	2019	gene
			locus_tag	CK280_G0031091
			pseudogene	unprocessed
2127	2019	ncRNA
			transcript_id	gnl|CK280|T0109237
			ncRNA_class	other
			product	T0109237_1
			protein_id	T0109237_1_prot
			note	CAT transcript id: T0109237
			note	CAT alignment id: ENST00000612457.1-0
			note	CAT source transcript id: ENST00000612457.1
			note	CAT source GENCODE transcript biotype: unprocessed_pseudogene

 gene            4039059..4044991
                 /locus_tag="CK280_G0031194"
 ncRNA           join(4039059..4039201,4044690..4044991)
                 /ncRNA_class="lncRNA"
                 /locus_tag="CK280_G0031194"
                 /product="T0109601_1"
                 /note="CAT transcript id: T0109601;
                 CAT alignment id: ENST00000567103.1-0;
                 CAT source transcript id: ENST00000567103.1;
                 CAT source GENCODE transcript biotype: lincRNA"

Errors. [3] Data errors in the first 100 sequences, after making these changes:

[a] This CDS cannot be translated and has to be removed or flagged as /pseudo or /pseudogene in order to run tbl2asn

<9093066	9093064	CDS
			codon_start	1
			product	NPIPB12_4
			protein_id	T0110400_4_prot
			transcript_id	gnl|CK280|T0110400

That's true for any CDS that just includes a stop codon or less than a full codon. (my test was just the first 100 sequences)
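A pre-flight check for such CDS features might look like the sketch below. It is only an illustration: the function names are hypothetical, and the rule that a single-codon CDS with a complete 3' end is "just a stop codon" is my reading of the comment above, not something taken from tbl2asn.

```python
from typing import Sequence, Tuple


def coding_length(intervals: Sequence[Tuple[int, int]]) -> int:
    """Total bases covered; abs() handles minus-strand pairs (start > stop)."""
    return sum(abs(stop - start) + 1 for start, stop in intervals)


def full_codons(intervals: Sequence[Tuple[int, int]], codon_start: int = 1) -> int:
    """Number of complete codons after skipping codon_start - 1 bases."""
    usable = max(coding_length(intervals) - (codon_start - 1), 0)
    return usable // 3


def is_untranslatable(
    intervals: Sequence[Tuple[int, int]],
    codon_start: int = 1,
    three_prime_partial: bool = False,
) -> bool:
    """Flag a CDS with less than one full codon, or with only a stop codon."""
    codons = full_codons(intervals, codon_start)
    if codons == 0:
        return True
    # A 3'-complete CDS ends in its stop codon, so one codon means stop only.
    return codons == 1 and not three_prime_partial
```

Features flagged this way would then be dropped or marked /pseudo or /pseudogene before running tbl2asn.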

[b] There were 50 ShortIntron errors in the first 100 sequences of my test, so I ran it again and included "-c s" in the command line to automatically add the exception for short introns.

You should run:

tbl2asn -M n -c s -i First100.fsa -j "[organism=Pan troglodytes]" -t SubmissionTemplate.sbt -f Clint_Chimp.V1.kc.tbl -o First100.annot.sqn &

[4] Here is the list of errors with a brief description of what needs to be done. Additional detail for some of them is below.

I'll also try to post the full error file back to the SUB2821604 submission. I hope it's there now.

=================================================================
5780 ERROR-level messages exist

SEQ_FEAT.MissingGeneXref 19 FIX- include locus_tag on mRNAs & CDS to assure correct linkage to gene
SEQ_FEAT.CDSmRNAXrefLocationProblem 2 FIX
SEQ_FEAT.AbuttingIntervals 556 we need to quiet this when the 'low-quality sequence region' exception is present
SEQ_FEAT.ShortIntron 7 we need to quiet the error for /pseudo & /pseudogene
SEQ_FEAT.MissingQualOnFeature 2183 FIX- missing ncRNA_class on ncRNA features
SEQ_FEAT.NoStop 935 FIX- partial symbols on minus-strand CDS
SEQ_FEAT.PartialProblem 1771 FIX- have stop codon but 3' is partial
SEQ_FEAT.MissingTrnaAA 4 FIX
SEQ_FEAT.PseudoCdsViaGeneHasProduct 55 FIX- use 'note', not 'product' when gene is /pseudo or /pseudogene
SEQ_FEAT.WrongQualOnFeature 31 FIX- illegal ncRNA_class on mRNA
SEQ_FEAT.NoProtein 87 FIX- include product for non-pseudo/pseudogene CDS
SEQ_FEAT.MissingCDSproduct 87 FIX (same as previous)
SEQ_FEAT.FeatureProductInconsistency 43 FIX- I think this is a different report of those in the NoProtein category

=================================================================
41510 WARNING-level messages exist

SEQ_FEAT.FeatContentDup 424 FIX: misc_features describing the inserted gap in the CDS. Just 1 per gap
SEQ_FEAT.mRNAgeneRange 1 FIX
SEQ_FEAT.CDSgeneRange 1 FIX (same gene as previous error)
SEQ_FEAT.InconsistentPseudogeneValue 259 FIX- use the same type of pseudogene on the gene & its parts
SEQ_FEAT.CDSmRNAmismatch 17769 FIX. use "gnl|CK280|T0109603_prot" format for the protein_ids
SEQ_FEAT.DuplicateFeat 5 INVESTIGATE. Should genes move or be merged?
SEQ_FEAT.CDSmRNArange 552 FIX. Use the same 'frameshift' in the mRNA that you put in the CDS
SEQ_FEAT.PartialProblem 14703 FIX?

=================================================================

"FYI" category of WARNING:

SEQ_FEAT.NotSpliceConsensusDonor 3627
SEQ_FEAT.MultipleGeneOverlap 132
SEQ_FEAT.NotSpliceConsensusAcceptor 3684

Some more detail (in random order, sadly):

[a] WrongQualOnFeature. Missing ncRNA_class.

For example- product = IQCJ-SCHIP1-AS1_1
https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=HGNC:41303
Approved symbol: IQCJ-SCHIP1-AS1
Approved name: IQCJ-SCHIP1 readthrough antisense RNA 1
HGNC ID: HGNC:41303
Previous symbols & names: "IQCJ-SCHIP1 readthrough antisense RNA 1 (non-protein coding)"
Synonyms: -
Locus type: RNA, long non-coding

BUT it's an ncRNA with a protein_id AND missing its ncRNA_class:

12948115	12951338	gene
			locus_tag	CK280_G0006216
12948115	12948410	ncRNA
			transcript_id	gnl|CK280|T0021666
			product	IQCJ-SCHIP1-AS1_1
			protein_id	T0021666_1_prot
			note	CAT transcript id: T0021666
			note	CAT alignment id: ENST00000488247.1-0
			note	CAT source transcript id: ENST00000488247.1
			note	CAT source GENCODE transcript biotype: antisense_RNA

I think the protein_id is ignored, but each ncRNA requires an ncRNA_class qualifier.
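Both ncRNA_class problems (missing on ncRNA, present on mRNA) can be caught before submission with a small lint over the .tbl file. The sketch below assumes the usual tab-delimited feature-table layout: location lines with start, stop and an optional feature key, and qualifier lines with three leading empty columns. The parser is deliberately minimal and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def parse_tbl_features(lines: Iterable[str]) -> List[dict]:
    """Parse .tbl lines into dicts with 'key', 'intervals' and 'quals'."""
    features: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith(">"):
            continue  # skip blank lines and the ">Feature ..." header
        columns = line.split("\t")
        if columns[0] and len(columns) >= 2:
            # Location line; a third column starts a new feature.
            if len(columns) >= 3 and columns[2]:
                features.append({"key": columns[2], "intervals": [], "quals": []})
            if features:
                features[-1]["intervals"].append((columns[0], columns[1]))
        elif len(columns) >= 4 and columns[3] and features:
            # Qualifier line: three empty columns, then name and value.
            value = columns[4] if len(columns) > 4 else ""
            features[-1]["quals"].append((columns[3], value))
    return features


def ncrna_class_problems(features: List[dict]) -> List[Tuple[int, str]]:
    """Report (feature index, message) for misplaced or missing ncRNA_class."""
    problems = []
    for index, feature in enumerate(features):
        has_class = any(name == "ncRNA_class" for name, _ in feature["quals"])
        if feature["key"] == "ncRNA" and not has_class:
            problems.append((index, "ncRNA is missing ncRNA_class"))
        elif feature["key"] != "ncRNA" and has_class:
            problems.append((index, f"{feature['key']} carries ncRNA_class"))
    return problems
```

An empty result for the whole file would correspond to zero MissingQualOnFeature and WrongQualOnFeature reports for this qualifier.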

[b] SEQ_FEAT.CDSmRNAXrefLocationProblem. Check that the CDS matches its mRNA intron/exon junctions:

ERROR: valid [SEQ_FEAT.CDSmRNAXrefLocationProblem] CDS not contained within cross-referenced mRNA FEATURE: CDS: YIF1A_8 <20473> [(lcl|000122F_1_7494545_quiver_pilon:<4791230-4791260, 4791923-4792134, 4792299-4792403, 4792539-4792617, 4794679-4794734)] [lcl|000122F_1_7494545_quiver_pilon: raw, dna len= 7505160] -> [lcl|T0118910_8_prot]

ERROR: valid [SEQ_FEAT.CDSmRNAXrefLocationProblem] CDS not contained within cross-referenced mRNA FEATURE: CDS: FAM49B_27 <64551> [(lcl|000178F_1_4597154_quiver_pilon:<4154736-4154797, 4164163-4164235, 4164238-4164269, 4170751-4170823, 4170825-4170887, 4172161-4172238, 4174234-4174350, 4175595-4175674, 4177133-4177175, 4177178-4177213)] [lcl|000178F_1_4597154_quiver_pilon: raw, dna len= 4600721] -> [lcl|T0137575_27_prot]

[c] SEQ_FEAT.MissingTrnaAA . You need to include the actual amino acid. If you don't know, then use "tRNA-Xxx" (see https://www.ncbi.nlm.nih.gov/genbank/eukaryotic_genome_submission_annotation/#rRNA)

ERROR: valid [SEQ_FEAT.MissingTrnaAA] Missing encoded amino acid qualifier in tRNA FEATURE: tRNA: MT-TS1_1~CAT transcript id: T0114730~CAT alignment id: ENST00000387416.2-0~CAT source transcript id: ENST00000387416.2~CAT source GENCODE transcript biotype: Mt_tRNA <12486> [lcl|000113F_1_7786117_quiver_pilon:c372183-372115] [lcl|000113F_1_7786117_quiver_pilon: raw, dna len= 7793092]

ERROR: valid [SEQ_FEAT.MissingTrnaAA] Missing encoded amino acid qualifier in tRNA FEATURE: tRNA: MT-TD_1~CAT transcript id: T0114731~CAT alignment id: ENST00000387419.1-1~CAT source transcript id: ENST00000387419.1~CAT source GENCODE transcript biotype: Mt_tRNA <12488> [lcl|000113F_1_7786117_quiver_pilon:372187-372254] [lcl|000113F_1_7786117_quiver_pilon: raw, dna len= 7793092]

ERROR: valid [SEQ_FEAT.MissingTrnaAA] Missing encoded amino acid qualifier in tRNA FEATURE: tRNA: MT-TK_1~CAT transcript id: T0114732~CAT alignment id: ENST00000387421.1-2~CAT source transcript id: ENST00000387421.1~CAT source GENCODE transcript biotype: Mt_tRNA <12490> [lcl|000113F_1_7786117_quiver_pilon:372939-373008] [lcl|000113F_1_7786117_quiver_pilon: raw, dna len= 7793092]

ERROR: valid [SEQ_FEAT.MissingTrnaAA] Missing encoded amino acid qualifier in tRNA FEATURE: tRNA: MT-TH_1~CAT transcript id: T0114735~CAT alignment id: ENST00000387441.1-1~CAT source transcript id: ENST00000387441.1~CAT source GENCODE transcript biotype: Mt_tRNA <12496> [lcl|000113F_1_7786117_quiver_pilon:376779-376847] [lcl|000113F_1_7786117_quiver_pilon: raw, dna len= 7793092]
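For the four mitochondrial tRNAs above, the amino acid can be read off the gene symbol. The sketch below assumes the naming pattern in which the letter after `MT-T` is the one-letter amino acid code (MT-TS1, MT-TD, MT-TK, MT-TH), and falls back to "tRNA-Xxx" for anything else, as suggested in [c]. `trna_product` is a hypothetical helper and expects the symbol without the `_1` suffix.

```python
import re

ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Assumed symbol pattern: MT-T + one-letter amino acid + optional digit.
MT_TRNA = re.compile(r"^MT-T([A-Z])\d?$")


def trna_product(symbol: str) -> str:
    """Return 'tRNA-Xyz' for a mitochondrial tRNA symbol, else 'tRNA-Xxx'."""
    match = MT_TRNA.match(symbol)
    amino_acid = ONE_TO_THREE.get(match.group(1)) if match else None
    return f"tRNA-{amino_acid}" if amino_acid else "tRNA-Xxx"
```

With this mapping the four features above would become tRNA-Ser, tRNA-Asp, tRNA-Lys and tRNA-His respectively.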

[d] mRNAgeneRange & CDSgeneRange. The gene needs to include its mRNA & CDS features.

WARNING: valid [SEQ_FEAT.mRNAgeneRange] gene [lcl|000158F_1_5482593_quiver_pilon:c4100491-4100201:CK280_G0037675] overlaps mRNA but does not completely contain it FEATURE: mRNA: mRNA-CAT transcript id: T0132287~CAT alignment id: augCGP-10540.t1~CAT source transcript id: nan~CAT novel prediction: augCGP <52278> [(lcl|000158F_1_5482593_quiver_pilon:c4109103-4109056, c4107380-4107261, c4106367-4106308, c4106126-4106046, c4105192-4105154, c4104891-4104844, c4104755-4104671, c4104575-4104466, c4103423-4103337, c4103230-4103123, c4102927-4102840, c4102427-4102171, c4100598-4100523, c4100090-4099999, c4099423-4099323, c4099240-4099143, c4099047-4098892, c4098807-4098797, c4081755-4081504)] [lcl|000158F_1_5482593_quiver_pilon: raw, dna len= 5487571]

WARNING: valid [SEQ_FEAT.CDSgeneRange] gene [lcl|000158F_1_5482593_quiver_pilon:c4100491-4100201:CK280_G0037675] overlaps CDS but does not completely contain it FEATURE: CDS: /orig_transcript_id=gnl|CK280|T0132287 <51984> [(lcl|000158F_1_5482593_quiver_pilon:c4109103-4109056, c4107380-4107261, c4106367-4106308, c4106126-4106046, c4105192-4105154, c4104891-4104844, c4104755-4104671, c4104575-4104466, c4103423-4103337, c4103230-4103123, c4102927-4102840, c4102427-4102171, c4100598-4100523, c4100090-4099999, c4099423-4099323, c4099240-4099143, c4099047-4098892, c4098807-4098797, c4081755-4081504)] [lcl|000158F_1_5482593_quiver_pilon: raw, dna len= 5487571]
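The containment rule in [d] can be checked, and the gene widened when it fails, with a sketch like the one below. It is an illustration only; the function names are hypothetical, and widening the gene is just one possible fix (the underlying cause may instead be a transcript assigned to the wrong gene).

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]


def span(intervals: Sequence[Interval]) -> Interval:
    """Lowest and highest coordinate covered, regardless of strand."""
    coords = [coord for pair in intervals for coord in pair]
    return min(coords), max(coords)


def gene_contains(gene: Interval, child_intervals: Sequence[Interval]) -> bool:
    """True when the gene span fully contains the child feature's span."""
    gene_low, gene_high = span([gene])
    child_low, child_high = span(child_intervals)
    return gene_low <= child_low and child_high <= gene_high


def widened_gene(gene: Interval, child_intervals: Sequence[Interval]) -> Interval:
    """Gene interval extended to cover the child, keeping strand orientation."""
    low, high = span([gene] + list(child_intervals))
    start, stop = gene
    # Minus-strand features are written with start > stop.
    return (high, low) if start > stop else (low, high)
```

Applied to CK280_G0037675 with just the first and last mRNA intervals from the warning above, `gene_contains((4100491, 4100201), [(4109103, 4109056), (4081755, 4081504)])` is False and `widened_gene` returns (4109103, 4081504).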

[e] DuplicateFeat. These 5 genes have the same location. Why? Should any of them move or be merged?

WARNING: valid [SEQ_FEAT.DuplicateFeat] Features have identical intervals, but labels differ FEATURE: Gene: CK280_G0034171 <25679> [lcl|000124F_1_7476676_quiver_pilon:c107627-74834] [lcl|000124F_1_7476676_quiver_pilon: raw, dna len= 7482652]

WARNING: valid [SEQ_FEAT.DuplicateFeat] Features have identical intervals, but labels differ FEATURE: Gene: CK280_G0036267 <40688> [lcl|000147F_1_5954191_quiver_pilon:c4761692-4760690] [lcl|000147F_1_5954191_quiver_pilon: raw, dna len= 5963735]

WARNING: valid [SEQ_FEAT.DuplicateFeat] Features have identical intervals, but labels differ FEATURE: Gene: CK280_G0037661 <52582> [lcl|000158F_1_5482593_quiver_pilon:3981636-3991317] [lcl|000158F_1_5482593_quiver_pilon: raw, dna len= 5487571]

WARNING: valid [SEQ_FEAT.DuplicateFeat] Features have identical intervals, but labels differ FEATURE: Gene: CK280_G0040765 <78520> [lcl|000196F_1_4064470_quiver_pilon:c2013577-2007066] [lcl|000196F_1_4064470_quiver_pilon: raw, dna len= 4071234]

WARNING: valid [SEQ_FEAT.DuplicateFeat] Features have identical intervals, but labels differ FEATURE: Gene: CK280_G0040766 <78522> [lcl|000196F_1_4064470_quiver_pilon:c2013577-2007066] [lcl|000196F_1_4064470_quiver_pilon: raw, dna len= 4071234]

[f] FeatContentDup. There are 189 individual instances, but 424 duplicates. For example, here are 3 different instances listed below. I think you should just include one misc_feature to indicate the frameshift that was incorporated into the CDS location.

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <757> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <759> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <771> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <773> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <776> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <778> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <780> [lcl|000100F_1_9097766_quiver_pilon:370167-370168] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <1148> [lcl|000100F_1_9097766_quiver_pilon:3346704-3346704] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.FeatContentDup] Duplicate feature FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <1733> [lcl|000100F_1_9097766_quiver_pilon:c8408055-8408054] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]
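One way to emit a single misc_feature per gap, as [f] asks, is to de-duplicate on location and note before writing the .tbl. This is only a sketch: the dictionary layout and the function name are assumptions, not part of any existing pipeline.

```python
def dedupe_gap_features(features):
    """Keep the first feature for each (seq_id, start, stop, note) key, preserving order."""
    seen = set()
    unique = []
    for feat in features:
        key = (feat["seq_id"], feat["start"], feat["stop"], feat["note"])
        if key not in seen:
            seen.add(key)
            unique.append(feat)
    return unique


NOTE = "gap added in CDS to maintain frame, possibly due to error in genome"
SEQ = "000100F_1_9097766_quiver_pilon"
gaps = [
    {"seq_id": SEQ, "start": 370167, "stop": 370168, "note": NOTE},
    {"seq_id": SEQ, "start": 370167, "stop": 370168, "note": NOTE},
    {"seq_id": SEQ, "start": 3346704, "stop": 3346704, "note": NOTE},
]
print(len(dedupe_gap_features(gaps)))
```

Keeping strand out of the key is deliberate here, since start and stop already encode orientation in .tbl coordinates.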

[g] CDSmRNArange.

WARNING: valid [SEQ_FEAT.CDSmRNArange] mRNA contains CDS but internal intron-exon boundaries do not match FEATURE: CDS: LITAF_1 <36> [(lcl|000100F_1_9097766_quiver_pilon:<292223-292307, 370130-370166, 370169-370307)] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156] -> [lcl|T0109302_1_prot]

WARNING: valid [SEQ_FEAT.CDSmRNArange] mRNA contains CDS but internal intron-exon boundaries do not match FEATURE: CDS: LITAF_3 <38> [(lcl|000100F_1_9097766_quiver_pilon:370135-370166, 370169-370353, 373170-373326, 377114-377222)] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156] -> [lcl|T0109304_3_prot]

WARNING: valid [SEQ_FEAT.CDSmRNArange] mRNA contains CDS but internal intron-exon boundaries do not match FEATURE: CDS: LITAF_4 <39> [(lcl|000100F_1_9097766_quiver_pilon:<370135-370166, 370169-370353, 373170-373326, 377114->377197)] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156] -> [lcl|T0109305_4_prot]

Add the exception to the CDS, and/or include "-c s" in the command line. AND include the frameshift in the mRNA location too.

000100F_1_9097766_quiver_pilon

8428337	8428262	mRNA
8422727	8422644
8418327	8418194
8415062	8414921
8411817	8411745
8408071	8408056
8408055	8407965
8405868	8405813
8402535	8400988
			transcript_id	gnl|CK280|T0110052
			exception	low-quality sequence region
			product	CLUAP1_4
			protein_id	T0110052_4_prot
			note	CAT transcript id: T0110052
			note	CAT alignment id: ENST00000572600.5-0
			note	CAT source transcript id: ENST00000572600.5
			note	CAT source GENCODE transcript biotype: protein_coding
8422724	8422644	CDS
8418327	8418194
8415062	8414921
8411817	8411745
8408071	8408056
8408053	8407965
8405868	8405813
8402535	8402386
			codon_start	1
			product	CLUAP1_4
			protein_id	T0110052_4_prot
			transcript_id	gnl|CK280|T0110052
			note	gaps were added to CDS to maintain frame
8408055	8408054	misc_feature
			note	gap added in CDS to maintain frame, possibly due to error in genome

The CDS got the /exception when I included "-c s":

 CDS             complement(join(8402386..8402535,8405813..8405868,
                 8407965..8408053,8408056..8408071,8411745..8411817,
                 8414921..8415062,8418194..8418327,8422644..8422724))
                 /locus_tag="CK280_G0031283"
                 /artificial_location="low-quality sequence region"
                 /note="gaps were added to CDS to maintain frame"
                 /codon_start=1
                 /product="CLUAP1_4"
                 /translation="MRAEAIARPLEINETEKVMRIAIKEILTQVQKTKDLLNNVASDE

AND include the frameshift in the mRNA location too.

[h] NoProtein = no CDS product in the .tbl file for a non-pseudo or non-pseudogene CDS

Need a "product" on each non-pseudo CDS & its mRNA. They're missing on these examples:

Features 000104F_1_8731062_quiver_pilon
2116	1880866	gene
			locus_tag	CK280_G0031754
2116	2307	mRNA
3067	3243
			transcript_id	gnl|CK280|T0111823
			protein_id	T0111823_1_prot
			note	CAT transcript id: T0111823
			note	CAT alignment id: augCGP-12597.t1
			note	CAT source transcript id: nan
			note	CAT novel prediction: augCGP
2116	2307	CDS
3067	3243
			codon_start	1
			protein_id	T0111823_1_prot
			transcript_id	gnl|CK280|T0111823
1879603	1879199	mRNA
1878392	1878333
			transcript_id	gnl|CK280|T0111863
			protein_id	T0111863_2_prot
			note	CAT transcript id: T0111863
			note	CAT alignment id: augCGP-12605.t1
			note	CAT source transcript id: nan
			note	CAT novel prediction: augCGP
1879603	1879199	CDS
1878392	1878333
			codon_start	1
			protein_id	T0111863_2_prot
			transcript_id	gnl|CK280|T0111863
1880866	1880458	mRNA
1879966	1879872
			transcript_id	gnl|CK280|T0111864
			protein_id	T0111864_3_prot
			note	CAT transcript id: T0111864
			note	CAT alignment id: augCGP-12606.t1
			note	CAT source transcript id: nan
			note	CAT novel prediction: augCGP
1880866	1880458	CDS
1879966	1879872
			codon_start	1
			protein_id	T0111864_3_prot
			transcript_id	gnl|CK280|T0111864

[i] MissingGeneXref. Include the locus_tag on each mRNA & CDS to assure they're properly linked to the gene. This isn't always necessary, but it is when there is a lot of gene overlap. Here are some examples.

ERROR: valid [SEQ_FEAT.MissingGeneXref] Feature overlapped by 2 identical-length genes but has no cross-reference FEATURE: mRNA: NAIP_7 <25672> [(lcl|000124F_1_7476676_quiver_pilon:c80951-80883, c79013-78849, c76295-76173, c75198-74977)] [lcl|000124F_1_7476676_quiver_pilon: raw, dna len= 7482652]

ERROR: valid [SEQ_FEAT.MissingGeneXref] Feature overlapped by 2 identical-length genes but has no cross-reference FEATURE: misc_feature: gap added in CDS to maintain frame, possibly due to error in genome <25673> [lcl|000124F_1_7476676_quiver_pilon:c80951-80950] [lcl|000124F_1_7476676_quiver_pilon: raw, dna len= 7482652]

ERROR: valid [SEQ_FEAT.MissingGeneXref] Feature overlapped by 2 identical-length genes but has no cross-reference FEATURE: ncRNA: T0120387_1 <25678> [(lcl|000124F_1_7476676_quiver_pilon:c107627-107546, c105942-105891, c105610-105491, c102632-102533, c92531-92450, c91457-91400, c90545-88434, c84701-84534, c74910-74834)] [lcl|000124F_1_7476676_quiver_pilon: raw, dna len= 7482652]

[j] NoStop. It looks like the partial is at the wrong end for the minus strand features (at least some). Here's an example-

000104F_1_8731062_quiver_pilon

<1632815	1632609	CDS
1598218	1598045
1595272	1595173
1595067	1595060
			codon_start	1
			product	FER_8
			protein_id	T0111856_8_prot
			transcript_id	gnl|CK280|T0111856

BUT should have been like this because the stop codon is missing, not the start:

1632815	1632609	CDS
1598218	1598045
1595272	1595173
1595067	>1595060
			codon_start	1
			product	FER_8
			protein_id	T0111856_8_prot
			transcript_id	gnl|CK280|T0111856

 CDS             complement(join(1595060..1595067,1595173..1595272,
                 1598045..1598218,1632609..>1632815))
                 /locus_tag="CK280_G0031768"
                 /codon_start=1
                 /product="FER_8"
                 /translation="MGFGSDLKNSHEAVLKLQDWELRLLETVKKFMALRIKSDKEYAS
                 TLQNLCNQVDKESTVQMNYVSNVSKSWLLMIQQTEQLSRIMKTHAEDLNSGPLHRLTM
                 MIKDKQQVKKSYIGVHQQIEAEMIKVTKTELEKLKCSYRQLIKEMNSAKEKYKEALAK
                 DKK"

NOTE: the format is always that '<' is in column 1 and '>' is in column 2, for both plus AND minus strand features. See https://www.ncbi.nlm.nih.gov/genbank/examples.wgs/#complementary_strand.
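The column rule in the NOTE can be sketched as a small formatter: '<' is attached to the start (column 1) of the first interval for a 5' partial, and '>' to the stop (column 2) of the last interval for a 3' partial, whatever the strand. The function name and argument layout are assumptions made for illustration.

```python
def tbl_intervals(exons, strand, five_partial=False, three_partial=False):
    """Format exons as .tbl 'start<TAB>stop' lines in transcript order.

    exons: list of (low, high) 1-based genomic coordinates, in any order.
    strand: '+' or '-'.
    """
    # Transcript order is ascending on the plus strand, descending on the minus strand.
    ordered = sorted(exons, reverse=(strand == "-"))
    rows = []
    for low, high in ordered:
        start, stop = (low, high) if strand == "+" else (high, low)
        rows.append([str(start), str(stop)])
    if five_partial:
        rows[0][0] = "<" + rows[0][0]    # column 1 of the first interval
    if three_partial:
        rows[-1][1] = ">" + rows[-1][1]  # column 2 of the last interval
    return ["\t".join(row) for row in rows]


# FER_8 from the example above: minus strand, stop codon missing (3' partial).
fer8 = [(1595060, 1595067), (1595173, 1595272), (1598045, 1598218), (1632609, 1632815)]
for line in tbl_intervals(fer8, "-", three_partial=True):
    print(line)
```

For FER_8 this reproduces the corrected layout, with the first line starting at 1632815 and the last line ending in >1595060.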

[k] InconsistentPseudogeneValue. Expecting the same type (or absence) of pseudogene on the gene & its mRNAs/CDS. Here are some example errors.

WARNING: valid [SEQ_FEAT.InconsistentPseudogeneValue] Different pseudogene values on mRNA (unitary) and gene (unqualified) FEATURE: mRNA: RBFOX1_24 <1192> [(lcl|000100F_1_9097766_quiver_pilon:c6699122-6698832, c6470934-6470896, c6338496-6338259, c6338251-6337242)] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

WARNING: valid [SEQ_FEAT.InconsistentPseudogeneValue] Different pseudogene values on mRNA (unitary) and gene (unqualified) FEATURE: mRNA: ZSCAN10_2 <1947> [(lcl|000100F_1_9097766_quiver_pilon:8819971-8820076, 8826258-8826720, 8827434-8827498, 8827665-8827722, 8828611-8830268)] [lcl|000100F_1_9097766_quiver_pilon: raw, dna len= 9108156]

diekhans commented 6 years ago

From: NCBI Genomes genomes@ncbi.nlm.nih.gov
To: Mark Diekhans markd@soe.ucsc.edu
CC: Ian Fiddes ian.t.fiddes@gmail.com, NCBI Genomes genomes@ncbi.nlm.nih.gov
Subject: RE: annotation input for ape genome (NBAG00000000)
Date: Fri, 20 Oct 2017 20:54:21 +0000

Hi Ian and Mark,

I ran your Clint_Chimp.V1.tbl file on just the first 100 sequences, because tbl2asn is unable to work when files get to be 2G or bigger. I looked at the errors and the annotation, and have a bunch of comments for you. I'm getting advice about some of the ncRNAs, so there might be a letter about some of those next week.

Please let us know of any questions or problems about today's comments below.

Thanks, Karen

[1] CDS product names

A. I see that you are using the format SYMBOL_1, SYMBOL_2, etc.

Please use "protein SYMBOL isoform N" (where SYMBOL is the gene symbol).

eg "protein ARTN isoform 6", not "ARTN_6"

The RefSeq pipeline projects the human SwissProt names, with an "isoform N" suffix. That's an excellent name source for vertebrates -- very few names that aren't transferrable verbatim. They don't fuss over using the same "isoform N" identifier for the equivalent isoform in different species.

Can you adopt similar rules?

The request to use "protein SYMBOL" rather than just "SYMBOL" comes from the UniProt protein naming guidelines.
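A rough sketch of the renaming asked for in A, assuming the current product names are a plain gene symbol followed by an underscore and a number. The function name and the symbol pattern are assumptions; names that are accession numbers or "HGNC:ID_x" identifiers are deliberately left unconverted so they can be handled separately.

```python
import re

# Hypothetical pattern: a symbol made of letters, digits or hyphens, then _N.
_SYMBOL_SUFFIX = re.compile(r"(?P<symbol>[A-Za-z0-9-]+)_(?P<n>\d+)")


def isoform_product_name(old_name):
    """Convert e.g. 'ARTN_6' to 'protein ARTN isoform 6'; return None if it does not fit."""
    match = _SYMBOL_SUFFIX.fullmatch(old_name)
    if match is None:
        return None
    return f"protein {match.group('symbol')} isoform {match.group('n')}"


print(isoform_product_name("ARTN_6"))
print(isoform_product_name("HGNC:18790_1"))
```

Returning None rather than guessing keeps the unusual names visible for manual review.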

B. some are nucleotide accession.version, eg AP000146.1. That's not allowed, so please use the product name that's on the other record if it conforms to the UniProt guidelines OR use 'hypothetical protein' or 'uncharacterized protein'.

You can also add an inference to point to that nucleotide record (see https://www.ncbi.nlm.nih.gov/genbank/evidence/), eg:

inference   similar to DNA sequence|INSD|AP000146.1

C. some have the format "HGNC:ID_x". Instead, please use the Approved symbol (or Approved name if it conforms to the UniProt protein naming guidelines).

eg "protein NSG1" instead of "HGNC:18790_1"

[2] ncRNA product names

A. don't 'uniquify' the ncRNAs with _1, etc; just call them by their products.

examples:

count name

  1 LINC00309_1
  1 LINC00310_1
  1 LINC00310_2
  1 LINC00310_3
  6 U2_1
 49 U3_1

B. these are the only non-capitalized/non-gene symbol ones. Are they expected?

count name

  1 hsa-mir-1253_1
  1 hsa-mir-3119-1_1
  1 hsa-mir-3130-1_1
  1 hsa-mir-3158-1_1
  1 hsa-mir-3607_1
  1 hsa-mir-4536-1_1
  1 hsa-mir-4773-1_1
  1 hsa-mir-4776-1_1
  1 hsa-mir-548d-1_1
  1 hsa-mir-548d-2_1
  1 hsa-mir-550a-1_1
  1 hsa-mir-550a-2_1
  2 mascRNA-menRNA_1
  7 pRNA_1
  2 snoMBII-202_1
  3 snoMe28S-Am2634_1
  1 snoR1_1
  6 snoU109_1
 29 snoU13_1
  1 snoU18_1
  2 snoU2-30_1
  4 snoU2_19_1
  1 snoU83B_1
  1 snoZ196_1
  1 snoZ278_1
  1 snoZ40_1
  4 snoZ6_1
  1 snosnR66_1
 36 uc_338_1

C. isn't this actually "RNase_P_RNA" as the ncRNA_class & "RNase P RNA" without underscores as the product?

  7 pRNA_1

D. What's this? should the product be 'MIR338' with 'miRNA' as the ncRNA_class?

 36 uc_338_1

E. don't use systematic name for ncRNA product- T0109237_1. Especially when the class is 'other'. What is this?

2127	2019	gene
			locus_tag	CK280_G0031091
			pseudogene	unprocessed
2127	2019	ncRNA
			transcript_id	gnl|CK280|T0109237
			ncRNA_class	other
			product	T0109237_1
			protein_id	T0109237_1_prot
			note	CAT transcript id: T0109237
			note	CAT alignment id: ENST00000612457.1-0
			note	CAT source transcript id: ENST00000612457.1
			note	CAT source GENCODE transcript biotype: unprocessed_pseudogene

 gene            4039059..4044991
                 /locus_tag="CK280_G0031194"
 ncRNA           join(4039059..4039201,4044690..4044991)
                 /ncRNA_class="lncRNA"
                 /locus_tag="CK280_G0031194"
                 /product="T0109601_1"              
                 /note="CAT transcript id: T0109601;
                 CAT alignment id: ENST00000567103.1-0;
                 CAT source transcript id: ENST00000567103.1;
                 CAT source GENCODE transcript biotype: lincRNA"

[3] Errors. Data errors in the first 100 sequences, after making these changes:

[a] This CDS cannot be translated and has to be removed or flagged as /pseudo or /pseudogene in order to run tbl2asn

<9093066	9093064	CDS
			codon_start	1
			product	NPIPB12_4
			protein_id	T0110400_4_prot
			transcript_id	gnl|CK280|T0110400

That's true for any CDS that just includes a stop codon or less than a full codon. (my test was just the first 100 sequences)
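A small sketch for finding such CDS features before running tbl2asn, assuming intervals are (start, stop) pairs as in the .tbl. It only measures length, so a 3-base CDS that is not a stop codon would also be flagged; the function names are hypothetical.

```python
def cds_length(intervals):
    """Total number of bases covered by the CDS intervals, on either strand."""
    return sum(abs(start - stop) + 1 for start, stop in intervals)


def too_short_to_translate(intervals, codon_start=1):
    """True when at most one codon remains once codon_start is applied."""
    return cds_length(intervals) - (codon_start - 1) <= 3


# NPIPB12_4 from the example above: a single 3-base interval.
print(too_short_to_translate([(9093066, 9093064)]))
```

Features flagged this way would then be removed or marked /pseudo or /pseudogene, as described above.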

[b] There were 50 ShortIntron errors in the first 100 sequences of my test, so I ran it again and included "-c s" in the command line to automatically add the exception for short introns.

You should run:

tbl2asn -M n -c s -i First100.fsa -j "[organism=Pan troglodytes]" -t SubmissionTemplate.sbt -f Clint_Chimp.V1.kc.tbl -o First100.annot.sqn &
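Because tbl2asn could not handle the full FASTA, the test above ran on a 100-sequence subset. A sketch of cutting such a subset out of a large FASTA without loading it into memory follows; the function name is an assumption, and the file names in the usage comment are only placeholders.

```python
def take_fasta_records(lines, first, count):
    """Yield the lines of `count` FASTA records starting at 0-based record index `first`."""
    index = -1
    for line in lines:
        if line.startswith(">"):
            index += 1
            if index >= first + count:
                return  # past the requested window, stop reading
        if index >= first:
            yield line


demo = [">a\n", "AC\n", ">b\n", "GT\n", ">c\n", "TT\n"]
print(list(take_fasta_records(demo, 1, 1)))

# Usage with real files (names are placeholders):
# with open("all_contigs.fsa") as src, open("First100.fsa", "w") as dst:
#     dst.writelines(take_fasta_records(src, 0, 100))
```

Calling it with first=300 and count=100 would give the 301st to 400th sequences used in the earlier test.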

[4] Here is the list of errors with a brief description of what needs to be done. Additional detail for some of them is below.

I'll also try to post the full error file back to the SUB2821604 submission. I hope it's there now.

=================================================================
5780 ERROR-level messages exist

SEQ_FEAT.MissingGeneXref 19 FIX- include locus_tag on mRNAs & CDS to assure correct linkage to gene
SEQ_FEAT.CDSmRNAXrefLocationProblem 2 FIX
SEQ_FEAT.AbuttingIntervals 556 we need to quiet this when the 'low-quality sequence region' exception is present
SEQ_FEAT.ShortIntron 7 we need to quiet the error for /pseudo & /pseudogene
SEQ_FEAT.MissingQualOnFeature 2183 FIX- missing ncRNA_class on ncRNA features
SEQ_FEAT.NoStop 935 FIX- partial symbols on minus-strand CDS
SEQ_FEAT.PartialProblem 1771 FIX- have stop codon but 3' is partial
SEQ_FEAT.MissingTrnaAA 4 FIX
SEQ_FEAT.PseudoCdsViaGeneHasProduct 55 FIX- use 'note', not 'product' when gene is /pseudo or /pseudogene
SEQ_FEAT.WrongQualOnFeature 31 FIX- illegal ncRNA_class on mRNA
SEQ_FEAT.NoProtein 87 FIX- include product for non-pseudo/pseudogene CDS
SEQ_FEAT.MissingCDSproduct 87 FIX (same as previous)
SEQ_FEAT.FeatureProductInconsistency 43 FIX- I think this is a different report of those in the NoProtein category

=================================================================
41510 WARNING-level messages exist

SEQ_FEAT.FeatContentDup 424 FIX: misc_features describing the inserted gap in the CDS. Just 1 per gap
SEQ_FEAT.mRNAgeneRange 1 FIX
SEQ_FEAT.CDSgeneRange 1 FIX (same gene as previous error)
SEQ_FEAT.InconsistentPseudogeneValue 259 FIX- use the same type of pseudogene on the gene & its parts
SEQ_FEAT.CDSmRNAmismatch 17769 FIX. use "gnl|CK280|T0109603_prot" format for the protein_ids
SEQ_FEAT.DuplicateFeat 5 INVESTIGATE. Should genes move or be merged?
SEQ_FEAT.CDSmRNArange 552 FIX. Use the same 'frameshift' in the mRNA that you put in the CDS
SEQ_FEAT.PartialProblem 14703 FIX?