Additional release.gff3 stats

swarbred commented 4 years ago

Hi @cschu A minor enhancement Can we extract / summarise some additional details about the final release.gff3

Currently we produce:

urobe_EIv1.0.sanity_checked.release.gff3.mikado_stats.txt (full output of mikado stats) urobe_EIv1.0.sanity_checked.release.gff3.mikado_stats.txt.summary (summary of mikado stats)

urobe_EIv1.0.sanity_checked.release.gff3.pep.raw.protein_status.tsv (indication if the "protein" is complete) urobe_EIv1.0.sanity_checked.release.gff3.pep.raw.protein_status.summary (corresponding summary of protein completeness)

What generates the protein completeness info (gffread or is there a separate script), I couldn't tell from the logs?

Can we also

1) run mikado util stats with the --tab-stats option and create a .mikado_stats.tsv file

2) Can we also extract and summarise the biotype and confidence info i.e. generate a similar .tsv and .summary file i.e. give a summary of numbers of each combination of biotype and confidence class.

   3067 biotype=predicted_gene;confidence=Low
  12696 biotype=protein_coding_gene;confidence=High
   2340 biotype=protein_coding_gene;confidence=Low

3) Currently it looks like we are using the version of gffread in source cufflinks-2.2.1_gk. I don't think we are using cufflinks for anything other than gffread so we should switch to using https://github.com/gpertea/gffread as this has some additional options (we have gffread-0.11.6_CBG installed).

It looks like we run gff read to generate the fasta files in the same step we can also create a tbl.tsv file with some additional info based on the gff i.e. adding -P and --table , this overlaps with the mikado stats --tab-stats info but also pulls in some additional info (this includes partialness if this isn't the way you are currently generating this)

4) potentially these .tsv files from 1-3 could be combined into a single file (ignoring the duplicate info shared by mikado stats and the gffread table file), gemy currently does this for the final relase file see /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Data_Package/MYZPE13164_O_EIv2.1_Frozen_release_17Jan2020/EIv2.1/annotation/MYZPE13164_O_EIv2.1.annotation.gff3.transcript-stats.txt

this is mikado stats tab stats file with the confidence biotype info added so we would just be adding the InFrameStop (gffread) and partialness info (from your current process or gffread) to this

gffread urobe_EIv1.0.sanity_checked.release.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o test.tbl -g /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.3.1/mikado-2.0rc6_d094f99_CBG/GMC_R1/genome_assembly_ub-lib27480_disc.min1kb.basid-nhit.fas.release.fasta -W -w urobe_EIv1.0.sanity_checked.release.gff3.cdna.fasta -x urobe_EIv1.0.sanity_checked.release.gff3.cds.fasta -y urobe_EIv1.0.sanity_checked.release.gff3.pep.raw.fasta

cschu commented 4 years ago

Hi @swarbred ,

protein completeness information is generated by a port of Gemy's script (https://github.com/EI-CoreBioinformatics/gmc/blob/master/gmc/scripts/protein_completeness.py), which is run as a local rule.

1,2,3,4: I'll get on it.

cschu commented 4 years ago

1-3 are implemented in d26355f (gmc-0.4 on hpc)

cschu commented 4 years ago

As of c0cc780, gmc will generate a .final_table.tsv in the following format:

#1.TID #2.GID #3.Coordinates #4.Strand #5.Exon number #6.cDNA length #7.Intronic length #8.CDS length #9.# coding exons #10.cDNA/CDS ratio #11.5'UTR #12.5'UTR exons #13.3'UTR #14.3'UTR exons #15.Confidence #16.Biotype #17.Partialness #18.InFrameStop #19.Partialness_gr

Columns 1-14 are the mikado stats output. Columns 15-16 (confidence, biotype) Column 17 is partialness from protein_completeness Columns 18-19 are from the "new" gffread, however InFrameStop does not seem to have data and Partialness_gr (gr=gffread) does not correlate with the protein_completeness partialness.

swarbred commented 4 years ago

Columns 18-19 are from the "new" gffread, however InFrameStop does not seem to have data and Partialness_gr (gr=gffread) does not correlate with the protein_completeness partialness.

The inframestop is there as a check for me, we shouldn't have such cases if mikado is working as intended but we have had bugs before that have led to inframe stops.

copied from GENANNO-466

the results of the partialness script look to be odd (and disagree with the similar gffread stats)

3prime_partial 61408 fragment 12100 in-frame_stop 5

Looks like the issue is with the OREMO8127_EIv1.0.sanity_checked.release.gff3.pep.raw.fasta that's used as input (the info line is added to the header and no "." at the end of the seq). I think [~schudomc] at my request we have changed version of gffread and this is no longer compatible with the port of gemys script i.e. "." is no longer added to the sequence.

I suggest we run gffread twice once to extract the sequences and again to generate the --table option the alternative is to cean up the protein fasta file after generating. if the . is no longer added at the end of the sequence then we can remove our own partialness script as this is covered by the gffread --table output anyway. [~kaithakg] does that sound ok?

>OREMO8127_EIv1.0_0000010.1 Oreochromis_mossambicus_EIV1_scaffold_0000001   4336    9240    +   2   2201    1110    OREMO8127_EIv1.0_0000010    High    True    protein_coding_gen.
MFKTEKLRAVVNERLQAAAEEIFNLFEEAVRDYEKEVFRSKQEVEQQRRSLAEWKVPLQFSVQEDGACSE
QDHAEKKTVLCEDDREPQIKEEQEVELWSAPVCEKQEEADTKDAMFNIIYVERGEDGKHSENLSQTVRNG
EEDRSPSCSARTVKQSCGASEPTSECPSSEISDDEQEDSVKLHSGVNSKREEKSGETVEKPYTCSVCNKN
FRNKSILTRHMKTHTGEKPHSCSVCGKSFIQRSYLQTHMNSHSGQKPHTCSFCGKGFTQVGNMNAHMRIH
TGEKPHSCNDCGKKFREKADLIKHTVIHTGEKPYGCTLCHIKFSAQSNLTRHMKTHSGERPYSCTACGKR

gemygk commented 4 years ago

Posting comments from JIRA here:

Hi [~swarbred]

If the --table output has the information then we can avoid the script. Do you know if --table format gives in-frame stop stats?

But it looks like the latest version of gffread (gffread-0.11.6_CBG) does not add . or to stop codon. There is an option to chose the stop codon symbol. {code} -S for -y option, use '' instead of '.' as stop codon translation {code}

Thanks, Gemy

David's reply:

Gemy Kaithakottil Yes if present and you indicate the correct flag (which we do)

But it looks like the latest version of gffread (gffread-0.11.6_CBG) does not add . or * to stop codon. There is an option to chose the stop codon symbol.

From my checks it's not being added to the end of the sequence so I assume this has changed and this only refers to internal stops.

cschu commented 4 years ago

Ok, separating the gffread runs still did not add '*' (nor '.') to the end of the protein sequences (same behaviour with and without -S).

I had a look at the gpertea-gffread code, finding that it has explicit instructions to prevent the stop codons from ever printing. I removed these three instructions in our gmc-dependencies container definition:

mv gffread.cpp gffread.cpp.bak
head -n 648 gffread.cpp.bak >> gffread.cpp
tail -n +649 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +650 gffread.cpp.bak | head -n 6 >> gffread.cpp
tail -n +656 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +657 gffread.cpp.bak | head -n 89 >> gffread.cpp
tail -n +746 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +747 gffread.cpp.bak >> gffread.cpp
rm gffread.cpp.bak

Long story short, gmc-0.5 (e6892d5) now has its raw peptides terminated with dots, but these will be lost after processing with prinseq. The question is: do we need to process with prinseq? gffread already does 70aa / line, it just doesn't filter empty sequences (would we have those at all?)

The partialness data in the final table is now derived from gffread (renaming the categories to be more readable). Does this mean, we can get rid of the protein completeness analysis step?

@gemygk @swarbred

swarbred commented 4 years ago

@cschu @gemygk

I had a look at the gpertea-gffread code, finding that it has explicit instructions to prevent the stop codons from ever printing.

Yes that's what i assumed i.e. they have made a change to not indicate a stop at the end of the sequence. That seems sensible and in agreement with what most other databases provide (e.g. if you down load protein seqs from ensembl), so I'm fine with this, it's only an issue as Gemys script for the completeness utilised this. If we use the --table option with the InFrameStop and partialness tag then we get what we want in the --table output and don't need to use the original script.

I have checked and you still get internal stops marked in the fasta sequence (here as ".") if they are present (with the gffread-0.11.6_CBG version) and the InFrameStop stet to true in the table output e.g.

scaffold_6      25975825        25980350        +       17      2690    2320    mikado.scaffold_6G1.1   .       .       .       .       true    .

>mikado.scaffold_6G1.1 Name=mikado.scaffold_6G1.1 alias=polished.pacbio_scaffold_6_region_C_LQ_sampleWzA5U6ZI|cb6490_c28/f1p14/2806.mrna1_isplit.0 canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 canonical_number=16 canonical_proportion=1.0 has_start_codon=True has_stop_codon=True primary=True multiexonic=True superlocus=Mikado_superlocus:scaffold_6+:25975825-25980350 InFrameStop=true
MSTYEEFIEQNEERDGVRLSWNVWPASRLEATRLVVPVGCLYQPLKERPGFTSNTV.SCTMYT.YLSSNS
KSFMSGGL.RKNVDM.FLFSM.FFSSSIWSNFRSTPTS.TYSCIFYH.ILNNES.NHATDIFVCCGYLYG
..RTWCS.RLFTNVS.FITTKCSCWLNYIWKCCSGPRIGL.RYFQVICV.RHKRLTTETNSGYIRNRKIC
CSNCSSWCASTASTTTATS.CTSKQIYTAGAKL.CEYVRLIK.AFKRSVACFTRNETFKEHRDGIINCC.
LIGVYVSTNWCTYNAVCWWSMFPRSRSCS..KVIRYNSFP.PNNER.CSIYEKSN.TLRYLSSSCCNQWT
YN.YICVCFRSNWIDGNEAML.YNWWPYGNG.FIQFIFIQANISKGIGSRS.KQFKNGIQCYLGSKNI.R
IENYGCNRTMRIIKYQKPMCI.L.LRFRWYMSMENV.LDA.HYMCIFLRNCQPTWITYTAGWTWLHSVYY
SVSTFQWL.TN.SHHIS.ELGRPCSEYDAC.CCI.SRSICRFNGSYGSEPC.N.G.SRCDALG.SYAYTL
VSKIW.LSKR.SK.FPIARKLQFISTVHVSFKKVSISTSF...S..NIIL.AHVDA.RCYPKFNHDTANS
V.L.F.W.ARTCTFGYQ.YST..NIIDGHIFPYFDIPWRDYCSMESNGLSK.TRV..LQAVASSPR..CS
GNSQNSIPNASVY.HRTRW.SGKIFTMQSKPISNT..YVCLWR.WWSTSFDR.CKLATVYGAFEKASCLI
QCI

I am assuming that the partialness info reported is correct (no reason to think otherwise) that being the case then we can just remove the step of using our own script for partialness and NOT make any changes to gffread to add in the terminal stops.

On the second issue of the table output line being added to the fasta header (see seq from earlier comment) then this seems to be only an issue in the gmc gffread version installed if you run this version and compare to running gffread-0.11.6_CBG then the gffread-0.11.6_CBG output is correct. So we should be able to extract the seqs and generate the table output in a single command with no issue

see output at /ei/projects/fabef620-281f-4a74-beb6-9adbe4f03af8/data/results/CB-GENANNO-466_Oreochromis_mossambicus_annotation/Analysis/gmc/gmc-0.4/mikado-2.0rc6_3f086b5e_CBG/GMC_R1/GMC_run1

from running

source gffread-0.11.6_CBG && gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gffread-0.11.6_CBG_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gffread-0.11.6_CBG_cdna.fasta -x gffread-0.11.6_CBG_cds.fasta -y gffread-0.11.6_CBG_pep.fasta

and

singularity exec /ei/software/cb/containers/gmc/x86_64/Singularity.img gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gmc_gffread_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gmc_gffread_cdna.fasta -x gmc_gffread_cds.fasta -y gmc_gffread_pep.fasta

see the difference in the fasta headers

I thought the gffread version running in gmc was the same as gffread-0.11.6_CBG but this doesn't appear to be the case.

gemygk commented 4 years ago

@swarbred @cschu

When we create the final protein models i.e. for external releases, we should NOT have any stop codon ('.' or '*') in the file. Many downstream processing tools (blast, interproscan, like) do not like this stop codon.

As @swarbred mentioned, I had used the stop codon feature to work out the inframe stops and completeness of the proteins. Now that we get all the information with the latest gffread we do not need my protein completeness script.

On another important note (This is what we discussed @cschu, I am not creating a new issue for this): we should run gffread twice:

First, to get all the transcripts (cDNA as well as ncRNA's if present, or to get transcripts that lacks CDS features) source gffread-0.11.6_CBG && gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gffread-0.11.6_CBG_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gffread-0.11.6_CBG_cdna.fasta and
Second, to get all the CDS and protein fasta. source gffread-0.11.6_CBG && gffread test.gff3 -g ../Oreochromis_mossambicus_EIV1.fa -W -x gffread-0.11.6_CBG_cds.fasta -y gffread-0.11.6_CBG_pep.fasta

The reason for this is that, if we have an input GFF with both mRNA and ncRNA features, then for example, by running gffread test.gff3 -g test.genome.fa -W -w test.cdna.fasta -x test.cds.fasta -y test.pep.fasta in the 'test.cdna.fasta' we only get the mRNA fasta sequences (those have CDS) and not the ncRNA fasta sequences, hence we need two steps. This was the issue with the gffread version bundled with cufflinks-2.2.1_CBG and I have not tested it on the current gffread-0.11.6_CBG version.

cschu commented 4 years ago

@gemygk @swarbred

Ok, 0.5 (a2604c3) now uses gffread-0.11.6 from the gmc dependencies container and adds in the partialness from gffread with the more descriptive categories (complete, 5prime_partial, 3prime_partial, fragment instead of ., 5, 3, 5_3). I will now remove the execution of Gemy's partialness script as well unless you want to retain the protein completeness analysis for the post pick proteins.

swarbred commented 4 years ago

On another important note (This is what we discussed @cschu, I am not creating a new issue for this): we should run gffread twice:

The reason for this is that, if we have an input GFF with both mRNA and ncRNA features, then for example, by running gffread test.gff3 -g test.genome.fa -W -w test.cdna.fasta -x test.cds.fasta -y test.pep.fasta in the 'test.cdna.fasta' we only get the mRNA fasta sequences (those have CDS) and not the ncRNA fasta sequences, hence we need two steps. This was the issue with the gffread version bundled with cufflinks-2.2.1_CBG and I have not tested it on the current gffread-0.11.6_CBG version.

This is not an issue in gffread-0.11.6_CBG so you can do this in one gffread job

cschu commented 4 years ago

in 1e70c19

swarbred commented 4 years ago

@cschu Looks like we still generate the protein completeness for the mikado pick models using the original apprach i.e.

mikado.annotation.protein_status.summary

3prime_partial 28124 fragment 1649

These will always be 3'partial as now no stop is included by gffread so we should generate the gff read table file on this as well rather than running the original script.

cschu commented 4 years ago

Hi @swarbred ,

Ok, that answers my question from above :)

I will now remove the execution of Gemy's partialness script as well unless you want to retain the protein completeness analysis for the post pick proteins.

gemygk commented 4 years ago

Hi @cschu Below is what we provide as the summary for the release models, which we could add to the *.release.gff3.biotype_conf.summary

Biotype	Confidence	Gene	Transcript
protein_coding_gene	High	21,301	39,545
transposable_element_gene	High	5,853	6,035
predicted_gene	Low	621	638
protein_coding_gene	Low	1,520	1,583
transposable_element_gene	Low	601	605
	Total	29,896	48,406

swarbred commented 4 years ago

Yes useful to have this at gene level as well

cschu commented 4 years ago

Do we need the ABC_EIv2.sanity_checked.release.gff3.biotype_conf.tsv, which only contains this:

#1.transcript_id        #2.biotype      #3.confidence
ABC_EIv2_0000010.1      protein_coding_gene     Low
ABC_EIv2_0000020.1      protein_coding_gene     Low
ABC_EIv2_0000030.1      protein_coding_gene     Low
ABC_EIv2_0000040.1      protein_coding_gene     High
ABC_EIv2_0000050.1      protein_coding_gene     Low

(I assume not)

The other question is: should the gene biotype, confidence also go into the final summary table?

swarbred commented 4 years ago

@cschu Correct we can do away with ABC_EIv2.sanity_checked.release.gff3.biotype_conf.tsv as we have this in the final_table.tsv

The other question is: should the gene biotype, confidence also go into the final summary table?

Sorry I'm not clear exactly what you are referring to by final summary?

cschu commented 4 years ago

Sorry I'm not clear exactly what you are referring to by final summary?

I mean the big, final table: #1.TID #2.GID #3.Coordinates #4.Strand #5.Exon number #6.cDNA length #7.Intronic length #8.CDS length #9.# coding exons #10.cDNA/CDS ratio #11.5'UTR #12.5'UTR exons #13.3'UTR #14.3'UTR exons #15.Confidence #16.Biotype #17.InFrameStop #18.Partialness

swarbred commented 4 years ago

I think for us biotype and confidence are aways a gene level attribute i.e.when calculating these we have already assigned them to genes rather than transcripts (i.e. alternative models of the same gene always have the same biotype and confidence class). That being the case we dont need a separate gene biotype / confidence in the *final_table.tsv .

swarbred commented 4 years ago

Can be closed

cschu commented 4 years ago

last implemented in 4c3b052

EI-CoreBioinformatics / minos

Additional release.gff3 stats #14