Closed swarbred closed 4 years ago
Hi @swarbred ,
protein completeness information is generated by a port of Gemy's script (https://github.com/EI-CoreBioinformatics/gmc/blob/master/gmc/scripts/protein_completeness.py), which is run as a local rule.
1,2,3,4: I'll get on it.
1-3 are implemented in d26355f (gmc-0.4 on hpc)
As of c0cc780, gmc will generate a
#1.TID #2.GID #3.Coordinates #4.Strand #5.Exon number #6.cDNA length #7.Intronic length #8.CDS length #9.# coding exons #10.cDNA/CDS ratio #11.5'UTR #12.5'UTR exons #13.3'UTR #14.3'UTR exons #15.Confidence #16.Biotype #17.Partialness #18.InFrameStop #19.Partialness_gr
Columns 1-14 are the mikado stats output. Columns 15-16 (confidence, biotype) Column 17 is partialness from protein_completeness Columns 18-19 are from the "new" gffread, however InFrameStop does not seem to have data and Partialness_gr (gr=gffread) does not correlate with the protein_completeness partialness.
Columns 18-19 are from the "new" gffread, however InFrameStop does not seem to have data and Partialness_gr (gr=gffread) does not correlate with the protein_completeness partialness.
The inframestop is there as a check for me, we shouldn't have such cases if mikado is working as intended but we have had bugs before that have led to inframe stops.
copied from GENANNO-466
the results of the partialness script look to be odd (and disagree with the similar gffread stats)
3prime_partial 61408 fragment 12100 in-frame_stop 5
Looks like the issue is with the OREMO8127_EIv1.0.sanity_checked.release.gff3.pep.raw.fasta that's used as input (the info line is added to the header and no "." at the end of the seq). I think [~schudomc] at my request we have changed version of gffread and this is no longer compatible with the port of gemys script i.e. "." is no longer added to the sequence.
I suggest we run gffread twice once to extract the sequences and again to generate the --table option the alternative is to cean up the protein fasta file after generating. if the . is no longer added at the end of the sequence then we can remove our own partialness script as this is covered by the gffread --table output anyway. [~kaithakg] does that sound ok?
>OREMO8127_EIv1.0_0000010.1 Oreochromis_mossambicus_EIV1_scaffold_0000001 4336 9240 + 2 2201 1110 OREMO8127_EIv1.0_0000010 High True protein_coding_gen.
MFKTEKLRAVVNERLQAAAEEIFNLFEEAVRDYEKEVFRSKQEVEQQRRSLAEWKVPLQFSVQEDGACSE
QDHAEKKTVLCEDDREPQIKEEQEVELWSAPVCEKQEEADTKDAMFNIIYVERGEDGKHSENLSQTVRNG
EEDRSPSCSARTVKQSCGASEPTSECPSSEISDDEQEDSVKLHSGVNSKREEKSGETVEKPYTCSVCNKN
FRNKSILTRHMKTHTGEKPHSCSVCGKSFIQRSYLQTHMNSHSGQKPHTCSFCGKGFTQVGNMNAHMRIH
TGEKPHSCNDCGKKFREKADLIKHTVIHTGEKPYGCTLCHIKFSAQSNLTRHMKTHSGERPYSCTACGKR
Posting comments from JIRA here:
Hi [~swarbred]
If the --table output has the information then we can avoid the script. Do you know if --table format gives in-frame stop stats?
But it looks like the latest version of gffread (gffread-0.11.6_CBG) does not add . or to stop codon. There is an option to chose the stop codon symbol. {code} -S for -y option, use '' instead of '.' as stop codon translation {code}
Thanks, Gemy
David's reply:
Gemy Kaithakottil Yes if present and you indicate the correct flag (which we do)
But it looks like the latest version of gffread (gffread-0.11.6_CBG) does not add . or * to stop codon. There is an option to chose the stop codon symbol.
From my checks it's not being added to the end of the sequence so I assume this has changed and this only refers to internal stops.
Ok, separating the gffread runs still did not add '*' (nor '.') to the end of the protein sequences (same behaviour with and without -S).
I had a look at the gpertea-gffread code, finding that it has explicit instructions to prevent the stop codons from ever printing. I removed these three instructions in our gmc-dependencies container definition:
mv gffread.cpp gffread.cpp.bak
head -n 648 gffread.cpp.bak >> gffread.cpp
tail -n +649 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +650 gffread.cpp.bak | head -n 6 >> gffread.cpp
tail -n +656 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +657 gffread.cpp.bak | head -n 89 >> gffread.cpp
tail -n +746 gffread.cpp.bak | head -n 1 | sed "s/^/\/\//" >> gffread.cpp
tail -n +747 gffread.cpp.bak >> gffread.cpp
rm gffread.cpp.bak
Long story short, gmc-0.5 (e6892d5) now has its raw peptides terminated with dots, but these will be lost after processing with prinseq. The question is: do we need to process with prinseq? gffread already does 70aa / line, it just doesn't filter empty sequences (would we have those at all?)
The partialness data in the final table is now derived from gffread (renaming the categories to be more readable). Does this mean, we can get rid of the protein completeness analysis step?
@gemygk @swarbred
@cschu @gemygk
I had a look at the gpertea-gffread code, finding that it has explicit instructions to prevent the stop codons from ever printing.
Yes that's what i assumed i.e. they have made a change to not indicate a stop at the end of the sequence. That seems sensible and in agreement with what most other databases provide (e.g. if you down load protein seqs from ensembl), so I'm fine with this, it's only an issue as Gemys script for the completeness utilised this. If we use the --table option with the InFrameStop and partialness tag then we get what we want in the --table output and don't need to use the original script.
I have checked and you still get internal stops marked in the fasta sequence (here as ".") if they are present (with the gffread-0.11.6_CBG version) and the InFrameStop stet to true in the table output e.g.
scaffold_6 25975825 25980350 + 17 2690 2320 mikado.scaffold_6G1.1 . . . . true .
>mikado.scaffold_6G1.1 Name=mikado.scaffold_6G1.1 alias=polished.pacbio_scaffold_6_region_C_LQ_sampleWzA5U6ZI|cb6490_c28/f1p14/2806.mrna1_isplit.0 canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 canonical_number=16 canonical_proportion=1.0 has_start_codon=True has_stop_codon=True primary=True multiexonic=True superlocus=Mikado_superlocus:scaffold_6+:25975825-25980350 InFrameStop=true
MSTYEEFIEQNEERDGVRLSWNVWPASRLEATRLVVPVGCLYQPLKERPGFTSNTV.SCTMYT.YLSSNS
KSFMSGGL.RKNVDM.FLFSM.FFSSSIWSNFRSTPTS.TYSCIFYH.ILNNES.NHATDIFVCCGYLYG
..RTWCS.RLFTNVS.FITTKCSCWLNYIWKCCSGPRIGL.RYFQVICV.RHKRLTTETNSGYIRNRKIC
CSNCSSWCASTASTTTATS.CTSKQIYTAGAKL.CEYVRLIK.AFKRSVACFTRNETFKEHRDGIINCC.
LIGVYVSTNWCTYNAVCWWSMFPRSRSCS..KVIRYNSFP.PNNER.CSIYEKSN.TLRYLSSSCCNQWT
YN.YICVCFRSNWIDGNEAML.YNWWPYGNG.FIQFIFIQANISKGIGSRS.KQFKNGIQCYLGSKNI.R
IENYGCNRTMRIIKYQKPMCI.L.LRFRWYMSMENV.LDA.HYMCIFLRNCQPTWITYTAGWTWLHSVYY
SVSTFQWL.TN.SHHIS.ELGRPCSEYDAC.CCI.SRSICRFNGSYGSEPC.N.G.SRCDALG.SYAYTL
VSKIW.LSKR.SK.FPIARKLQFISTVHVSFKKVSISTSF...S..NIIL.AHVDA.RCYPKFNHDTANS
V.L.F.W.ARTCTFGYQ.YST..NIIDGHIFPYFDIPWRDYCSMESNGLSK.TRV..LQAVASSPR..CS
GNSQNSIPNASVY.HRTRW.SGKIFTMQSKPISNT..YVCLWR.WWSTSFDR.CKLATVYGAFEKASCLI
QCI
I am assuming that the partialness info reported is correct (no reason to think otherwise) that being the case then we can just remove the step of using our own script for partialness and NOT make any changes to gffread to add in the terminal stops.
On the second issue of the table output line being added to the fasta header (see seq from earlier comment) then this seems to be only an issue in the gmc gffread version installed if you run this version and compare to running gffread-0.11.6_CBG then the gffread-0.11.6_CBG output is correct. So we should be able to extract the seqs and generate the table output in a single command with no issue
see output at /ei/projects/fabef620-281f-4a74-beb6-9adbe4f03af8/data/results/CB-GENANNO-466_Oreochromis_mossambicus_annotation/Analysis/gmc/gmc-0.4/mikado-2.0rc6_3f086b5e_CBG/GMC_R1/GMC_run1
from running
source gffread-0.11.6_CBG && gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gffread-0.11.6_CBG_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gffread-0.11.6_CBG_cdna.fasta -x gffread-0.11.6_CBG_cds.fasta -y gffread-0.11.6_CBG_pep.fasta
and
singularity exec /ei/software/cb/containers/gmc/x86_64/Singularity.img gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gmc_gffread_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gmc_gffread_cdna.fasta -x gmc_gffread_cds.fasta -y gmc_gffread_pep.fasta
see the difference in the fasta headers
I thought the gffread version running in gmc was the same as gffread-0.11.6_CBG but this doesn't appear to be the case.
@swarbred @cschu
When we create the final protein models i.e. for external releases, we should NOT have any stop codon ('.' or '*') in the file. Many downstream processing tools (blast, interproscan, like) do not like this stop codon.
As @swarbred mentioned, I had used the stop codon feature to work out the inframe stops and completeness of the proteins. Now that we get all the information with the latest gffread we do not need my protein completeness script.
On another important note (This is what we discussed @cschu, I am not creating a new issue for this): we should run gffread twice:
source gffread-0.11.6_CBG && gffread test.gff3 -P --table @chr,@start,@end,@strand,@numexons,@covlen,@cdslen,ID,Note,confidence,representative,biotype,InFrameStop,partialness -o gffread-0.11.6_CBG_test.tbl -g ../Oreochromis_mossambicus_EIV1.fa -W -w gffread-0.11.6_CBG_cdna.fasta
and source gffread-0.11.6_CBG && gffread test.gff3 -g ../Oreochromis_mossambicus_EIV1.fa -W -x gffread-0.11.6_CBG_cds.fasta -y gffread-0.11.6_CBG_pep.fasta
The reason for this is that, if we have an input GFF with both mRNA and ncRNA features, then for example, by running
gffread test.gff3 -g test.genome.fa -W -w test.cdna.fasta -x test.cds.fasta -y test.pep.fasta
in the 'test.cdna.fasta' we only get the mRNA fasta sequences (those have CDS) and not the ncRNA fasta sequences, hence we need two steps. This was the issue with the gffread version bundled with cufflinks-2.2.1_CBG and I have not tested it on the current gffread-0.11.6_CBG version.
@gemygk @swarbred
Ok, 0.5 (a2604c3) now uses gffread-0.11.6 from the gmc dependencies container and adds in the partialness from gffread with the more descriptive categories (complete, 5prime_partial, 3prime_partial, fragment instead of ., 5, 3, 5_3). I will now remove the execution of Gemy's partialness script as well unless you want to retain the protein completeness analysis for the post pick proteins.
On another important note (This is what we discussed @cschu, I am not creating a new issue for this): we should run gffread twice:
The reason for this is that, if we have an input GFF with both mRNA and ncRNA features, then for example, by running gffread test.gff3 -g test.genome.fa -W -w test.cdna.fasta -x test.cds.fasta -y test.pep.fasta in the 'test.cdna.fasta' we only get the mRNA fasta sequences (those have CDS) and not the ncRNA fasta sequences, hence we need two steps. This was the issue with the gffread version bundled with cufflinks-2.2.1_CBG and I have not tested it on the current gffread-0.11.6_CBG version.
This is not an issue in gffread-0.11.6_CBG so you can do this in one gffread job
in 1e70c19
@cschu Looks like we still generate the protein completeness for the mikado pick models using the original apprach i.e.
mikado.annotation.protein_status.summary
3prime_partial 28124 fragment 1649
These will always be 3'partial as now no stop is included by gffread so we should generate the gff read table file on this as well rather than running the original script.
Hi @swarbred ,
Ok, that answers my question from above :)
I will now remove the execution of Gemy's partialness script as well unless you want to retain the protein completeness analysis for the post pick proteins.
Hi @cschu
Below is what we provide as the summary for the release models, which we could add to the *.release.gff3.biotype_conf.summary
Biotype | Confidence | Gene | Transcript |
---|---|---|---|
protein_coding_gene | High | 21,301 | 39,545 |
transposable_element_gene | High | 5,853 | 6,035 |
predicted_gene | Low | 621 | 638 |
protein_coding_gene | Low | 1,520 | 1,583 |
transposable_element_gene | Low | 601 | 605 |
Total | 29,896 | 48,406 |
Yes useful to have this at gene level as well
Do we need the ABC_EIv2.sanity_checked.release.gff3.biotype_conf.tsv, which only contains this:
#1.transcript_id #2.biotype #3.confidence
ABC_EIv2_0000010.1 protein_coding_gene Low
ABC_EIv2_0000020.1 protein_coding_gene Low
ABC_EIv2_0000030.1 protein_coding_gene Low
ABC_EIv2_0000040.1 protein_coding_gene High
ABC_EIv2_0000050.1 protein_coding_gene Low
(I assume not)
The other question is: should the gene biotype, confidence also go into the final summary table?
@cschu Correct we can do away with ABC_EIv2.sanity_checked.release.gff3.biotype_conf.tsv as we have this in the final_table.tsv
The other question is: should the gene biotype, confidence also go into the final summary table?
Sorry I'm not clear exactly what you are referring to by final summary?
Sorry I'm not clear exactly what you are referring to by final summary?
I mean the big, final table:
#1.TID #2.GID #3.Coordinates #4.Strand #5.Exon number #6.cDNA length #7.Intronic length #8.CDS length #9.# coding exons #10.cDNA/CDS ratio #11.5'UTR #12.5'UTR exons #13.3'UTR #14.3'UTR exons #15.Confidence #16.Biotype #17.InFrameStop #18.Partialness
I think for us biotype and confidence are aways a gene level attribute i.e.when calculating these we have already assigned them to genes rather than transcripts (i.e. alternative models of the same gene always have the same biotype and confidence class). That being the case we dont need a separate gene biotype / confidence in the *final_table.tsv .
Can be closed
last implemented in 4c3b052
Hi @cschu A minor enhancement Can we extract / summarise some additional details about the final release.gff3
Currently we produce:
urobe_EIv1.0.sanity_checked.release.gff3.mikado_stats.txt (full output of mikado stats) urobe_EIv1.0.sanity_checked.release.gff3.mikado_stats.txt.summary (summary of mikado stats)
urobe_EIv1.0.sanity_checked.release.gff3.pep.raw.protein_status.tsv (indication if the "protein" is complete) urobe_EIv1.0.sanity_checked.release.gff3.pep.raw.protein_status.summary (corresponding summary of protein completeness)
What generates the protein completeness info (gffread or is there a separate script), I couldn't tell from the logs?
Can we also
1) run mikado util stats with the --tab-stats option and create a .mikado_stats.tsv file
2) Can we also extract and summarise the biotype and confidence info i.e. generate a similar .tsv and .summary file i.e. give a summary of numbers of each combination of biotype and confidence class.
3) Currently it looks like we are using the version of gffread in source cufflinks-2.2.1_gk. I don't think we are using cufflinks for anything other than gffread so we should switch to using https://github.com/gpertea/gffread as this has some additional options (we have gffread-0.11.6_CBG installed).
It looks like we run gff read to generate the fasta files in the same step we can also create a tbl.tsv file with some additional info based on the gff i.e. adding -P and --table , this overlaps with the mikado stats --tab-stats info but also pulls in some additional info (this includes partialness if this isn't the way you are currently generating this)
4) potentially these .tsv files from 1-3 could be combined into a single file (ignoring the duplicate info shared by mikado stats and the gffread table file), gemy currently does this for the final relase file see /ei/workarea/group-ga/Projects/CB-GENANNO-444_Myzus_persicae_clone_O_v2_annotation/Data_Package/MYZPE13164_O_EIv2.1_Frozen_release_17Jan2020/EIv2.1/annotation/MYZPE13164_O_EIv2.1.annotation.gff3.transcript-stats.txt
this is mikado stats tab stats file with the confidence biotype info added so we would just be adding the InFrameStop (gffread) and partialness info (from your current process or gffread) to this