Closed swarbred closed 4 years ago
Can easily take the rules from qaa, so should be no issue from the technical side. Not sure how we can get @gemygk's BUSCO modification when it comes down to doing a conda install, but other than that, I don't see a problem. Splitting or not splitting also -- at first thought -- should be fine.
@swarbred The idea of adding BUSCO as a metric is great. But as you mentioned it could be viewed as artificial, but again given that we have a model to represent the BUSCO set and favouring it has its merits.
I think, apart from the points you mentioned, are we planning to incorporate the BUSCO models that are unique to the genome and pass them to integration stage?
Running BUSCO on protein set is quite quick but for transcripts, it takes some time. So splitting the transcripts and later combining them would be the ideal choice.
For example: The runtime for Koala (93698 transcripts) Protein run - ~30 mins using 32 threads Transcriptome run - ~ 5hrs 30 mins using 32 threads
I think, apart from the points you mentioned, are we planning to incorporate the BUSCO models that are unique to the genome and pass them to integration stage?
There is merit in that but i wasn't thinking of tackiling this at the current time.
Running BUSCO on protein set is quite quick but for transcripts, it takes some time. So splitting the transcripts and later combining them would be the ideal choice.
From discussing this it seems the BUSCO run against the genome is not the most robust and has a substantially longer run time, both are reasons for not incorporating this directly into gmc. It is however something we will always run (so for our convenience this would be nice to have) . Certainly the default would be not to run this, an alternative would be to provide the BUSCO output form a genome run as an input., just to enable us to bring the BUSCO results from a genome run, a mikado prepare stage (transcript and protein) and final annotation (transcript and protein) together and make some automated comparisons. The final alternative is to just exclude it completely.
The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.
Which BUSCO runs are run would need to be configurable and the default would likely be to just run the proteins runs on the prepare output and final annotations.
The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.
What does "split this by label" mean?
Sorry, i mean for the busco transcript run on the mikado prepare output , to split the mikado prepare fasta file in to seperate files based on the label that has been assigned to the transcripts.i.e the label in the list file used in mikado configure . Then run busco on each of these.We should also split the fasta for the busco protein run on the prepare output.
Running seperately will give us busco results for each annotation set contained in the prepare output. Those individual results are useful but we would also want to present the numbers contained duplicated missing across all sets so this can be compared to the busco run on the final annotation (which is a single transcript and protein run i.e no spliting).
Sent from Outlook Mobilehttps://aka.ms/blhgte
From: Christian Schudoma notifications@github.com Sent: Thursday, April 9, 2020 12:45:45 AM To: EI-CoreBioinformatics/gmc gmc@noreply.github.com Cc: David Swarbreck (EI) David.Swarbreck@earlham.ac.uk; Mention mention@noreply.github.com Subject: Re: [EI-CoreBioinformatics/gmc] Add BUSCO rules (#17)
The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.
What does "split this by label" mean?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/EI-CoreBioinformatics/gmc/issues/17#issuecomment-611250057, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAY5ZYIYJH6JT2UCGQ27J3DRLUECTANCNFSM4L7GQXVA.
@swarbred @gemygk
Minor design decision: do you want the used busco levels as
a) comma-separated list
--busco-level proteins,transcripts,genome
or
b) multiple parameters
--busco-proteins, --busco-transcripts, --busco-genome
or
c) the same parameter that can be specified multiple times
--busco-level proteins --busco-level transcripts --busco-level genome
?
@cschu @swarbred
I would go with option a)
Also, what about
--busco-mode proteins,transcriptome,genome
to keep inline with the BUSCO parameters and options
Yea, agreed. But I was also thinking to allow multiple aliases for the different modes (busco-mode-names, t, p, g, etc., for lazy typists ;))
Also, what about --busco-mode proteins,transcriptome,genome to keep inline with the BUSCO parameters and options
No strong preference but I would vote a) perhaps with an "all" option for proteins,transcripts,genome, I assume not specifying --busco-level would turn this off
I would add "all", "none"/"off" and the default setting would be just proteins.
No strong preference but I would vote a) perhaps with an "all" option for proteins,transcripts,genome, I assume not specifying --busco-level would turn this off
Status right now is : 0.6.1 has proteins (prepare, final), transcripts (prepare, final) and genome, but genome doesn't work yet (where I don't know whether this is because of the file system issues). Currently, these are just run and not used / summarised, but that will be next.
@swarbred @gemygk I have now fine-tuned this to Busco 4 (odb10 databases are in group-pb/BUSCO-databases/odb10). There are some differences to Busco 3 regarding command line usage. Supporting both versions should be possible but would require some remodeling of the snakemake rules. Unless you really want/need it, I would drop the BUSCO 3 support.
The output file/directory structure for Busco4 is also quite different.
@cschu I'm not familiar with the BUSCO3 / 4 differences but I don't feel we need to support both, BUSCO 4 would be fine.
@swarbred If we were to provide a premade genome busco run, what data would that encompass? Only the summary stats?
@cschu We will need the
short_summary.*.txt and full_table.tsv
Hi @cschu
So my run for GENANNO-457 completed with 0.7, everything busco completed and I'll add some details later regarding summary info we want to reformat/extract. This is really useful, so thanks for adding.
While looking into another issue (will comment in a separate ticket) I tried rerunning your GMC_dev_testdata_MAZ test set with a few modification. When trying to turn off the busco analysis I just removed --busco-lineage and added --busco-level off
this then gives an error
Full Traceback (most recent call last):
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/__init__.py", line 496, in snakemake
snakefile, overwrite_first_rule=True, print_compilation=print_compilation
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/workflow.py", line 987, in include
exec(compile(code, snakefile, "exec"), self.globals)
File "/ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py", line 71, in <module>
BUSCO_LINEAGE = os.path.basename(config["busco_analyses"]["lineage"])
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/posixpath.py", line 146, in basename
p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType
TypeError in line 71 of /ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py:
expected str, bytes or os.PathLike object, not NoneType
File "/ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py", line 71, in <module>
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/posixpath.py", line 146, in basename
so unless I'm not setting this correctly we might have a bug.
Hi @swarbred ,
yea that was a bug, patched in 0.7.1 (a4fda50).
Cheers Christian
HI @cschu @gemygk
Here is what I was thinking for the presentation of the BUSCO results.
Gemy has been extracting this info, so please gemy correct / point out any issues
The short_summary.txt gives the overall breakdown of complete, missing etc BUSCOs The issue with this is that it's based on BUSCOs linked to transcripts , so the duplicated and single copy busco counts don't give an accurate refection of if the gene set contains duplicated copies of BUSCO genes. A gene with 2 splice variants potentially will be shown as duplicated in these results, which is not very useful to help assess a gene set.
Gemy has been using the full_table.tsv to summarise these counts at the gene level. The full_table.tsv gives the sequence (transcript) name in col3, often the gene transcript association is part of the name i.e. a .1 .2 suffix to indicate a transcript, however this is not always going to be the case i.e. ensembl models. I assume in those cases (or perhaps all cases) then Gemy has created a gene transcript mapping file from the original input gtf/gff (i.e. what is passed to prepare) and would use this to get the "correct" gene level numbers for single copy and duplicated genes.
I realise this is a pain, without this though the duplicated and single copy numbers are useless. Unfortunately I don't think you can use the mikado_prepare.gtf file as mikado really does not care about genes in the input so the gene_id you see in this file is just the transcript_id with .gene added.
We have also been breaking down the duplicated counts into, 2 copies, 3 copies and +4 copies. this is useful for polyploid species like wheat where we were expecting 3 genes for each BUSCO.
for info the busco files relating to this are at /ei/workarea/group-ga/Projects/CB-GENANNO-436_Koala/Analysis/busco-3.0
counts are after summarising at the gene level.
It would be great to have all the BUSCO runs brought together into a table
Col1 Busco Plots Complete (single copy) Complete (2 copies) Complete (3 copies) Complete (4+ copies) Complete Duplicated Fragmented Missing Total
col2-colx giving the counts for each run (e.g. genome, transcripts_prepare-label1, transcripts_prepare-label2, proteins_prepare-label1, proteins_prepare-label2, transcripts_final, proteins_final
It would also be useful to determine complete, missing, fragment counts across all transcripts_prepare and proteins_prepare busco results. I will comment on this more later but this would indicate the best we could hope to achieve from the picking.
If we have these numbers then there are some useful comparisons we can make to indicate if the final set of gene models is of good quality.
The transcripts_final and proteins_final results should be an improvement on the genome run (and never worse) The transcripts_final and proteins_final BUSCO results should be similar (if the transcript run is substantially inferior then the ORF calling is not good) If we determine the best BUSCO results across all transcripts_prepare runs and separately across all proteins_prepare runs then the transcripts_final and proteins_final results should be similar to this.
Comment and let me know how much a pain this is.
Soo, like this? Just wondering if I have the calculations right.
Busco Plots transcripts_final transcripts_prepare/Cotton_coding transcripts_prepare/Mikado_gold transcripts_prepare/Castor_bean_coding transcripts_prepare/Turnip_mustard_coding transcripts_prepare/Capsella_grandiflora_coding transcripts_prepare/Sweet_orange_coding transcripts_prepare/Poplar_coding transcripts_prepare/MELAZ155640_run1_wRNA transcripts_prepare/MELAZ155640_run3_woRNA transcripts_prepare/Purple_osier_willow_coding transcripts_prepare/Thale_cress_coding transcripts_prepare/Cassava_coding transcripts_prepare/Clementine_coding transcripts_prepare/MELAZ155640_run2_woRNA proteins_final genome proteins_prepare/Cotton_coding proteins_prepare/Mikado_gold proteins_prepare/Castor_bean_coding proteins_prepare/Turnip_mustard_coding proteins_prepare/Capsella_grandiflora_coding proteins_prepare/Sweet_orange_coding proteins_prepare/Poplar_coding proteins_prepare/MELAZ155640_run1_wRNA proteins_prepare/MELAZ155640_run3_woRNA proteins_prepare/Purple_osier_willow_coding proteins_prepare/Thale_cress_coding proteins_prepare/Cassava_coding proteins_prepare/Clementine_coding proteins_prepare/MELAZ155640_run2_woRNA
Complete (single copy) 1285 1 1361 140 0 2 10 0 1480 1316 2 0 0 2 1235 1277 1535 1 1350 141 0 2 10 0 1461 1315 2 0 0 2 1221
Complete (2 copies) 241 0 37 11 0 0 0 0 45 28 0 0 0 0 24 244 47 0 38 10 0 0 0 0 47 33 0 0 0 0 25
Complete (3 copies) 22 0 3 5 0 1 0 0 2 3 0 0 0 0 6 19 3 0 0 2 0 0 0 0 1 1 0 0 0 0 2
Complete (4+ copies) 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 2 0 4 4 0 1 0 0 2 3 0 0 0 0 7
Complete 1285 1 1361 140 0 2 10 0 1480 1316 2 0 0 2 1235 1277 1535 1 1350 141 0 2 10 0 1461 1315 2 0 0 2 1221
Duplicated 263 0 40 16 0 1 0 0 47 31 0 0 0 0 30 269 52 0 42 16 0 1 0 0 50 37 0 0 0 0 34
Fragmented 39 1 18 13 2 1 6 0 49 19 1 0 2 3 39 47 7 1 26 15 2 1 6 0 65 20 1 0 2 3 53
Missing 27 1612 195 1445 1612 1610 1598 1614 38 248 1611 1614 1612 1609 310 21 20 1612 196 1442 1612 1610 1598 1614 38 242 1611 1614 1612 1609 306
Total 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614 1614
@gemygk If you'd like to check my math in gmc/scripts/analyse_busco.py.
This would be available in gmc-1.0, btw.
@cschu
I will let gemy comment on the count you posted but I can more easily see if these look correct running on a real project (where I have certain expectations how they should look), I take it these are gathered at the end, i.e. I could take one of my completed 0.8 runs and just rerun with 1.0 from some point after the final pick stage (if I touch a file).
Yes @swarbred I can calculate what we have from GMC 0.8 BUSCO results and compare them when we have GMC 1.0 results table
@swarbred I would save the results from that 0.8 run before running with 1.0. I don't think I have changed anything that would trigger a rerun, but I am not 100% certain that this will not re-run serialise and pick.
Hi @cschu
A parsing issue with tx2gene
One of the gff files is not parsed resulting in an error (below), is this due to the presence of pipes in the notes? (i will reformat and check)
Oreochromis_mossambicus_EIV1_scaffold_0000001 Mikado_loci mRNA 90425 110432 37 - . ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6.4;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6;Name=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6.4;note=ENSONIP00000021500.2|alnLen:719|len:719|cov:100.00|id:99.16|score:3555;alias=Oreochromis_niloticus_ENSONIP00000021500.2.m1;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13;canonical_number=13;canonical_proportion=1.0;ccode=j;cds_padded=True;deletions=0;frameshifts=0;has_start_codon=True;has_stop_codon=False;insertions=0;is_reference=True;padded=True;primary=False;retained_intron=False
210 if row[2] == "mRNA":
211 attr = dict(item.split("=") for item in row[8].split(";"))
212 print("{}_{}".format(row[1], attr["ID"]), attr["Parent"], file=tx2gene_out, flush=True, sep="\t")
Different point but are we only extracting mRNA lines? Mikado output (i.e. one of the inputs to GMC) will have ncRNAs as well (which may not be ncRNAs and could still come up in the BUSCO transcript runs). What would happen if we did not extract a model with tx2gene (i.e. a non mRNA model) then this model came up in the BUSCO output ?
Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 30 Rules claiming more threads will be scaled down. Job counts: count jobs 1 gmc_generate_tx2gene_maps 1
[Sat Apr 25 11:50:35 2020] rule gmc_generate_tx2gene_maps: input: mikado.loci.onlyrefproteins.gff3 output: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene jobid: 0 wildcards: run=mikado_protein_run threads: 30 resources: mem_mb=4096
[Sat Apr 25 11:50:36 2020] Error in rule gmc_generate_tx2gene_maps: jobid: 0 output: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene
RuleException: ValueError in line 211 of /ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk: dictionary update sequence element #8 has length 3; 2 is required File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk", line 211, in __rule_gmc_generate_tx2gene_maps File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/concurrent/futures/thread.py", line 56, in run Removing output files of failed job gmc_generate_tx2gene_maps since they might be corrupted: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message
Hi @swarbred,
Can you point me at the input gff? It is not an issue of having pipes in the Note, there must be an extra "=" or something.
Concerning mRNA/ncRNA, I will add in ncRNA when I patch this issue.
I've copied to /ei/workarea/group-ga/Tmp/mikado.loci.onlyrefproteins.gff3
This was generated by using one of the mikado util scripts that extracts from a gff based on id, so it's not a straight mikado output but dont think I've manipulated this further (it's possible though).
The problem seems to be the ccid== in line 9981, this will split into ("ccid", "", ""), which cannot be processed into a dictionary item.
9981 ['ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168.4', 'Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168', 'Name=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168.4', 'note=ENSACLP00000031356.1|alnLen:497|len:497|cov:100.00|id:98.98|score:2370', 'alias=Astatotilapia_calliptera_ENSACLP00000031356.1.m1', 'canonical_junctions=1,2,3,4,5,6,8,9,10,11', 'canonical_number=10', 'canonical_proportion=0.9090909090909091', 'ccode==', 'cds_padded=True', 'deletions=0', 'frameshifts=0', 'has_start_codon=True', 'has_stop_codon=False', 'insertions=0', 'is_reference=True', 'padded=True', 'primary=False', 'retained_intron=False']
Traceback (most recent call last):
File "test.py", line 8, in <module>
attr = dict(item.split("=") for item in row[8].split(";"))
ValueError: dictionary update sequence element #8 has length 3; 2 is required
We had the same issue with this in mikado at some point, this is the classcode id from mikado. Depending on how mikado is run this could be present multiple times. You wont see it in a normal mikado run but this file represents the models selected from the protein alignments and this will always contain these class codes. I can of course just strip from the input but it might be useful to clean this in gmc.
You mean there will potentially be multiple instances of "cclass==" in a file or in a record?
I'm happy to clean that, but why are they written like this in the first place? This seems to introduce an extra level of complexity when the attrib field cannot be parsed properly as key, value pairs without putting in extra parsing effort.
You mean there will potentially be multiple instances of "cclass==" in a file or in a record?
In the file
we have a classcode of equals and = was used to represent this, however models with this classcode were never output so we never had any issue with them being a tag value pair in a gff. Then we added a feature to mikado which made models with this ccode valid. In retrospect we should have chosen something else.
"gmc_generate_tx2gene_maps": {
"memory": "4096",
"cores": "1",
"J": "gmc_generate_tx2gene_maps"
}
config indicates to use 1 core for gmc_generate_tx2gene_maps but on running uses 30
26547288 swarbred ei-cb 1 30 4G RUNNING 2020-04-25T11:50:09 2020-04-25T11:50:10 0:08 t128n89 gmc_generate_tx2gene
Meh... :
threads:
HPC_CONFIG.get_cores("gmc_mikado_prepare")
Try with gmc-1.0.1.
Concerning mRNA/ncRNA, I will add in ncRNA when I patch this issue.
In our GFF files we should only have ncRNA and mRNA, but there may be other features that people could potentially supply. mikado may work with a broader range of feature types, I think it only really cares about exon lines but I will check with luca. Potentially (only thinking out loud) you could derive the mapping from mRNA to gene by going from exon to it's parent to the parent of this feature (i.e. gene) that way you don't need to know what the specific feature type is.
Unfortunately there are few variations here, we don't necessarily want to care about supporting everything, so it might just require some specification in the documentation about what is accepted.
My run that had a gtf input generated no tx2gene output, we should be able to accept gtf files as these are a common format and mikado works with these (it's probably better to provide these than gff) and they are easier to parse .
a gtf would look like
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci transcript 63335 65415 20.620000839233398 + . transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci exon 63335 63522 . + . transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci exon 63593 65415 . + . transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci CDS 63474 63522 . + 0 transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci CDS 63593 65115 . + 2 transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Unfortunately the 1.0 run generated an error at the point of generating the summary. the busco short summary files appear to be present as are the tx2gene files (error below)
In this run the busco genome output was given as a command line input i.e not generated in this gmc run (came from an earlier run). I mention it incase this is relevant though looking at your test busco summary output no genome run was included, so perhaps that's not implemented yet as it's a little different given no tx2gene file can be generated upfront. The gene2transcript linking could just be derived from the transcript names (gemy i assume does this, though the augustus run may just generate a single model per gene).
[Sat Apr 25 21:29:52 2020]
localrule busco_summary:
input: /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.8/2.0rc6_d094f99_CBG/GMC_1.0_run1_d094f99/busco/runs/proteins_prepare/U
output: GMC_1.0_run1_d094f99/busco_final_table.tsv
jobid: 118
reason: Missing output files: GMC_1.0_run1_d094f99/busco_final_table.tsv; Input files updated by another job: /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_U
ESC[33mJob counts:
count jobs
1 busco_summary
1ESC[0m
ESC[32m[Sat Apr 25 21:29:55 2020]ESC[0m
ESC[31mError in rule busco_summary:ESC[0m
ESC[31m jobid: 0ESC[0m
ESC[31m output: GMC_1.0_run1_d094f99/busco_final_table.tsvESC[0m
ESC[31mESC[0m
ESC[31mRuleException:
KeyError in line 1135 of /ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk:
'mikado.loci.ptn_coding_mikado.Uromyces_beticola_EI_v1.1_contig_000714G1.3'
File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk", line 1135, in __rule_busco_summary
File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/scripts/analyse_busco.py", line 20, in read_full_table
File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/concurrent/futures/thread.py", line 56, in runESC[0m
ESC[31mExiting because a job execution failed. Look above for error messageESC[0m
Trying to restart job 118.
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 100}
Ready jobs (1):
busco_summary
Selected jobs (1):
busco_summary
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 99}
Hi @cschu
The above issue is caused by the wrong "label" being prefixed to the transcript id
print("{}_{}".format(row[1], attr["ID"]), attr["Parent"], file=tx2gene_out, flush=True, sep="\t")
Currently you prefix col2 to the transcript name, but this should be the label as found in the list.txt file which mikado uses as the prefix for the prepare transcripts (and which you are using to set the busco directory name e.g. under busco/runs/transcripts_prepare).
This worked for you in your test set as the source column in the gff (col2) was the same as the label in the list.txt file, however this will not always be true.
In my case as the label and source were different then the transcript names don't agree between the tx2gene file and the busco full_table.tsv file.
Once I manually update the tx2gene file this now generates the busco_final_table.tsv file.
from a quick look, the final transcript and protein numbers are not correct, these look the same as the BUSCO reported figures
this is my run
|:-----------------------|:------------------|:---------------------------------------|:----------------------------------|:----------------------------------|:----------------
| Busco Plots | transcripts_final | transcripts_prepare/mikado_protein_run | transcripts_prepare/augustus_run3 | transcripts_prepare/augustus_run1 | transcripts_prep
| Complete (single copy) | 2064 | 2495 | 3363 | 3329 | 3162
| Complete (2 copies) | 878 | 34 | 71 | 82 | 70
| Complete (3 copies) | 492 | 2 | 1 | 5 | 4
| Complete (4+ copies) | 0 | 0 | 0 | 0 | 0
| Complete | 2064 | 2495 | 3363 | 3329 | 3162
| Duplicated | 1370 | 36 | 72 | 87 | 74
| Fragmented | 91 | 63 | 98 | 104 | 132
| Missing | 115 | 1046 | 107 | 120 | 272
| Total | 3640 | 3640 | 3640 | 3640 | 3640
134357at7898 Duplicated OREMO8127_EIv1.0_0133490.1 200.4 104 https://www.orthodb.org/v10?query=134357at7898 LYR motif-containing protein 1
134357at7898 Duplicated OREMO8127_EIv1.0_0133490.2 200.3 104 https://www.orthodb.org/v10?query=134357at7898 LYR motif-containing protein 1
This is a busco gene counted as duplicated but it's to two splice variants of the same gene, your current results are still counting this as duplicated
We can just remove the .1 .2 etc from the sequence name to get the gene name then collapse these and count the number of unique busco-gene names >= to 2 to get the number of duplicated BUSCOs
For transcripts_final the above table gives 1370 (same as BUSCO raw results), i.e.
awk '$2 == "Duplicated"' full_table.tsv |cut -f1 |sort -u |wc -l
to get the correct Duplicated counts we want
awk '$2 == "Duplicated"' full_table.tsv |cut -f1,3 |cut -d . -f1,2 |sort -u |cut -f1 |sort -n |uniq -c |sed 's/^ *//g' |awk '$1 >= 2' |wc -l;
which is 71 duplicated BUSCOs
minor one while I remember the number reported against Complete would make sense to be all complete buscos i.e. the sum of single copy + 2+ 3+ etc.
@cschu
we also need the BUSCO --limit
option to be configurable (correct me if it already exists). By default, we are running BUSCO with --limit 3
for all the modes (genome, proteins, transcripts) and report in the summary Complete (4+ copies)
which is a bit misleading.
--limit REGION_LIMIT How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
@gemygk These can be set in the config file, params: busco: busco_type sections. Currently, there is only --offline as default parameter. I will put a --limit 4 as default, but you can amend/PR this as you wish.
The 4+ I took from @swarbred 's example above.
Thanks @cschu, yes for now --limit 4
could be set as default.
Hi @swarbred,
minor one while I remember the number reported against Complete would make sense to be all complete buscos i.e. the sum of single copy + 2+ 3+ etc.
Ok, just for my understanding: duplicated will be a subset of complete?
Currently it is reported: complete=1 copy, duplicated=2+ copies, i will amend.
Regarding the false duplicates in the final_ runs: that transcript-number stripping is exactly what I have implemented to happen for post pick runs, but forgot to remove a line further downstream that overwrites this with the full gene id. This should be fixed in 1.0.2, which is currently undergoing a test run and should be available by the end of today.
Ok, just for my understanding: duplicated will be a subset of complete?
Yes correct, complete is then reporting the same kind of figure as the raw busco output C= S+D.
@gemygk 👍
The output table categories are controlled separately in
misc: busco_max_copy_number: 4
I don't want to complicate the config file with yaml references, so the --limit
and table outputs have to be set separately.
Thanks @cschu, yes for now --limit 4 could be set as default.
Hi @cschu Just to clarify are these updates are included in 1.02? What about https://github.com/EI-CoreBioinformatics/gmc/issues/17#issuecomment-619552280 ?
Hi @swarbred,
unless I have missed something, everything from this discussion (except gtf support) and the last point from issue #18 (te_score decimals) should be addressed in 1.0.2.
@cschu I think your parsing of the gff file for tx2gene errors when a mRNA line just has two tags and doesn't include a ; at the end of the line.
i.e. a gff with lines like below gave an error
Oreochromis_mossambicus_EIV1_scaffold_0000001 Mikado_loci mRNA 4478 8288 37 + . ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1.1;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1
adding ; to the end of the line and the id mapping file was generated correctly
Oreochromis_mossambicus_EIV1_scaffold_0000001 Mikado_loci mRNA 4478 8288 37 + . ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1.1;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1;
I have addressed this now (1.0.3). I had to change the parser so that instances of "ccode==" can be processed without explicitly modifying the attrib column, but that new parser was not robust against things like the missing semicolon at the end.
@cschu @gemygk I would like to get your views on adding rules for BUSCO, running against the genome, final output files and running against the prepare output then creating an external metric based on the output.
We (+others) will always run BUSCO on the final models so for our own convenience it would be good to add this to GMC. It's clear from looking at Koala that a BUSCO "complete" assignment doesn't always indicate the model is really complete (some are clear fragments) but it's a useful qc metric and the relative comparison between the genomic, transcript and protein BUSCO runs is informative as is comparing the prepare output BUSCO run to the final selected models. I would also like to use the BUSCO assignment in the mikado pick scoring. If a large multiplier is applied this would clearly make the final set of models look better from a BUSCO perspective, which could be viewed as artificial (as we are using busco to select the models and also evaluate them) but it's still a useful option to have if it helps select a more accurate set of models (and for mammalian genomes ~4000 BUSCOs is a big chunk of the geneset. As default we wouldn't apply a large multiplier so it's just another couple of metrics among many.
The idea would be to
1) Run directly against the genome (BUSCO genome mode) 2) Run against the mikado prepare transcripts AND against the extracted mikado prepare proteins (so another gffread extraction). Create two external metrics based on this output (based on the two runs) e.g. assigning 1 to any complete or duplicated model and 0.25 to any fragment . One question would be if it's best to run this against the full set of prepare transcripts and proteins or to split these files first (based on the mikado label) and run against each separately and combine the results. This might depend on the BUSCO --limit setting. We can test this by running BUSCO on the mikado prepare gmc file for koala. 3) Run against the final transcripts and proteins 4) Combine results into a tsv file, plus calculate a few other values based on the results.
Any issues?