EI-CoreBioinformatics / minos

The labyrinth king judges your gene models.
GNU Lesser General Public License v3.0
9 stars 1 forks source link

Add BUSCO rules #17

Closed swarbred closed 4 years ago

swarbred commented 4 years ago

@cschu @gemygk I would like to get your views on adding rules for BUSCO, running against the genome, final output files and running against the prepare output then creating an external metric based on the output.

We (+others) will always run BUSCO on the final models so for our own convenience it would be good to add this to GMC. It's clear from looking at Koala that a BUSCO "complete" assignment doesn't always indicate the model is really complete (some are clear fragments) but it's a useful qc metric and the relative comparison between the genomic, transcript and protein BUSCO runs is informative as is comparing the prepare output BUSCO run to the final selected models. I would also like to use the BUSCO assignment in the mikado pick scoring. If a large multiplier is applied this would clearly make the final set of models look better from a BUSCO perspective, which could be viewed as artificial (as we are using busco to select the models and also evaluate them) but it's still a useful option to have if it helps select a more accurate set of models (and for mammalian genomes ~4000 BUSCOs is a big chunk of the geneset. As default we wouldn't apply a large multiplier so it's just another couple of metrics among many.

The idea would be to

1) Run directly against the genome (BUSCO genome mode) 2) Run against the mikado prepare transcripts AND against the extracted mikado prepare proteins (so another gffread extraction). Create two external metrics based on this output (based on the two runs) e.g. assigning 1 to any complete or duplicated model and 0.25 to any fragment . One question would be if it's best to run this against the full set of prepare transcripts and proteins or to split these files first (based on the mikado label) and run against each separately and combine the results. This might depend on the BUSCO --limit setting. We can test this by running BUSCO on the mikado prepare gmc file for koala. 3) Run against the final transcripts and proteins 4) Combine results into a tsv file, plus calculate a few other values based on the results.

Any issues?

cschu commented 4 years ago

Can easily take the rules from qaa, so should be no issue from the technical side. Not sure how we can get @gemygk's BUSCO modification when it comes down to doing a conda install, but other than that, I don't see a problem. Splitting or not splitting also -- at first thought -- should be fine.

gemygk commented 4 years ago

@swarbred The idea of adding BUSCO as a metric is great. But as you mentioned it could be viewed as artificial, but again given that we have a model to represent the BUSCO set and favouring it has its merits.

I think, apart from the points you mentioned, are we planning to incorporate the BUSCO models that are unique to the genome and pass them to integration stage?

Running BUSCO on protein set is quite quick but for transcripts, it takes some time. So splitting the transcripts and later combining them would be the ideal choice.

For example: The runtime for Koala (93698 transcripts) Protein run - ~30 mins using 32 threads Transcriptome run - ~ 5hrs 30 mins using 32 threads

swarbred commented 4 years ago

I think, apart from the points you mentioned, are we planning to incorporate the BUSCO models that are unique to the genome and pass them to integration stage?

There is merit in that but i wasn't thinking of tackiling this at the current time.

Running BUSCO on protein set is quite quick but for transcripts, it takes some time. So splitting the transcripts and later combining them would be the ideal choice.

From discussing this it seems the BUSCO run against the genome is not the most robust and has a substantially longer run time, both are reasons for not incorporating this directly into gmc. It is however something we will always run (so for our convenience this would be nice to have) . Certainly the default would be not to run this, an alternative would be to provide the BUSCO output form a genome run as an input., just to enable us to bring the BUSCO results from a genome run, a mikado prepare stage (transcript and protein) and final annotation (transcript and protein) together and make some automated comparisons. The final alternative is to just exclude it completely.

The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.

Which BUSCO runs are run would need to be configurable and the default would likely be to just run the proteins runs on the prepare output and final annotations.

cschu commented 4 years ago

The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.

What does "split this by label" mean?

swarbred commented 4 years ago

Sorry, i mean for the busco transcript run on the mikado prepare output , to split the mikado prepare fasta file in to seperate files based on the label that has been assigned to the transcripts.i.e the label in the list file used in mikado configure . Then run busco on each of these.We should also split the fasta for the busco protein run on the prepare output.

Running seperately will give us busco results for each annotation set contained in the prepare output. Those individual results are useful but we would also want to present the numbers contained duplicated missing across all sets so this can be compared to the busco run on the final annotation (which is a single transcript and protein run i.e no spliting).

Sent from Outlook Mobilehttps://aka.ms/blhgte


From: Christian Schudoma notifications@github.com Sent: Thursday, April 9, 2020 12:45:45 AM To: EI-CoreBioinformatics/gmc gmc@noreply.github.com Cc: David Swarbreck (EI) David.Swarbreck@earlham.ac.uk; Mention mention@noreply.github.com Subject: Re: [EI-CoreBioinformatics/gmc] Add BUSCO rules (#17)

The transcriptome run runtime suggests that it would be best to split this by label and then run individually, that also provides more useful info about the individual gene sets.

What does "split this by label" mean?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/EI-CoreBioinformatics/gmc/issues/17#issuecomment-611250057, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AAY5ZYIYJH6JT2UCGQ27J3DRLUECTANCNFSM4L7GQXVA.

cschu commented 4 years ago

@swarbred @gemygk

Minor design decision: do you want the used busco levels as

a) comma-separated list

--busco-level proteins,transcripts,genome

or

b) multiple parameters

--busco-proteins, --busco-transcripts, --busco-genome or

c) the same parameter that can be specified multiple times

--busco-level proteins --busco-level transcripts --busco-level genome ?

gemygk commented 4 years ago

@cschu @swarbred

I would go with option a)

Also, what about --busco-mode proteins,transcriptome,genome to keep inline with the BUSCO parameters and options

cschu commented 4 years ago

Yea, agreed. But I was also thinking to allow multiple aliases for the different modes (busco-mode-names, t, p, g, etc., for lazy typists ;))

Also, what about --busco-mode proteins,transcriptome,genome to keep inline with the BUSCO parameters and options

swarbred commented 4 years ago

No strong preference but I would vote a) perhaps with an "all" option for proteins,transcripts,genome, I assume not specifying --busco-level would turn this off

cschu commented 4 years ago

I would add "all", "none"/"off" and the default setting would be just proteins.

No strong preference but I would vote a) perhaps with an "all" option for proteins,transcripts,genome, I assume not specifying --busco-level would turn this off

cschu commented 4 years ago

Status right now is : 0.6.1 has proteins (prepare, final), transcripts (prepare, final) and genome, but genome doesn't work yet (where I don't know whether this is because of the file system issues). Currently, these are just run and not used / summarised, but that will be next.

cschu commented 4 years ago

@swarbred @gemygk I have now fine-tuned this to Busco 4 (odb10 databases are in group-pb/BUSCO-databases/odb10). There are some differences to Busco 3 regarding command line usage. Supporting both versions should be possible but would require some remodeling of the snakemake rules. Unless you really want/need it, I would drop the BUSCO 3 support.

cschu commented 4 years ago

The output file/directory structure for Busco4 is also quite different.

swarbred commented 4 years ago

@cschu I'm not familiar with the BUSCO3 / 4 differences but I don't feel we need to support both, BUSCO 4 would be fine.

cschu commented 4 years ago

@swarbred If we were to provide a premade genome busco run, what data would that encompass? Only the summary stats?

swarbred commented 4 years ago

@cschu We will need the

short_summary.*.txt and full_table.tsv

swarbred commented 4 years ago

Hi @cschu

So my run for GENANNO-457 completed with 0.7, everything busco completed and I'll add some details later regarding summary info we want to reformat/extract. This is really useful, so thanks for adding.

While looking into another issue (will comment in a separate ticket) I tried rerunning your GMC_dev_testdata_MAZ test set with a few modification. When trying to turn off the busco analysis I just removed --busco-lineage and added --busco-level off

this then gives an error

Full Traceback (most recent call last):
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/__init__.py", line 496, in snakemake
    snakefile, overwrite_first_rule=True, print_compilation=print_compilation
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/site-packages/snakemake/workflow.py", line 987, in include
    exec(compile(code, snakefile, "exec"), self.globals)
  File "/ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py", line 71, in <module>
    BUSCO_LINEAGE = os.path.basename(config["busco_analyses"]["lineage"])
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/posixpath.py", line 146, in basename
    p = os.fspath(p)
TypeError: expected str, bytes or os.PathLike object, not NoneType

TypeError in line 71 of /ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py:
expected str, bytes or os.PathLike object, not NoneType
  File "/ei/software/testing/gmc/0.7/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk.py", line 71, in <module>
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/posixpath.py", line 146, in basename

so unless I'm not setting this correctly we might have a bug.

cschu commented 4 years ago

Hi @swarbred ,

yea that was a bug, patched in 0.7.1 (a4fda50).

Cheers Christian

swarbred commented 4 years ago

HI @cschu @gemygk

Here is what I was thinking for the presentation of the BUSCO results.

Gemy has been extracting this info, so please gemy correct / point out any issues

The short_summary.txt gives the overall breakdown of complete, missing etc BUSCOs The issue with this is that it's based on BUSCOs linked to transcripts , so the duplicated and single copy busco counts don't give an accurate refection of if the gene set contains duplicated copies of BUSCO genes. A gene with 2 splice variants potentially will be shown as duplicated in these results, which is not very useful to help assess a gene set.

Gemy has been using the full_table.tsv to summarise these counts at the gene level. The full_table.tsv gives the sequence (transcript) name in col3, often the gene transcript association is part of the name i.e. a .1 .2 suffix to indicate a transcript, however this is not always going to be the case i.e. ensembl models. I assume in those cases (or perhaps all cases) then Gemy has created a gene transcript mapping file from the original input gtf/gff (i.e. what is passed to prepare) and would use this to get the "correct" gene level numbers for single copy and duplicated genes.

I realise this is a pain, without this though the duplicated and single copy numbers are useless. Unfortunately I don't think you can use the mikado_prepare.gtf file as mikado really does not care about genes in the input so the gene_id you see in this file is just the transcript_id with .gene added.

We have also been breaking down the duplicated counts into, 2 copies, 3 copies and +4 copies. this is useful for polyploid species like wheat where we were expecting 3 genes for each BUSCO.

see http://jira.earlham.ac.uk/secure/attachment/130355/CB-GENANNO-436_Koala.BUSCO_transcripts-proteins.07Apr2020.xlsx

for info the busco files relating to this are at /ei/workarea/group-ga/Projects/CB-GENANNO-436_Koala/Analysis/busco-3.0

counts are after summarising at the gene level.

It would be great to have all the BUSCO runs brought together into a table

Col1 Busco Plots Complete (single copy) Complete (2 copies) Complete (3 copies) Complete (4+ copies) Complete Duplicated Fragmented Missing Total

col2-colx giving the counts for each run (e.g. genome, transcripts_prepare-label1, transcripts_prepare-label2, proteins_prepare-label1, proteins_prepare-label2, transcripts_final, proteins_final

It would also be useful to determine complete, missing, fragment counts across all transcripts_prepare and proteins_prepare busco results. I will comment on this more later but this would indicate the best we could hope to achieve from the picking.

If we have these numbers then there are some useful comparisons we can make to indicate if the final set of gene models is of good quality.

The transcripts_final and proteins_final results should be an improvement on the genome run (and never worse) The transcripts_final and proteins_final BUSCO results should be similar (if the transcript run is substantially inferior then the ORF calling is not good) If we determine the best BUSCO results across all transcripts_prepare runs and separately across all proteins_prepare runs then the transcripts_final and proteins_final results should be similar to this.

Comment and let me know how much a pain this is.

cschu commented 4 years ago

Soo, like this? Just wondering if I have the calculations right.

Busco Plots     transcripts_final       transcripts_prepare/Cotton_coding       transcripts_prepare/Mikado_gold transcripts_prepare/Castor_bean_coding  transcripts_prepare/Turnip_mustard_coding       transcripts_prepare/Capsella_grandiflora_coding transcripts_prepare/Sweet_orange_coding transcripts_prepare/Poplar_coding       transcripts_prepare/MELAZ155640_run1_wRNA       transcripts_prepare/MELAZ155640_run3_woRNA      transcripts_prepare/Purple_osier_willow_coding  transcripts_prepare/Thale_cress_coding  transcripts_prepare/Cassava_coding      transcripts_prepare/Clementine_coding   transcripts_prepare/MELAZ155640_run2_woRNA      proteins_final  genome  proteins_prepare/Cotton_coding  proteins_prepare/Mikado_gold    proteins_prepare/Castor_bean_coding     proteins_prepare/Turnip_mustard_coding  proteins_prepare/Capsella_grandiflora_coding    proteins_prepare/Sweet_orange_coding    proteins_prepare/Poplar_coding  proteins_prepare/MELAZ155640_run1_wRNA  proteins_prepare/MELAZ155640_run3_woRNA proteins_prepare/Purple_osier_willow_coding     proteins_prepare/Thale_cress_coding     proteins_prepare/Cassava_coding proteins_prepare/Clementine_coding      proteins_prepare/MELAZ155640_run2_woRNA
Complete (single copy)  1285    1       1361    140     0       2       10      0       1480    1316    2       0       0       2       1235    1277    1535    1       1350    141     0       2       10      0       1461    1315    2       0       0       2       1221
Complete (2 copies)     241     0       37      11      0       0       0       0       45      28      0       0       0       0       24      244     47      0       38      10      0       0       0       0       47      33      0       0       0       0       25
Complete (3 copies)     22      0       3       5       0       1       0       0       2       3       0       0       0       0       6       19      3       0       0       2       0       0       0       0       1       1       0       0       0       0       2
Complete (4+ copies)    0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       6       2       0       4       4       0       1       0       0       2       3       0       0       0       0       7
Complete        1285    1       1361    140     0       2       10      0       1480    1316    2       0       0       2       1235    1277    1535    1       1350    141     0       2       10      0       1461    1315    2       0       0       2       1221
Duplicated      263     0       40      16      0       1       0       0       47      31      0       0       0       0       30      269     52      0       42      16      0       1       0       0       50      37      0       0       0       0       34
Fragmented      39      1       18      13      2       1       6       0       49      19      1       0       2       3       39      47      7       1       26      15      2       1       6       0       65      20      1       0       2       3       53
Missing 27      1612    195     1445    1612    1610    1598    1614    38      248     1611    1614    1612    1609    310     21      20      1612    196     1442    1612    1610    1598    1614    38      242     1611    1614    1612    1609    306
Total   1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614    1614
cschu commented 4 years ago

@gemygk If you'd like to check my math in gmc/scripts/analyse_busco.py.

cschu commented 4 years ago

This would be available in gmc-1.0, btw.

swarbred commented 4 years ago

@cschu

I will let gemy comment on the count you posted but I can more easily see if these look correct running on a real project (where I have certain expectations how they should look), I take it these are gathered at the end, i.e. I could take one of my completed 0.8 runs and just rerun with 1.0 from some point after the final pick stage (if I touch a file).

gemygk commented 4 years ago

Yes @swarbred I can calculate what we have from GMC 0.8 BUSCO results and compare them when we have GMC 1.0 results table

cschu commented 4 years ago

@swarbred I would save the results from that 0.8 run before running with 1.0. I don't think I have changed anything that would trigger a rerun, but I am not 100% certain that this will not re-run serialise and pick.

swarbred commented 4 years ago

Hi @cschu

A parsing issue with tx2gene

One of the gff files is not parsed resulting in an error (below), is this due to the presence of pipes in the notes? (i will reformat and check)

Oreochromis_mossambicus_EIV1_scaffold_0000001   Mikado_loci mRNA    90425   110432  37  -   .   ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6.4;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6;Name=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G6.4;note=ENSONIP00000021500.2|alnLen:719|len:719|cov:100.00|id:99.16|score:3555;alias=Oreochromis_niloticus_ENSONIP00000021500.2.m1;canonical_junctions=1,2,3,4,5,6,7,8,9,10,11,12,13;canonical_number=13;canonical_proportion=1.0;ccode=j;cds_padded=True;deletions=0;frameshifts=0;has_start_codon=True;has_stop_codon=False;insertions=0;is_reference=True;padded=True;primary=False;retained_intron=False
    210                                 if row[2] == "mRNA":
    211                                         attr = dict(item.split("=") for item in row[8].split(";"))
    212                                         print("{}_{}".format(row[1], attr["ID"]), attr["Parent"], file=tx2gene_out, flush=True, sep="\t")

Different point but are we only extracting mRNA lines? Mikado output (i.e. one of the inputs to GMC) will have ncRNAs as well (which may not be ncRNAs and could still come up in the BUSCO transcript runs). What would happen if we did not extract a model with tx2gene (i.e. a non mRNA model) then this model came up in the BUSCO output ?

Building DAG of jobs... Using shell: /usr/bin/bash Provided cores: 30 Rules claiming more threads will be scaled down. Job counts: count jobs 1 gmc_generate_tx2gene_maps 1

[Sat Apr 25 11:50:35 2020] rule gmc_generate_tx2gene_maps: input: mikado.loci.onlyrefproteins.gff3 output: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene jobid: 0 wildcards: run=mikado_protein_run threads: 30 resources: mem_mb=4096

[Sat Apr 25 11:50:36 2020] Error in rule gmc_generate_tx2gene_maps: jobid: 0 output: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene

RuleException: ValueError in line 211 of /ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk: dictionary update sequence element #8 has length 3; 2 is required File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk", line 211, in __rule_gmc_generate_tx2gene_maps File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/concurrent/futures/thread.py", line 56, in run Removing output files of failed job gmc_generate_tx2gene_maps since they might be corrupted: GMC-1.0_run1/tx2gene/mikado_protein_run.tx2gene Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message

cschu commented 4 years ago

Hi @swarbred,

Can you point me at the input gff? It is not an issue of having pipes in the Note, there must be an extra "=" or something.

Concerning mRNA/ncRNA, I will add in ncRNA when I patch this issue.

swarbred commented 4 years ago

I've copied to /ei/workarea/group-ga/Tmp/mikado.loci.onlyrefproteins.gff3

This was generated by using one of the mikado util scripts that extracts from a gff based on id, so it's not a straight mikado output but dont think I've manipulated this further (it's possible though).

cschu commented 4 years ago

The problem seems to be the ccid== in line 9981, this will split into ("ccid", "", ""), which cannot be processed into a dictionary item.

9981 ['ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168.4', 'Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168', 'Name=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G168.4', 'note=ENSACLP00000031356.1|alnLen:497|len:497|cov:100.00|id:98.98|score:2370', 'alias=Astatotilapia_calliptera_ENSACLP00000031356.1.m1', 'canonical_junctions=1,2,3,4,5,6,8,9,10,11', 'canonical_number=10', 'canonical_proportion=0.9090909090909091', 'ccode==', 'cds_padded=True', 'deletions=0', 'frameshifts=0', 'has_start_codon=True', 'has_stop_codon=False', 'insertions=0', 'is_reference=True', 'padded=True', 'primary=False', 'retained_intron=False']
Traceback (most recent call last):
  File "test.py", line 8, in <module>
    attr = dict(item.split("=") for item in row[8].split(";"))
ValueError: dictionary update sequence element #8 has length 3; 2 is required
swarbred commented 4 years ago

We had the same issue with this in mikado at some point, this is the classcode id from mikado. Depending on how mikado is run this could be present multiple times. You wont see it in a normal mikado run but this file represents the models selected from the protein alignments and this will always contain these class codes. I can of course just strip from the input but it might be useful to clean this in gmc.

cschu commented 4 years ago

You mean there will potentially be multiple instances of "cclass==" in a file or in a record?

I'm happy to clean that, but why are they written like this in the first place? This seems to introduce an extra level of complexity when the attrib field cannot be parsed properly as key, value pairs without putting in extra parsing effort.

swarbred commented 4 years ago

You mean there will potentially be multiple instances of "cclass==" in a file or in a record?

In the file

we have a classcode of equals and = was used to represent this, however models with this classcode were never output so we never had any issue with them being a tag value pair in a gff. Then we added a feature to mikado which made models with this ccode valid. In retrospect we should have chosen something else.

swarbred commented 4 years ago
  "gmc_generate_tx2gene_maps": {
    "memory": "4096",
    "cores": "1",
    "J": "gmc_generate_tx2gene_maps"
  }

config indicates to use 1 core for gmc_generate_tx2gene_maps but on running uses 30

26547288 swarbred ei-cb 1 30 4G RUNNING 2020-04-25T11:50:09 2020-04-25T11:50:10 0:08 t128n89 gmc_generate_tx2gene

cschu commented 4 years ago

Meh... :

threads:
        HPC_CONFIG.get_cores("gmc_mikado_prepare")
cschu commented 4 years ago

Try with gmc-1.0.1.

swarbred commented 4 years ago

Concerning mRNA/ncRNA, I will add in ncRNA when I patch this issue.

In our GFF files we should only have ncRNA and mRNA, but there may be other features that people could potentially supply. mikado may work with a broader range of feature types, I think it only really cares about exon lines but I will check with luca. Potentially (only thinking out loud) you could derive the mapping from mRNA to gene by going from exon to it's parent to the parent of this feature (i.e. gene) that way you don't need to know what the specific feature type is.

Unfortunately there are few variations here, we don't necessarily want to care about supporting everything, so it might just require some specification in the documentation about what is accepted.

My run that had a gtf input generated no tx2gene output, we should be able to accept gtf files as these are a common format and mikado works with these (it's probably better to provide these than gff) and they are easier to parse .

a gtf would look like

Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci     transcript      63335   65415   20.620000839233398      +       .       transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id 
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci     exon    63335   63522   .       +       .       transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci     exon    63593   65415   .       +       .       transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci     CDS     63474   63522   .       +       0       transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
Uromyces_beticola_EI_v1.1_contig_000001 Mikado_loci     CDS     63593   65115   .       +       2       transcript_id "mikado.Uromyces_beticola_EI_v1.1_contig_000001G1.1"; gene_id "mikado.Uromyces_beticol
swarbred commented 4 years ago

Unfortunately the 1.0 run generated an error at the point of generating the summary. the busco short summary files appear to be present as are the tx2gene files (error below)

In this run the busco genome output was given as a command line input i.e not generated in this gmc run (came from an earlier run). I mention it incase this is relevant though looking at your test busco summary output no genome run was included, so perhaps that's not implemented yet as it's a little different given no tx2gene file can be generated upfront. The gene2transcript linking could just be derived from the transcript names (gemy i assume does this, though the augustus run may just generate a single model per gene).

[Sat Apr 25 21:29:52 2020]
localrule busco_summary:
    input: /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_Uromyces_beticola/Analysis/gmc-0.8/2.0rc6_d094f99_CBG/GMC_1.0_run1_d094f99/busco/runs/proteins_prepare/U
    output: GMC_1.0_run1_d094f99/busco_final_table.tsv
    jobid: 118
    reason: Missing output files: GMC_1.0_run1_d094f99/busco_final_table.tsv; Input files updated by another job: /ei/projects/e145cad7-983f-4cae-b7e4-946300bcee57/data/results/CB-GENANNO-457_Annotation_U

ESC[33mJob counts:
        count   jobs
        1       busco_summary
        1ESC[0m
ESC[32m[Sat Apr 25 21:29:55 2020]ESC[0m
ESC[31mError in rule busco_summary:ESC[0m
ESC[31m    jobid: 0ESC[0m
ESC[31m    output: GMC_1.0_run1_d094f99/busco_final_table.tsvESC[0m
ESC[31mESC[0m
ESC[31mRuleException:
KeyError in line 1135 of /ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk:
'mikado.loci.ptn_coding_mikado.Uromyces_beticola_EI_v1.1_contig_000714G1.3'
  File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/zzz/gmc_run.smk", line 1135, in __rule_busco_summary
  File "/ei/software/cb/gmc/1.0/x86_64/lib/python3.6/site-packages/gmc/scripts/analyse_busco.py", line 20, in read_full_table
  File "/tgac/software/testing/python_miniconda/4.5.4_py3.6_cs/x86_64/lib/python3.6/concurrent/futures/thread.py", line 56, in runESC[0m
ESC[31mExiting because a job execution failed. Look above for error messageESC[0m
Trying to restart job 118.
Resources before job selection: {'_cores': 9223372036854775807, '_nodes': 100}
Ready jobs (1):
        busco_summary
Selected jobs (1):
        busco_summary
Resources after job selection: {'_cores': 9223372036854775806, '_nodes': 99}
swarbred commented 4 years ago

Hi @cschu

The above issue is caused by the wrong "label" being prefixed to the transcript id

print("{}_{}".format(row[1], attr["ID"]), attr["Parent"], file=tx2gene_out, flush=True, sep="\t")

Currently you prefix col2 to the transcript name, but this should be the label as found in the list.txt file which mikado uses as the prefix for the prepare transcripts (and which you are using to set the busco directory name e.g. under busco/runs/transcripts_prepare).

This worked for you in your test set as the source column in the gff (col2) was the same as the label in the list.txt file, however this will not always be true.

In my case as the label and source were different then the transcript names don't agree between the tx2gene file and the busco full_table.tsv file.

Once I manually update the tx2gene file this now generates the busco_final_table.tsv file.

swarbred commented 4 years ago

from a quick look, the final transcript and protein numbers are not correct, these look the same as the BUSCO reported figures

this is my run

|:-----------------------|:------------------|:---------------------------------------|:----------------------------------|:----------------------------------|:----------------
| Busco Plots            | transcripts_final | transcripts_prepare/mikado_protein_run | transcripts_prepare/augustus_run3 | transcripts_prepare/augustus_run1 | transcripts_prep
| Complete (single copy) | 2064              | 2495                                   | 3363                              | 3329                              | 3162            
| Complete (2 copies)    | 878               | 34                                     | 71                                | 82                                | 70              
| Complete (3 copies)    | 492               | 2                                      | 1                                 | 5                                 | 4               
| Complete (4+ copies)   | 0                 | 0                                      | 0                                 | 0                                 | 0               
| Complete               | 2064              | 2495                                   | 3363                              | 3329                              | 3162            
| Duplicated             | 1370              | 36                                     | 72                                | 87                                | 74              
| Fragmented             | 91                | 63                                     | 98                                | 104                               | 132             
| Missing                | 115               | 1046                                   | 107                               | 120                               | 272             
| Total                  | 3640              | 3640                                   | 3640                              | 3640                              | 3640            
134357at7898    Duplicated  OREMO8127_EIv1.0_0133490.1  200.4   104 https://www.orthodb.org/v10?query=134357at7898  LYR motif-containing protein 1
134357at7898    Duplicated  OREMO8127_EIv1.0_0133490.2  200.3   104 https://www.orthodb.org/v10?query=134357at7898  LYR motif-containing protein 1

This is a busco gene counted as duplicated but it's to two splice variants of the same gene, your current results are still counting this as duplicated

We can just remove the .1 .2 etc from the sequence name to get the gene name then collapse these and count the number of unique busco-gene names >= to 2 to get the number of duplicated BUSCOs

For transcripts_final the above table gives 1370 (same as BUSCO raw results), i.e.

awk '$2 == "Duplicated"' full_table.tsv |cut -f1 |sort -u |wc -l

to get the correct Duplicated counts we want

awk '$2 == "Duplicated"' full_table.tsv |cut -f1,3 |cut -d . -f1,2 |sort -u |cut -f1 |sort -n |uniq -c |sed 's/^ *//g' |awk '$1 >= 2' |wc -l;

which is 71 duplicated BUSCOs

swarbred commented 4 years ago

minor one while I remember the number reported against Complete would make sense to be all complete buscos i.e. the sum of single copy + 2+ 3+ etc.

gemygk commented 4 years ago

@cschu we also need the BUSCO --limit option to be configurable (correct me if it already exists). By default, we are running BUSCO with --limit 3 for all the modes (genome, proteins, transcripts) and report in the summary Complete (4+ copies) which is a bit misleading.

  --limit REGION_LIMIT  How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
cschu commented 4 years ago

@gemygk These can be set in the config file, params: busco: busco_type sections. Currently, there is only --offline as default parameter. I will put a --limit 4 as default, but you can amend/PR this as you wish.

The 4+ I took from @swarbred 's example above.

gemygk commented 4 years ago

Thanks @cschu, yes for now --limit 4 could be set as default.

cschu commented 4 years ago

Hi @swarbred,

minor one while I remember the number reported against Complete would make sense to be all complete buscos i.e. the sum of single copy + 2+ 3+ etc.

Ok, just for my understanding: duplicated will be a subset of complete?

Currently it is reported: complete=1 copy, duplicated=2+ copies, i will amend.

Regarding the false duplicates in the final_ runs: that transcript-number stripping is exactly what I have implemented to happen for post pick runs, but forgot to remove a line further downstream that overwrites this with the full gene id. This should be fixed in 1.0.2, which is currently undergoing a test run and should be available by the end of today.

swarbred commented 4 years ago

Ok, just for my understanding: duplicated will be a subset of complete?

Yes correct, complete is then reporting the same kind of figure as the raw busco output C= S+D.

cschu commented 4 years ago

@gemygk 👍

The output table categories are controlled separately in

misc: busco_max_copy_number: 4

I don't want to complicate the config file with yaml references, so the --limit and table outputs have to be set separately.

Thanks @cschu, yes for now --limit 4 could be set as default.

swarbred commented 4 years ago

Hi @cschu Just to clarify are these updates are included in 1.02? What about https://github.com/EI-CoreBioinformatics/gmc/issues/17#issuecomment-619552280 ?

cschu commented 4 years ago

Hi @swarbred,

unless I have missed something, everything from this discussion (except gtf support) and the last point from issue #18 (te_score decimals) should be addressed in 1.0.2.

swarbred commented 4 years ago

@cschu I think your parsing of the gff file for tx2gene errors when a mRNA line just has two tags and doesn't include a ; at the end of the line.

i.e. a gff with lines like below gave an error

Oreochromis_mossambicus_EIV1_scaffold_0000001   Mikado_loci mRNA    4478    8288    37  +   .   ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1.1;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1

adding ; to the end of the line and the id mapping file was generated correctly

Oreochromis_mossambicus_EIV1_scaffold_0000001   Mikado_loci mRNA    4478    8288    37  +   .   ID=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1.1;Parent=mikado.Oreochromis_mossambicus_EIV1_scaffold_0000001G1;
cschu commented 4 years ago

I have addressed this now (1.0.3). I had to change the parser so that instances of "ccode==" can be processed without explicitly modifying the attrib column, but that new parser was not robust against things like the missing semicolon at the end.