EI-CoreBioinformatics / minos

The labyrinth king judges your gene models.
GNU Lesser General Public License v3.0

Add BUSCO rules #17

Closed swarbred closed 4 years ago

swarbred commented 4 years ago

@cschu @gemygk I would like to get your views on adding rules for BUSCO: running against the genome, the final output files and the prepare output, then creating an external metric based on the output.

We (+others) will always run BUSCO on the final models, so for our own convenience it would be good to add this to GMC. It's clear from looking at Koala that a BUSCO "complete" assignment doesn't always indicate the model is really complete (some are clear fragments), but it's a useful QC metric. The relative comparison between the genomic, transcript and protein BUSCO runs is informative, as is comparing the prepare output BUSCO run to the final selected models. I would also like to use the BUSCO assignment in the mikado pick scoring. If a large multiplier is applied this would clearly make the final set of models look better from a BUSCO perspective, which could be viewed as artificial (as we are using busco to select the models and also evaluate them), but it's still a useful option to have if it helps select a more accurate set of models (and for mammalian genomes ~4000 BUSCOs is a big chunk of the geneset). By default we wouldn't apply a large multiplier, so it's just another couple of metrics among many.

The idea would be to

1) Run directly against the genome (BUSCO genome mode)
2) Run against the mikado prepare transcripts AND against the extracted mikado prepare proteins (so another gffread extraction). Create two external metrics based on this output (based on the two runs), e.g. assigning 1 to any complete or duplicated model and 0.25 to any fragment. One question would be if it's best to run this against the full set of prepare transcripts and proteins or to split these files first (based on the mikado label) and run against each separately and combine the results. This might depend on the BUSCO --limit setting. We can test this by running BUSCO on the mikado prepare gmc file for koala.
3) Run against the final transcripts and proteins
4) Combine results into a tsv file, plus calculate a few other values based on the results.

Any issues?

swarbred commented 4 years ago

Hi @cschu

The 1.0.3 runs are looking good, but I have found an issue with the busco counts: it looks like we are counting an additional 6 genes. This happens in every busco run, and it is always 1 extra missing, 1 extra fragmented and 4 extra complete. Perhaps we are counting some additional lines?

The results GMC presents

Busco Plots               proteins_final
Complete (single copy)    1637
Complete (2 copies)       30
Complete (3 copies)       1
Complete (4+ copies)      1
Complete                  1669
Duplicated                30
Fragmented                22
Missing                   79
Total                     1770

Raw busco results

# BUSCO was run in mode: proteins

    ***** Results: *****

    C:94.4%[S:41.6%,D:52.8%],F:1.2%,M:4.4%,n:1764
    1665    Complete BUSCOs (C)
    734     Complete and single-copy BUSCOs (S)
    931     Complete and duplicated BUSCOs (D)
    21      Fragmented BUSCOs (F)
    78      Missing BUSCOs (M)
    1764    Total BUSCO groups searched
cschu commented 4 years ago

Hi @swarbred

> Perhaps we are counting some additional lines?

Indeed, a wrongly initialised counter was the culprit. All 8 categories were initialized to 1 instead of 0. We end up with 6 extra counts instead of 8, because Total and Complete are later overwritten by the respective sums. This should be fixed now (looks fine in my test run) in version 1.0.4.
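The arithmetic behind the 6 extra counts can be reproduced with a small sketch. This is not the GMC code: the category names are copied from the table above, the `tally` helper is hypothetical, and the "true" per-category counts are an assumption obtained by subtracting one from each reported value.

```python
# Hypothetical category names, copied from the GMC table in this thread.
COMPLETE_SUBCATEGORIES = [
    "Complete (single copy)",
    "Complete (2 copies)",
    "Complete (3 copies)",
    "Complete (4+ copies)",
]
CATEGORIES = COMPLETE_SUBCATEGORIES + ["Complete", "Fragmented", "Missing", "Total"]


def tally(observed, initial):
    """Count BUSCOs per category, starting every counter at `initial`."""
    counts = {category: initial for category in CATEGORIES}
    for category, n in observed.items():
        counts[category] += n
    # Complete and Total are overwritten by sums, so their own start value is lost.
    counts["Complete"] = sum(counts[c] for c in COMPLETE_SUBCATEGORIES)
    counts["Total"] = counts["Complete"] + counts["Fragmented"] + counts["Missing"]
    return counts


# Assumed true counts (reported values minus one per category).
observed = {
    "Complete (single copy)": 1636,
    "Complete (2 copies)": 29,
    "Fragmented": 21,
    "Missing": 78,
}

buggy = tally(observed, initial=1)  # counters wrongly start at 1
fixed = tally(observed, initial=0)  # counters start at 0
print(buggy["Total"] - fixed["Total"])  # 6
```

With counters starting at 1, the sketch gives 1669 complete, 22 fragmented, 79 missing and 1770 in total; starting at 0 it gives the raw BUSCO values of 1665, 21, 78 and 1764.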

swarbred commented 4 years ago

Thanks @cschu

swarbred commented 4 years ago

Hi @cschu

The busco analysis is really great to have incorporated. There are three enhancements that would be useful to add to complete the BUSCO functionality.

1. best achievable busco counts

It would be useful to get the best achievable busco complete and missing counts from across all the busco-runs/proteins_preparefull_table.tsv and busco-runs/transcripts_preparefull_table.tsv and just add these to the busco_final_table.tsv (just comment lines below the table would be fine). This allows us to compare the busco results for the final run to the best achievable value if we had cherry picked models “perfectly”.

best busco complete count (all busco-runs/proteins_prepare) #count of unique busco ids marked as Complete or Duplicated across all proteins_prepare runs
best busco complete count (all busco-runs/transcripts_prepare) #count of unique busco ids marked as Complete or Duplicated across all transcripts_prepare runs

The best (i.e. lowest) Missing busco count is the count of unique Missing busco ids that are COMMON to ALL proteins_prepare and transcripts_prepare runs (i.e. missing in all input GFFs)

best busco missing count (all busco-runs/proteins_prepare) #count of unique busco ids marked as Missing in every one of the proteins_prepare runs
best busco missing count (all busco-runs/transcripts_prepare) #count of unique busco ids marked as Missing in every one of the transcripts_prepare runs
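A minimal sketch of these two counts, assuming each full_table.tsv is tab-separated with the BUSCO id in the first column and the status in the second, and that comment lines start with `#`. The helper names are hypothetical and the tables are passed in as text to keep the example self-contained. Complete ids are combined with a union (found in any run), missing ids with an intersection (missing in every run).

```python
import csv
import io


def read_full_table(text):
    """Map BUSCO id -> set of statuses, from the text of one full_table.tsv."""
    statuses = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment/header lines
        statuses.setdefault(row[0], set()).add(row[1])
    return statuses


def best_achievable(tables):
    """Return (best complete count, lowest missing count) over several runs."""
    complete_sets, missing_sets = [], []
    for table in tables:
        complete_sets.append(
            {busco for busco, s in table.items() if s & {"Complete", "Duplicated"}}
        )
        missing_sets.append({busco for busco, s in table.items() if "Missing" in s})
    best_complete = set().union(*complete_sets)  # complete in ANY run
    # Missing in EVERY run; guard against an empty list of runs.
    lowest_missing = set.intersection(*missing_sets) if missing_sets else set()
    return len(best_complete), len(lowest_missing)


# Toy input with made-up ids, two prepare runs.
run1 = "# Busco id\tStatus\tSequence\n1at\tComplete\ttx1\n2at\tMissing\n3at\tMissing\n"
run2 = "1at\tMissing\n2at\tDuplicated\ttxA\n2at\tDuplicated\ttxB\n3at\tMissing\n"
counts = best_achievable([read_full_table(run1), read_full_table(run2)])
print(counts)  # (2, 1)
```

Taking the union of the missing sets instead would count any BUSCO missing in at least one run, which overstates the lowest achievable value.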

2. Generate review tsv

Provide a table of final release transcript IDs where busco indicates these are not complete, but where a model in the input GFFs (prepare runs) was marked as complete or duplicated, i.e. we could have selected a model that would have led to a better busco result. These examples can then be examined, tweaks made to the mikado scoring, and GMC rerun from the pick stage.

Pull out the BUSCO IDs that are Complete or Duplicated in busco-runs/proteins_preparefull_table.tsv but NOT Complete or Duplicated in busco-runs/proteins_finalfull_table.tsv

For these BUSCO IDs, extract the corresponding transcript model ID (sequence field) from each of the busco-runs/proteins_preparefull_table.tsv, restricted to entries that are Complete or Duplicated. From busco-runs/proteins_final, extract the transcript ID of the Fragmented entries (sequence field) and the (BUSCO) Status (will be Missing or Fragmented). Then, based on the transcript ID, take the Coordinates field from *sanity_checked.release.gff3.final_table.tsv and generate a tsv file.

Based on the file I examined, a busco id with status Fragmented links to just a single Transcript ID, i.e. busco never classes a duplicated busco as Fragmented. However, I can’t be sure that is always true, so I would suggest making the transcript ID the key rather than the Busco ID.

Busco ID    Transcript ID    Busco Status    Coordinates    prepare TID (C or D)    prepare TID coordinates
9250at3193    LATSA3860_EIv1.0_0456290.1    Fragmented    ctg529:353298..362444*    LATSA3860_run1_wRNA_LATSA3860_run1_0324540.1,LATSA3860_run3_woRNA_LATSA3860_run3_0237070.3,mikado_protein_run_mikado.ctg529G3.1    ctg529:353287..362445,ctg529:353617..362415,ctg529:353814..362346
10535at3193    -    Missing    -    LATSA3860_run1_wRNA_LATSA3860_run1_0934570.1,LATSA3860_run2_woRNA_LATSA3860_run2_0720540.1,LATSA3860_run3_woRNA_LATSA3860_run3_0706050.1    ctg216:588997..592334,ctg216:588997..592334,ctg216:588997..592334
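The selection behind such a table could look roughly like the sketch below. It assumes the BUSCO tables have already been reduced to (busco id, status, transcript id) tuples, with `None` as the transcript id for Missing entries. The function name and the toy ids are hypothetical, and the two coordinate columns are left out. One row is produced per final entry, so a BUSCO with more than one Fragmented transcript would give more than one row.

```python
from collections import defaultdict

GOOD = {"Complete", "Duplicated"}


def review_rows(prepare_rows, final_rows):
    """List BUSCOs that are Complete/Duplicated in a prepare run but not in
    the final run, as (busco_id, final_tid, final_status, prepare_tids)."""
    prepare_good = defaultdict(list)
    for busco_id, status, tid in prepare_rows:
        if status in GOOD:
            prepare_good[busco_id].append(tid)

    final_good = {busco_id for busco_id, status, _ in final_rows if status in GOOD}

    rows = []
    for busco_id, status, tid in final_rows:
        if busco_id in prepare_good and busco_id not in final_good:
            # Missing entries have no transcript, so write "-" as in the table.
            rows.append((busco_id, tid or "-", status, ",".join(prepare_good[busco_id])))
    return rows


# Toy data with made-up ids.
prepare = [
    ("b1", "Complete", "run1_t1"),
    ("b1", "Duplicated", "run3_t7"),
    ("b2", "Complete", "run2_t4"),
    ("b3", "Complete", "run1_t9"),
]
final = [
    ("b1", "Fragmented", "final_t1"),
    ("b2", "Missing", None),
    ("b3", "Complete", "final_t3"),
]
rows = review_rows(prepare, final)
for row in rows:
    print("\t".join(row))
```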

I’ve included the coordinates of the prepare transcripts in the above table. Getting these would require a bit more work, as they would need to be extracted from the mikado_prepared.gtf, or you would need to generate the gffread.table.txt (following mikado prepare) from the mikado_prepared.gtf file and then extract them from this. It would be useful to have them to help review the missing BUSCOs.

*Just for convenience, can we write coordinates as ctg529:353298..362444 rather than ctg529:353298-362444 (as they are in the final_table.tsv)? This way they can just be copied directly into the browser.
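A small sketch of that conversion, assuming the coordinates are a single `seqid:start-end` string. The function name is hypothetical, and strings that do not match the pattern are returned unchanged.

```python
import re

# seqid:start-end, where the seqid itself may contain ':' or '-'.
_COORD = re.compile(r"^(?P<seqid>.+):(?P<start>\d+)-(?P<end>\d+)$")


def to_browser_coords(coords):
    """Rewrite 'ctg:start-end' as 'ctg:start..end'."""
    match = _COORD.match(coords)
    if match is None:
        return coords  # already converted, or not a coordinate string
    return f"{match['seqid']}:{match['start']}..{match['end']}"


print(to_browser_coords("ctg529:353298-362444"))  # ctg529:353298..362444
```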

For ref the examples from the example table above http://apollo.tgac.ac.uk/CB-PPBFX-811_Lathyrus_sativus_JICv2_genome_browser/jbrowse/?loc=ctg529%3A346627..363006&tracks=DNA%2CAnnotations%2CLATSA3860_EIv1.0_GMC-1.0.4_run1_pick1%2CLATSA3860_run1_wRNA%2CLATSA3860_run2_woRNA%2CLATSA3860_run3_woRNA%2CMikado_transcript_PickRun3%2CMikado_protein_PickRun1%2CPsativum_exonerate_80cov50id&highlight=

http://apollo.tgac.ac.uk/CB-PPBFX-811_Lathyrus_sativus_JICv2_genome_browser/jbrowse/?loc=ctg216%3A584161..596260&tracks=DNA%2CAnnotations%2CLATSA3860_EIv1.0_GMC-1.0.4_run1_pick1%2CLATSA3860_run1_wRNA%2CLATSA3860_run2_woRNA%2CLATSA3860_run3_woRNA%2CMikado_transcript_PickRun3%2CMikado_protein_PickRun1%2CPsativum_exonerate_80cov50id&highlight=

3. generate mikado busco metric

As mentioned above, it would be useful to have an option to create an external metric based on the busco results (default OFF; obviously it can only be used if busco p mode is being run). I would use ONLY the busco prepare proteins runs (the transcript runs take longer, so they would delay the subsequent steps more, and we also get more busco complete hits from the protein runs). Create an external metric based on this output, assigning 1 to any complete or duplicated model and 0.25 to any fragment.

cschu commented 4 years ago
  1. / 2. are available for testing in 1.2
cschu commented 4 years ago

Regarding 3., is there one metric for each set of input proteins or will they be somehow consolidated?

swarbred commented 4 years ago

Firstly, thanks for looking at this. I didn't expect you to do it over the weekend.

It's just a single metric. We would assign 1 or 0.25 based on the busco status of each transcript in full_table.tsv. There is no overlap in the transcript ids between the different prepare runs, i.e. no tid appears in more than one busco output (these transcripts were extracted from the mikado prepare output). The metrics_matrix.txt would be updated to add the additional busco metric, with the transcripts found by busco as complete/duplicated (1) or fragmented (0.25) being assigned the relevant score and all other transcripts being assigned 0.
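As a sketch of that assignment (not the GMC implementation): the function name and the toy transcript ids below are hypothetical, and the busco rows are assumed to be already reduced to (status, transcript id) pairs. Taking the maximum covers the case where a transcript is listed more than once, which is an added assumption.

```python
SCORES = {"Complete": 1.0, "Duplicated": 1.0, "Fragmented": 0.25}


def busco_metric(all_tids, busco_rows):
    """Return {tid: score}: 1 for Complete/Duplicated, 0.25 for Fragmented,
    0 for every transcript busco did not report."""
    metric = {tid: 0.0 for tid in all_tids}
    for status, tid in busco_rows:
        if tid in metric:
            # Keep the best score if a transcript appears more than once.
            metric[tid] = max(metric[tid], SCORES.get(status, 0.0))
    return metric


# Toy data with made-up transcript ids.
tids = ["t1", "t2", "t3", "t4"]
rows = [("Complete", "t1"), ("Duplicated", "t2"), ("Fragmented", "t3")]
metric = busco_metric(tids, rows)
print(metric)  # {'t1': 1.0, 't2': 1.0, 't3': 0.25, 't4': 0.0}
```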

The option on whether to use the busco metric would be set when generating the config. If set to ON, the mikado scoring config would have the metric added (similar to the other external metrics). If set to OFF, we would still add the metric to the scoring but comment it out (similar to some of the other external metrics we don't use by default).

The simplest approach would be to always generate the busco metric when protein busco is run and populate metrics_matrix.txt even when the user has set the option NOT to use the busco metric. This way the metric would be available, just not used, as the mikado scoring config file has this metric commented out. This would make the mikado serialise step dependent on the busco metric generation being complete, the downside being that it will lengthen the run time. Doing it this way means that someone can do a gmc run with the busco metric not used, then afterwards manually change the mikado scoring file to enable the busco metric and rerun from the pick stage. I'm ok with making downstream steps dependent on the protein busco being completed unless you have another solution; from a quick look they complete within an hour.

If busco is not being run we can still have the busco metric present but commented out in the mikado scoring file, in that case not having any busco metric in metrics_matrix.txt shouldn't be an issue.

This has a few more complications than I originally envisaged.

swarbred commented 4 years ago

I reran an earlier run using --rerun-from collapse_metrics (I assume this should be compatible), see /ei/workarea/group-pb/CB-PPBFX-811_Annotation_of_Lathyrus_sativus/Analysis/gmc/mikado-2.0rc6_d094f99_CBG/GMC-1.0.4_run1/results

the missing counts are 0, so we must have this wrong somewhere (sorry haven't looked at the code yet)

# lowest achievable missing protein BUSCO count: 0
# lowest achievable missing transcript BUSCO count: 0

the review table only has one entry (should be 10s of examples)

Busco ID    Transcript ID   Busco Status    Coordinates prepare TID (C or D)    prepare TID coordinates
92128at3193 LATSA3860_EIv1.0_0283000.1  fragmented  ctg2871:231473..234189  LATSA3860_run3_woRNA_LATSA3860_run3_0600840.1   ctg2871:231349..234115
cschu commented 4 years ago

Hi @swarbred ,

First a question: I assume the busco metric does not go into the non_fragmentary expression?

Second, regarding the errors you found: yes, that struck me as a bit odd as well. I figured the single entry might be due to my test set.

Concerning the lowest missing count, you stated that this should be the common missing ones, which I thought I had implemented by intersecting all missing sets.

cschu commented 4 years ago

Ok, concerning these two errors with the counts and single table rows, I have found the likely causes for those and fixed them in the current 1.2 installation. (missing counts: wrong use of set intersections, single row in review table: indentation error)

swarbred commented 4 years ago

> First a question: I assume the busco metric does not go into the non_fragmentary expression?

That's correct yes

swarbred commented 4 years ago

@cschu The busco metric seems to work as expected. Related to that, could we add the busco_score (busco_proteins_busco) to mikado.annotation.collapsed_metrics.tsv so that it can be used in the classification? Could we also update gmc_run.run_config.yaml so that we never discard busco complete or fragmented models, and regard models with a complete status as high confidence, i.e.

collapse_metrics_thresholds:

  discard: '{busco_score} == 0 and {protein_score} == 0 and {transcript_score} == 0 and {hom_acov_score} == 0 and ({expression_score} < 0.3 or {short_cds} == 1)'

  hi_confidence: '{classification} == 1 or {busco_score} == 1 or {hom_acov_score} >= 0.8 or ({hom_acov_score}

cschu commented 4 years ago

Can be tested with 1.3.

cschu commented 4 years ago

1.4 ready for testing (--busco-scoring will activate the busco score in the mikado scores config).

swarbred commented 4 years ago

The last comment has been tested; for me this can be closed.

swarbred commented 4 years ago

Query @cschu

Currently, if --busco-scoring is not specified, the scoring yaml contains

# external.busco_proteins: {rescaling: max, use_raw: true, multiplier: 3}

i.e. commented out

How does this affect the metrics generation? i.e. are the busco metrics generated but just not used or are they not generated at all?

I would assume they are always generated irrespective of being used.

cschu commented 4 years ago

busco runs should not be affected by not specifying --busco-scoring; that option only turns the scoring on or off. Specifying --busco-scoring, however, will force busco protein analysis.
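The behaviour described here can be summarised in a short sketch. Both helper names and the `protein_busco_requested` parameter are hypothetical, and writing the disabled entry with a leading `# ` is an assumption about how the scoring file is commented.

```python
# Entry text as quoted earlier in this thread.
SCORING_ENTRY = "external.busco_proteins: {rescaling: max, use_raw: true, multiplier: 3}"


def scoring_line(busco_scoring):
    """Scoring-file line: active with --busco-scoring, commented out without."""
    return SCORING_ENTRY if busco_scoring else "# " + SCORING_ENTRY


def run_protein_busco(protein_busco_requested, busco_scoring):
    """--busco-scoring forces the protein busco analysis; leaving it off
    does not change whether busco runs."""
    return protein_busco_requested or busco_scoring


print(scoring_line(False))
print(run_protein_busco(protein_busco_requested=False, busco_scoring=True))  # True
```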