RobertsLab / resources

https://robertslab.github.io/resources/

Geoduck genome v074 BUSCO score low? #825

Closed shellywanamaker closed 4 years ago

shellywanamaker commented 4 years ago

When I presented the geoduck data at PAG 2020, some people thought the BUSCO score of 71.5% (which our manuscript currently reports) was low, as was previously discussed on our Slack channel (https://genefish.slack.com/archives/GG4HW5SC9/p1551386822001000). According to @kubu4's notebook post, BUSCO was run only on the 18 scaffolds, so any genes in unplaced sequences (or data not included in the 18 scaffolds) would be missed. Someone suggested running BUSCO on all the data to get a better estimate of completeness and to see if there are regions we should be including. But maybe we just report it as is for our current manuscript and include caveats. Someone also mentioned that if the species has high sequence divergence, we shouldn't expect a high BUSCO score. @kubu4, have you run BUSCO including all the data? Maybe on v070?

kubu4 commented 4 years ago

BUSCO for v070 is 84.2%

https://robertslab.github.io/sams-notebook/2019/07/10/Genome-Assessment-BUSCO-Metazoa-on-Pgenerosa_v070-on-Mox.html
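
For context on how a score like this relates to the per-BUSCO table, here is a minimal sketch that recomputes summary percentages from a BUSCO `full_table` file. It assumes the table is tab-separated with the BUSCO ID in column 1 and the status (`Complete`, `Duplicated`, `Fragmented`, `Missing`) in column 2, and that header lines start with `#`. The function name `busco_status_pct` and the toy rows are hypothetical, not data from the notebook. The "C" value is taken as single-copy plus duplicated, which is how BUSCO's short summary reports it.

```shell
# Sketch: summarize BUSCO statuses from a full_table read on stdin.
# Assumed layout: tab-separated, col 1 = BUSCO ID, col 2 = status.
busco_status_pct() {
  awk -F '\t' '
    /^#/ || NF < 2 { next }             # skip header comments and blank lines
    !seen[$1]++ { n[$2]++; total++ }    # count each BUSCO ID once (duplicates span rows)
    END {
      if (total == 0) exit 1
      s = n["Complete"]; d = n["Duplicated"]
      f = n["Fragmented"]; m = n["Missing"]
      printf "C=%.1f%% S=%.1f%% D=%.1f%% F=%.1f%% M=%.1f%% n=%d\n",
             100 * (s + d) / total, 100 * s / total, 100 * d / total,
             100 * f / total, 100 * m / total, total
    }
  '
}

# Toy input with made-up IDs (A-D), only to show the expected shape.
printf '# Busco id\tStatus\tContig\nA\tComplete\tPGA_scaffold1\nB\tDuplicated\tPGA_scaffold2\nB\tDuplicated\tPGA_scaffold6\nC\tFragmented\tPGA_scaffold3\nD\tMissing\n' \
  | busco_status_pct
# prints: C=50.0% S=25.0% D=25.0% F=25.0% M=25.0% n=4
```

On a real run this would be called as `busco_status_pct < full_table_Pgenerosa_v070.tsv`, using the file name that appears later in this thread.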

sr320 commented 4 years ago

curious why ….

augustus_species=fly

On Jan 14, 2020, 1:28 PM -0800, kubu4 notifications@github.com, wrote:

BUSCO for v070 is 84.2% https://robertslab.github.io/sams-notebook/2019/07/10/Genome-Assessment-BUSCO-Metazoa-on-Pgenerosa_v070-on-Mox.html

kubu4 commented 4 years ago

Curious why using fly or curious why difference in scores?

kubu4 commented 4 years ago

If the former, D.melanogaster is the most closely related species available in Augustus, and Augustus requires the user to select a species.

kubu4 commented 4 years ago

Here's the list of species Augustus currently accepts:

  - Homo sapiens (human)
  - Drosophila melanogaster (fruit fly)
  - Arabidopsis thaliana (plant)
  - Brugia malayi (nematode)
  - Aedes aegypti (mosquito)
  - Coprinus cinereus (fungus)
  - Tribolium castaneum (bug)
  - Schistosoma mansoni (worm)
  - Tetrahymena thermophila (ciliate)
  - Galdieria sulphuraria (red algae)
  - Zea mays (maize)
  - Toxoplasma gondii (parasitic protozoa)
  - Caenorhabditis elegans (worm)
  - Aspergillus fumigatus
  - Aspergillus nidulans
  - Aspergillus oryzae
  - Aspergillus terreus
  - Botrytis cinerea
  - Candida albicans
  - Candida guilliermondii
  - Candida tropicalis
  - Chaetomium globosum
  - Coccidioides immitis
  - Cryptococcus neoformans gattii
  - Cryptococcus neoformans neoformans
  - Debaryomyces hansenii
  - Encephalitozoon cuniculi
  - Eremothecium gossypii
  - Fusarium graminearum
  - Histoplasma capsulatum
  - Kluyveromyces lactis
  - Laccaria bicolor
  - Lodderomyces elongisporus
  - Magnaporthe grisea
  - Neurospora crassa
  - Petromyzon marinus (sea lamprey)
  - Phanerochaete chrysosporium
  - Pichia stipitis
  - Rhizopus oryzae
  - Saccharomyces cerevisiae
  - Schizosaccharomyces pombe
  - Ustilago maydis
  - Yarrowia lipolytica
  - Nasonia vitripennis (wasp)
  - Solanum lycopersicum (tomato)
  - Chlamydomonas reinhardtii (green algae)
  - Amphimedon queenslandica (sponge)
  - Acyrthosiphon pisum (pea aphid)
  - Leishmania tarentolae (protozoa, intronless)
  - Trichinella spiralis

sr320 commented 4 years ago

for fun, could you run setting the lineage as metazoa?

sr320 commented 4 years ago

also might be worth running v4, has automated lineage selection

BUSCOv4 - Benchmarking sets of Universal Single-Copy Orthologs.
Main changes in v4:

  - Automated selection of lineages issued from https://www.orthodb.org/ release 10
  - Automated download of all necessary files and datasets to conduct a run

kubu4 commented 4 years ago

You can't select a lineage with Augustus; species only (from that list above).

For BUSCO, the BUSCO database used was metazoa.

shellywanamaker commented 4 years ago

@kubu4 do you need to use the -sp option? Fig. 2 in "BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics" shows "when no pretrained parameter set is available...BUSCO-trained predictions are substantially better than using Augustus parameters from another arthropod (fly)." Is there a way to just use BUSCO-trained predictions? I'm not even sure if this is relevant here; I don't really understand why Augustus is used.

kubu4 commented 4 years ago

I'll have to re-visit this, but it seems like BUSCO relies on Augustus (based on the BUSCO config file I created in that notebook linked above). However, with that said, not sure if setting an Augustus species is required when executing BUSCO...
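
As a rough illustration of where these options sit on the command line, the sketch below only prints candidate BUSCO invocations and does not run anything. The FASTA name, run names, and lineage path are placeholders rather than the values used in the notebook, and the helper names `busco_v3_cmd` and `busco_v4_cmd` are hypothetical. The flags shown (`-sp` and `--long` for v3, `--auto-lineage` for v4) are taken from the BUSCO user guide as I understand it; `--long` is the option that asks BUSCO to run Augustus self-training, so check the guide for the installed version before relying on it.

```shell
# Dry run: build BUSCO command lines and print them instead of executing.
# Assumption: all file names and paths below are placeholders.
genome="Pgenerosa_v070.fa"

# BUSCO v3 style: Augustus species set explicitly with -sp;
# --long requests the (slower) Augustus self-training optimization.
busco_v3_cmd() {
  echo "run_BUSCO.py -i $1 -o $2 -l $3 -m genome -sp $4 --long"
}

# BUSCO v4 style: let BUSCO choose the lineage dataset automatically.
busco_v4_cmd() {
  echo "busco -i $1 -o $2 -m genome --auto-lineage"
}

busco_v3_cmd "$genome" pgen_v070_metazoa /path/to/metazoa_odb9 fly
busco_v4_cmd "$genome" pgen_v070_auto
```

Printing the commands first makes it easy to compare a run with the fly parameters against a self-trained run before committing cluster time to either.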

sr320 commented 4 years ago

BUSCO for v070 is 84.2%

Is there any way to determine what portion of v070 gives us this BUSCO bump?

sr320 commented 4 years ago

Disregard my Augustus question asking whether it was required for BUSCO; see https://busco.ezlab.org/busco_userguide.html#third-party-components

kubu4 commented 4 years ago

Is there any way to determine what portion of v070 gives us this BUSCO bump?

There is! Working on pulling the info now...

kubu4 commented 4 years ago

Sorry this took so long. Had one small typo that I kept overlooking. Anyway, count of "Complete" BUSCOs found in each scaffold in v070:

samb@computer:~/Downloads/busco_pgen/pgenv070$ awk -F "\t" 'NR > 6 && $2=="Complete" {print $3}' full_table_Pgenerosa_v070.tsv | sort | uniq -c | sort -rnk1,1 | awk -F"__" '{print $1}'
     84 PGA_scaffold1
     61 PGA_scaffold6
     55 PGA_scaffold8
     48 PGA_scaffold2
     47 PGA_scaffold3
     41 PGA_scaffold4
     37 PGA_scaffold12
     37 PGA_scaffold10
     36 PGA_scaffold5
     32 PGA_scaffold14
     27 PGA_scaffold13
     25 PGA_scaffold9
     25 PGA_scaffold7
     25 PGA_scaffold11
     24 PGA_scaffold16
     17 PGA_scaffold15
     14 PGA_scaffold17
      9 PGA_scaffold313634
      9 PGA_scaffold313633
      8 PGA_scaffold313628
      8 PGA_scaffold313624
      7 PGA_scaffold313623
      6 PGA_scaffold313636
      6 PGA_scaffold313625
      5 PGA_scaffold313637
      5 PGA_scaffold18
      4 PGA_scaffold313647
      4 PGA_scaffold313631
      3 PGA_scaffold313645
      3 PGA_scaffold313627
      2 PGA_scaffold718
      2 PGA_scaffold323
      2 PGA_scaffold313646
      2 PGA_scaffold313643
      2 PGA_scaffold313635
      2 PGA_scaffold313626
      1 PGA_scaffold85341
      1 PGA_scaffold72055
      1 PGA_scaffold70530
      1 PGA_scaffold68755
      1 PGA_scaffold67279
      1 PGA_scaffold67152
      1 PGA_scaffold66983
      1 PGA_scaffold66931
      1 PGA_scaffold65608
      1 PGA_scaffold65524
      1 PGA_scaffold65089
      1 PGA_scaffold63175
      1 PGA_scaffold61875
      1 PGA_scaffold61619
      1 PGA_scaffold58903
      1 PGA_scaffold58106
      1 PGA_scaffold57717
      1 PGA_scaffold57271
      1 PGA_scaffold55279
      1 PGA_scaffold55148
      1 PGA_scaffold5482
      1 PGA_scaffold53508
      1 PGA_scaffold50705
      1 PGA_scaffold47303
      1 PGA_scaffold43981
      1 PGA_scaffold39477
      1 PGA_scaffold39358
      1 PGA_scaffold38219
      1 PGA_scaffold32063
      1 PGA_scaffold31610
      1 PGA_scaffold313648
      1 PGA_scaffold313644
      1 PGA_scaffold313640
      1 PGA_scaffold313630
      1 PGA_scaffold313622
      1 PGA_scaffold30044
      1 PGA_scaffold27930
      1 PGA_scaffold212
      1 PGA_scaffold20469
      1 PGA_scaffold194334
      1 PGA_scaffold192
      1 PGA_scaffold12519
      1 PGA_scaffold121747
      1 PGA_scaffold114605
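
To put a number on how much of this count sits outside the 18 main scaffolds, the per-scaffold list above can be collapsed into two bins. This is a minimal sketch, not part of the original analysis: the function name `tally_main_vs_other` is hypothetical, and it assumes that PGA_scaffold1 through PGA_scaffold18 are the 18 scaffolds discussed earlier in the thread and that input lines have the `count name` shape printed above.

```shell
# Sketch: split per-scaffold "Complete" BUSCO counts into the 18 main
# scaffolds versus everything else. Reads "count scaffold_name" lines on stdin.
# Assumption: PGA_scaffold1..PGA_scaffold18 are the 18 main scaffolds.
tally_main_vs_other() {
  awk '
    {
      n = $2
      sub(/^PGA_scaffold/, "", n)       # keep only the numeric suffix
      if (n ~ /^[0-9]+$/ && n + 0 >= 1 && n + 0 <= 18) main += $1
      else other += $1
    }
    END {
      total = main + other
      if (total > 0)
        printf "main=%d other=%d total=%d other_pct=%.1f\n",
               main, other, total, 100 * other / total
    }
  '
}

# Demo on four rows copied from the output above (not the full list).
printf '%s\n' '84 PGA_scaffold1' '5 PGA_scaffold18' \
              '9 PGA_scaffold313634' '1 PGA_scaffold212' | tally_main_vs_other
# prints: main=89 other=10 total=99 other_pct=10.1
```

Appending `| tally_main_vs_other` to the pipeline above gives the same summary for every scaffold. For the rows as listed in this thread, that works out to 640 on PGA_scaffold1 through PGA_scaffold18 and 128 elsewhere. Note that the `$2=="Complete"` filter leaves out rows whose status is `Duplicated`, so these counts need not add up to the reported score.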