Ecogenomics / GTDBTk

GTDB-Tk: a toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes.
https://ecogenomics.github.io/GTDBTk/
GNU General Public License v3.0

Genomes were filtered #302

Closed Biofarmer closed 3 years ago

Biofarmer commented 3 years ago

Hi,

May I ask how to deal with genomes that have been filtered out with the message "... user genomes have amino acids in <10.0% of columns in filtered MSA"? I have attached the log file below. No error is reported, but no classify folder is produced. Examples are the NCBI genomes GCF_002912445.1 and GCF_000758865.1.

[2021-01-09 17:50:41] INFO: GTDB-Tk v1.3.0
[2021-01-09 17:50:41] INFO: gtdbtk classify_wf --genome_dir /genome/ --out_dir /output/ --extension fna --cpus 60
[2021-01-09 17:50:41] INFO: Using GTDB-Tk reference data version r95: /data/databases/gtdb-tk/release95
[2021-01-09 17:50:41] INFO: Identifying markers in 3 genomes with 60 threads.
[2021-01-09 17:50:41] INFO: Running Prodigal V2.6.3 to identify genes.
[2021-01-09 17:51:10] INFO: Identifying TIGRFAM protein families.
[2021-01-09 17:51:19] INFO: Identifying Pfam protein families.
[2021-01-09 17:51:20] INFO: Annotations done using HMMER 3.1b2 (February 2015).
[2021-01-09 17:51:21] INFO: Done.
[2021-01-09 17:51:25] INFO: Aligning markers in 3 genomes with 60 threads.
[2021-01-09 17:51:25] INFO: Processing 2 genomes identified as bacterial.
[2021-01-09 17:51:31] INFO: Read concatenated alignment for 30,238 GTDB genomes.
[2021-01-09 17:51:34] INFO: Masking columns of bacterial multiple sequence alignment using canonical mask.
[2021-01-09 17:52:17] INFO: Masked bacterial alignment from 41,155 to 5,040 AAs.
[2021-01-09 17:52:17] INFO: 2 bacterial user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-01-09 17:52:17] INFO: Creating concatenated alignment for 30,238 bacterial GTDB and user genomes.
[2021-01-09 17:52:17] INFO: All bacterial user genomes have been filtered out.
[2021-01-09 17:52:17] INFO: Processing 1 genomes identified as archaeal.
[2021-01-09 17:52:18] INFO: Read concatenated alignment for 1,672 GTDB genomes.
[2021-01-09 17:52:20] INFO: Masking columns of archaeal multiple sequence alignment using canonical mask.
[2021-01-09 17:52:22] INFO: Masked archaeal alignment from 32,675 to 5,124 AAs.
[2021-01-09 17:52:22] INFO: 1 archaeal user genomes have amino acids in <10.0% of columns in filtered MSA.
[2021-01-09 17:52:22] INFO: Creating concatenated alignment for 1,672 archaeal GTDB and user genomes.
[2021-01-09 17:52:23] INFO: All archaeal user genomes have been filtered out.
[2021-01-09 17:52:23] INFO: Done.
[2021-01-09 17:52:23] INFO: Done.

Many thanks Wang

donovan-h-parks commented 3 years ago

Hi. You can set the minimum percentage of amino acids required in the multiple sequence alignment with the --min_perc_aa parameter. The default is 10%, which is already low. I would not trust GTDB-Tk results for genomes that have <10% amino acids in the MSA. Generally, genomes with such a low percentage of amino acids are poor-quality assemblies or "unusual" in some way (e.g., some endosymbionts with reduced genomes are likely to be a challenge).
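To make the threshold concrete, here is a minimal sketch of how the percentage of amino acids in one aligned sequence could be computed and compared against the cutoff. The function names `percent_aa` and `passes_min_perc_aa` are hypothetical, and treating every non-alphabetic character as a gap is an assumption. This illustrates the idea and is not GTDB-Tk's own code.

```python
def percent_aa(aligned_seq: str) -> float:
    """Return the percentage of alignment columns that hold an amino acid.

    Assumption: alphabetic characters are amino acids; anything else
    (for example '-' or '.') is treated as a gap.
    """
    if not aligned_seq:
        return 0.0
    aa_count = sum(1 for c in aligned_seq if c.isalpha())
    return 100.0 * aa_count / len(aligned_seq)


def passes_min_perc_aa(aligned_seq: str, min_perc_aa: float = 10.0) -> bool:
    """True if the sequence reaches the minimum amino acid percentage."""
    return percent_aa(aligned_seq) >= min_perc_aa
```

A row of the filtered MSA with amino acids in only 5% of its columns would fail the default 10% cutoff, which matches the "have amino acids in <10.0% of columns" message in the log above.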

Biofarmer commented 3 years ago

Hi Donovan, thank you very much for the fast reply. It makes sense to me. Best, Wang

Biofarmer commented 3 years ago

Hi, I see some genomes annotated as 'Undefined (Failed Quality Check)' on the GTDB website. May I ask whether this is due to <10% amino acids in the MSA when running gtdbtk classify_wf, or to something else, as indicated in the FAQ:

Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:

CheckM completeness estimate >50%
CheckM contamination estimate <10%
quality score, defined as completeness - 5*contamination, >50
contain >40% of the bac120 or arc122 marker genes
contain <1000 contigs
have an N50 >5kb
contain <100,000 ambiguous bases
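The criteria quoted above can be sketched as a simple check. The `GenomeStats` class, its field names, and the `failed_criteria` function are hypothetical and exist only for illustration; the thresholds are the ones listed in the FAQ text, with the N50 cutoff of 5kb read as 5,000 bp.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeStats:
    completeness: float     # CheckM completeness estimate, in percent
    contamination: float    # CheckM contamination estimate, in percent
    marker_perc: float      # percent of bac120 or arc122 marker genes present
    contigs: int            # number of contigs
    n50: int                # N50, in base pairs
    ambiguous_bases: int    # number of ambiguous bases


def quality_score(completeness: float, contamination: float) -> float:
    """Quality score as defined in the FAQ: completeness - 5*contamination."""
    return completeness - 5 * contamination


def failed_criteria(g: GenomeStats) -> List[str]:
    """Return the names of the quoted criteria that the genome does not meet."""
    checks = {
        "completeness >50%": g.completeness > 50,
        "contamination <10%": g.contamination < 10,
        "quality score >50": quality_score(g.completeness, g.contamination) > 50,
        "marker genes >40%": g.marker_perc > 40,
        "contigs <1000": g.contigs < 1000,
        "N50 >5kb": g.n50 > 5000,
        "ambiguous bases <100,000": g.ambiguous_bases < 100_000,
    }
    return [name for name, ok in checks.items() if not ok]
```

For instance, a genome with 60% completeness and 4% contamination passes the first two checks individually but has a quality score of 60 - 5*4 = 40, so it fails the quality score criterion.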

Thanks Wang

donovan-h-parks commented 3 years ago

GTDB-Tk only filters genomes based on the 10% amino acid criterion. It otherwise assumes you are providing genomes of reasonable quality.

Biofarmer commented 3 years ago

Okay, thanks for the confirmation. But it cannot tell which reason applies, right?

Biofarmer commented 3 years ago

But it is not possible to know the exact reason, right?

donovan-h-parks commented 3 years ago

GTDB-Tk only filters genomes if they fail the --min_perc_aa criterion which excludes genomes that do not have at least the specified percentage of AA in the MSA. Why a given genome fails this test is not something GTDB-Tk can determine.

Biofarmer commented 3 years ago

Sorry, I mean, for example, GCA_001028125.1 is "Undefined (Failed Quality Check)" on the GTDB website, and I am curious whether this is due to the <10% amino acid criterion or whether it was never included in GTDB because of the criteria mentioned in the FAQ.

donovan-h-parks commented 3 years ago

The following file provides information about genomes failing the GTDB inclusion criteria: https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/qc_failed.tsv
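Once that file has been downloaded, a lookup for one accession could look like the sketch below. The function name `find_accession` is hypothetical, and the column layout of qc_failed.tsv is not assumed: the code only expects a tab-separated file with a header row and searches every field for the accession as a substring, so a prefixed accession would still match.

```python
import csv
from typing import Dict, Iterable, List


def find_accession(lines: Iterable[str], accession: str) -> List[Dict[str, str]]:
    """Return all rows of a tab-separated table that mention the accession.

    `lines` can be an open text file or any iterable of lines. The first
    line is treated as the header; no specific column names are assumed.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return [
        row
        for row in reader
        if any(isinstance(v, str) and accession in v for v in row.values())
    ]
```

With a local copy saved as qc_failed.tsv, `find_accession(open("qc_failed.tsv", newline=""), "GCA_001028125.1")` would return the matching rows as dictionaries keyed by the header names, or an empty list if the genome is not listed.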

Biofarmer commented 3 years ago

Good to learn, and many thanks.