Biofarmer closed this issue 3 years ago
Hi. You can set the minimum percentage of amino acids a genome must have in the multiple sequence alignment with the min_perc_aa parameter. The default is 10%, which is already low. I would not trust GTDB-Tk results for genomes that have <10% amino acids in the MSA. Generally, genomes with such a low percentage of amino acids are poor-quality assemblies or "unusual" in some way (e.g., some endosymbionts with reduced genomes are likely to be a challenge).
Hi Donovan, thank you very much for the fast reply. It makes sense to me. Best, Wang
Hi, I see some genomes that are annotated as 'Undefined (Failed Quality Check)' on the GTDB website. May I ask if this is due to <10% amino acids in the MSA when running gtdbtk classify_wf, or to anything else indicated in the FAQ:
Genomes are obtained from NCBI and must meet the following criteria to be included in the GTDB reference trees and database:
CheckM completeness estimate >50%
CheckM contamination estimate <10%
quality score, defined as completeness - 5*contamination, >50
contain >40% of the bac120 or arc122 marker genes
contain <1000 contigs
have an N50 >5kb
contain <100,000 ambiguous bases
Thanks Wang
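For readers who want to check a genome against the FAQ criteria listed above, here is a minimal sketch. It is not GTDB code: the GenomeStats class, its field names, and the failed_criteria function are hypothetical, and 5kb is assumed to mean 5000 bp. Only the thresholds come from the FAQ text quoted above.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Hypothetical container; field names are assumptions, not GTDB names."""
    completeness: float   # CheckM completeness estimate, in percent
    contamination: float  # CheckM contamination estimate, in percent
    marker_perc: float    # percent of bac120 or arc122 marker genes present
    contigs: int          # number of contigs
    n50: int              # N50 in bp
    ambiguous_bases: int  # number of ambiguous bases


def failed_criteria(g):
    """Return the names of the FAQ criteria that the genome does not meet."""
    quality = g.completeness - 5 * g.contamination
    checks = {
        "completeness >50%": g.completeness > 50,
        "contamination <10%": g.contamination < 10,
        "quality score >50": quality > 50,
        "marker genes >40%": g.marker_perc > 40,
        "contigs <1000": g.contigs < 1000,
        "N50 >5kb": g.n50 > 5000,  # assumes 5kb = 5000 bp
        "ambiguous bases <100,000": g.ambiguous_bases < 100_000,
    }
    return [name for name, ok in checks.items() if not ok]
```

Note that a genome can pass the completeness and contamination thresholds individually and still fail on the quality score, for example at 60% completeness and 4% contamination (60 - 5*4 = 40).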
GTDB-Tk only filters genomes based on the 10% amino acid criterion. It otherwise assumes you are providing genomes of reasonable quality.
Okay, thanks for the confirmation. But it is not possible to know what the exact reason is, right?
GTDB-Tk only filters genomes if they fail the --min_perc_aa criterion, which excludes genomes that do not have at least the specified percentage of AA in the MSA. Why a given genome fails this test is not something GTDB-Tk can determine.
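To make the criterion concrete, here is a rough sketch of the kind of check described above. It is not taken from the GTDB-Tk source: the function names are hypothetical, and it assumes that gaps in the aligned sequence are written as "-" or ".".

```python
def perc_aa(aligned_seq, gap_chars="-."):
    """Percentage of alignment columns that hold an amino acid (not a gap)."""
    if not aligned_seq:
        return 0.0
    aa_count = sum(1 for c in aligned_seq if c not in gap_chars)
    return 100.0 * aa_count / len(aligned_seq)


def passes_min_perc_aa(aligned_seq, min_perc_aa=10.0):
    """True if the genome has at least min_perc_aa percent AA in the MSA."""
    return perc_aa(aligned_seq) >= min_perc_aa
```

For example, a row of 20 columns with a single amino acid gives 5%, so it would be filtered at the default of 10%. The check only says that the row is sparse, not why.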
Sorry, I mean, for example: GCA_001028125.1 is "Undefined (Failed Quality Check)" on the GTDB website, and I am just curious to know whether this is due to the <10% amino acid criterion, or whether it was not included in GTDB in the first place due to the criteria mentioned in the FAQ.
The following file provides information about genomes failing the GTDB inclusion criteria: https://data.gtdb.ecogenomic.org/releases/release202/202.0/auxillary_files/qc_failed.tsv
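If it helps, a downloaded copy of that file can be searched for an accession with a few lines of Python. This is only a sketch: find_accession is a hypothetical helper, and it assumes the file is tab-separated with a header on the first line. It makes no assumption about column names and simply matches the accession against every field.

```python
import csv


def find_accession(lines, accession):
    """Return (header, matching_rows) from an iterable of TSV lines.

    Assumes the first line is a header row. A row matches if any of its
    fields contains the accession string.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader, None)
    hits = [row for row in reader if any(accession in field for field in row)]
    return header, hits


# Usage, assuming qc_failed.tsv was saved to the current directory:
# with open("qc_failed.tsv", newline="") as fh:
#     header, hits = find_accession(fh, "GCA_001028125.1")
#     for row in hits:
#         print(dict(zip(header, row)))
```

An empty list of hits would suggest the genome is not recorded in that file as failing the inclusion criteria.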
Good to learn, and many thanks.
Hi,
May I ask how to deal with genomes that have been filtered with the message "... user genomes have amino acids in <10.0% of columns in filtered MSA"? I have attached the log file below. No error is reported, but no classify folder is produced. Examples are the NCBI genomes GCF_002912445.1 and GCF_000758865.1.
Many thanks Wang