AllTheBacteria / AllTheBacteria

Follow up to Grace Blackwell's 661k dataset, for 2023
MIT License

Some samples are actually Host genomes #37

Open yhg926 opened 2 weeks ago

yhg926 commented 2 weeks ago

We found some unusual genomes yesterday, such as achromobacter_xylosoxidans_01/SAMN12335635 and acinetobacter_baylyi/SAMEA6124625. These genomes share almost no k-mers with others of the same species. After a quick discussion with Prof. Wei Shen, one of the authors from your team, I have a basic idea of the cause:

  1. Sylph only reports GTDB species, but these samples contain non-GTDB species (e.g. a sponge for acinetobacter_baylyi/SAMEA6124625), which Sylph cannot detect;
  2. The assembly software ignored the low read content of Acinetobacter baylyi and did not generate any contigs for it. All the contigs came from the sponge, so there are no shared k-mers with other Acinetobacter baylyi samples.

So I suggest comparing each genome to a GTDB representative genome of its species and filtering out those with low similarity. My software KSSD (https://github.com/yhg926/public_kssd) would be a good tool for this task, and I am more than willing to help.
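
To make the idea concrete, here is a minimal sketch of the comparison in plain Python. It is not KSSD and not the AllTheBacteria pipeline: it uses exact canonical k-mer sets and Jaccard similarity rather than sketching, and the function names, the default `k=21`, and the `min_jaccard=0.1` cutoff are all assumptions chosen for illustration.

```python
from typing import Dict, List, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq: str, k: int = 21) -> Set[str]:
    """Return the set of canonical k-mers (the lexicographic minimum of a
    k-mer and its reverse complement), skipping k-mers with ambiguous bases."""
    seq = seq.upper()
    kmers: Set[str] = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # contains N or another non-ACGT character
        rev_comp = kmer.translate(_COMPLEMENT)[::-1]
        kmers.add(min(kmer, rev_comp))
    return kmers


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flag_outliers(
    assemblies: Dict[str, str],
    representative: str,
    k: int = 21,
    min_jaccard: float = 0.1,  # hypothetical cutoff, not a validated value
) -> List[str]:
    """Return names of assemblies whose k-mer similarity to the species
    representative falls below min_jaccard."""
    rep_kmers = canonical_kmers(representative, k)
    return [
        name
        for name, seq in assemblies.items()
        if jaccard(canonical_kmers(seq, k), rep_kmers) < min_jaccard
    ]
```

Here `assemblies` maps a sample name to its concatenated contig sequence. At the scale of this dataset, exact k-mer sets would be too expensive, so a sketching tool would be the practical choice; the sketch above only shows the logic of the filter.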

Best, Huiguang Yi

martinghunt commented 2 weeks ago

Thanks for spotting this! Do you know how many samples this affects? Will have a look into it when I get more time. Need to see what else sylph reported for those samples. First thoughts are whether we can add more filtering so that these would at least get removed from the high quality set (they are currently both counted as "high quality").

yhg926 commented 2 weeks ago

There are more, but I have not explored them thoroughly. Maybe your high quality set only considers assembly quality? I believe these samples were both well assembled, but they are not prokaryotic genomes (though they may contain one), and sylph won't report non-prokaryotic components. If possible, including a CheckM QC step would be helpful.

iqbal-lab commented 2 weeks ago

Our HQ definition currently uses assembly quality, CheckM, and sylph (major species abundance threshold). It looks like we need an additional filter to ensure the majority of the data comes from a species we recognise in GTDB.
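
As a rough sketch of how such an extra filter could sit alongside the existing criteria: the `SampleQC` record, its field names, and both thresholds below are hypothetical and are not taken from the actual AllTheBacteria pipeline. How to estimate `recognised_fraction` is the open question in this thread and is not addressed here.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample QC summary (field names are assumptions)."""
    assembly_ok: bool               # passed the assembly quality checks
    checkm_ok: bool                 # passed the CheckM checks
    major_species_abundance: float  # percent abundance of the top species call
    recognised_fraction: float      # fraction of the data from GTDB species


def is_high_quality(
    qc: SampleQC,
    min_major_abundance: float = 99.0,     # hypothetical threshold
    min_recognised_fraction: float = 0.5,  # "majority of the data"
) -> bool:
    """Combine the existing criteria with the proposed additional filter."""
    return (
        qc.assembly_ok
        and qc.checkm_ok
        and qc.major_species_abundance >= min_major_abundance
        and qc.recognised_fraction > min_recognised_fraction
    )
```

Under these assumptions, a well-assembled sample whose contigs come almost entirely from a host would pass the first three checks but fail the last one.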