KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io
Other
40 stars 15 forks source link

The number of contigs after autometa-taxonomy is not consistent to the original contig fasta #296

Closed chtsai0105 closed 1 year ago

chtsai0105 commented 2 years ago

User checklist

Description

Hi, sorry to bother again! I was working on after-binning analysis and load the taxonomy.tsv, one of the output of the autometa-taxonomy, to pandas. I found there are only 32,239 contigs while the original contig fasta after filtering has 32,329 - 90 contigs are missing. I thought any contig that autometa cannot classified would finally being throw into the unclassified.fasta but seems like it's not the case.

First I checked whether these are all short contigs. image Indeed, most of them are quite short, just a bit above the 3,000 bp cutoff. However, some contigs over 10k are still included in this list.

I then focus on the longest contig to see on which step it have been discarded. I can still see it In the output of autometa-orf

grep NODE_2954_length_12339_cov_4.191143 predict_orfs.fna

>NODE_2954_length_12339_cov_4.191143_1 # 3 # 5102 # 1 # ID=67_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.292
>NODE_2954_length_12339_cov_4.191143_2 # 5108 # 9229 # 1 # ID=67_2;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=12bp;gc_cont=0.295
>NODE_2954_length_12339_cov_4.191143_3 # 9222 # 9605 # 1 # ID=67_3;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=11bp;gc_cont=0.328
>NODE_2954_length_12339_cov_4.191143_4 # 9654 # 11663 # 1 # ID=67_4;partial=00;start_type=ATG;rbs_motif=TAA;rbs_spacer=13bp;gc_cont=0.325
>NODE_2954_length_12339_cov_4.191143_5 # 11690 # 12337 # 1 # ID=67_5;partial=01;start_type=ATG;rbs_motif=TAA;rbs_spacer=4bp;gc_cont=0.449

However, in the output of the autometa-markers (both hmmscan and markers), the contig is missing.

grep NODE_2954_length_12339_cov_4.191143 bacteria.hmmscan.tsv
grep NODE_2954_length_12339_cov_4.191143 bacteria.markers.tsv
(both have no output)

I know there is also an option allow one to specify the kingdom (bacteria/archaea) while running autometa-markers. But even I only ran on bacteria, I still got several fasta from different kingdoms and also an unclassified fasta after autometa-taxonomy. In this case, I think it's better to keep those contigs without any markers into the unclassified fasta.

Expected Behavior

Expect the contig numbers would be the same as the original contig fasta. Those contigs that are unable to be classified would be throw into unclassified.fasta.

evanroyrees commented 2 years ago

Hello @chtsai0105, thank you for reaching out. We can try to look in to this further, but first I would like to note that only marker-containing contigs will be in the '{kingdom}.markers.tsv' files. Any contigs that do not contain markers will not be found in these files. (It appears NODE_2954_length_12339_cov_4.191143 may not contain any of these markers)

Is NODE_2954_length_12339_cov_4.191143 within the taxonomy.tsv file?

Is it in any of the written fasta files? Any outputs from this?

grep -l ">NODE_2954_length_12339_cov_4.191143" autometa-taxonomy-output-directory/*.fasta
chtsai0105 commented 2 years ago

Hi Evan,

I ran the command to grep all the fasta in taxonomy output and found nothing. These contigs are totally gone in the following steps.

I then checked the diamond blastp output and these contigs are also missing. Just for curious, I also extract some contig names from unclassified.fna and eukaryota.fna and they do have matches in blastp output. So I guess this is what happening - some of the contigs cannot match anything through blastp so they're not recorded at all. These are probably misassembled contigs. For those contigs recorded as 'unclassified' are actually having ambiguous blastp matches and cannot being classified to certain kingdom.

evanroyrees commented 2 years ago

Ah, we should probably sort this out so that we can pass these (contigs without blastp hits) on to unclassified.fna.

This should be a relatively quick fix. I'll look in to the code and get back to you.

chasemc commented 1 year ago

@WiscEvan was this fiixed?

evanroyrees commented 1 year ago

Hi @chtsai0105, this has been fixed in #343 and is included in the most recent release v2.2.1 (also on main and dev)

🚀 Release v2.2.1: https://github.com/KwanLab/Autometa/releases/tag/2.2.1

The bioconda distribution is not currently available but will be updated in ~24-48 hours