Closed chtsai0105 closed 1 year ago
Hello @chtsai0105, thank you for reaching out. We can try to look in to this further, but first I would like to note that only marker-containing contigs will be in the '{kingdom}.markers.tsv'
files. Any contigs that do not contain markers will not be found in these files. (It appears NODE_2954_length_12339_cov_4.191143
may not contain any of these markers)
Is NODE_2954_length_12339_cov_4.191143
within the taxonomy.tsv
file?
Is it in any of the written fasta files? Any outputs from this?
grep -l ">NODE_2954_length_12339_cov_4.191143" autometa-taxonomy-output-directory/*.fasta
Hi Evan,
I ran the command to grep all the fasta in taxonomy output and found nothing. These contigs are totally gone in the following steps.
I then checked the diamond blastp output and these contigs are also missing. Just for curious, I also extract some contig names from unclassified.fna and eukaryota.fna and they do have matches in blastp output. So I guess this is what happening - some of the contigs cannot match anything through blastp so they're not recorded at all. These are probably misassembled contigs. For those contigs recorded as 'unclassified' are actually having ambiguous blastp matches and cannot being classified to certain kingdom.
Ah, we should probably sort this out so that we can pass these (contigs without blastp hits) on to unclassified.fna
.
This should be a relatively quick fix. I'll look in to the code and get back to you.
@WiscEvan was this fiixed?
Hi @chtsai0105, this has been fixed in #343 and is included in the most recent release v2.2.1 (also on main and dev)
🚀 Release v2.2.1: https://github.com/KwanLab/Autometa/releases/tag/2.2.1
The bioconda distribution is not currently available but will be updated in ~24-48 hours
User checklist
Description
Hi, sorry to bother again! I was working on after-binning analysis and load the
taxonomy.tsv
, one of the output of theautometa-taxonomy
, to pandas. I found there are only 32,239 contigs while the original contig fasta after filtering has 32,329 - 90 contigs are missing. I thought any contig that autometa cannot classified would finally being throw into theunclassified.fasta
but seems like it's not the case.First I checked whether these are all short contigs. Indeed, most of them are quite short, just a bit above the 3,000 bp cutoff. However, some contigs over 10k are still included in this list.
I then focus on the longest contig to see on which step it have been discarded. I can still see it In the output of
autometa-orf
grep NODE_2954_length_12339_cov_4.191143 predict_orfs.fna
However, in the output of the
autometa-markers
(bothhmmscan
andmarkers
), the contig is missing.I know there is also an option allow one to specify the kingdom (bacteria/archaea) while running
autometa-markers
. But even I only ran on bacteria, I still got several fasta from different kingdoms and also an unclassified fasta afterautometa-taxonomy
. In this case, I think it's better to keep those contigs without any markers into the unclassified fasta.Expected Behavior
Expect the contig numbers would be the same as the original contig fasta. Those contigs that are unable to be classified would be throw into unclassified.fasta.