mimouschka closed this issue 6 months ago
Hi @mimouschka, there could be several causes for that particular issue. I don't have enough information to answer more precisely.
Could you please check whether SWARM_8316 and SWARM_2575 are present in otu_counts_refined.tsv? Check for exact full-length matches.
vsearch defines a fasta header as the string between the initial '>' symbol and the first space, tab, or end of line. If your OTU identifiers contain spaces or non-ASCII characters, vsearch will truncate the identifiers.
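For illustration, here is a minimal sketch of such an exact full-length check. It assumes the table is tab-separated with OTU identifiers in the first column; the toy table contents are made up and only stand in for otu_counts_refined.tsv.

```shell
# Toy OTU table standing in for otu_counts_refined.tsv (contents are made up).
table=$(mktemp)
printf 'OTU\ts1\ts2\n"SWARM_8316"\t5\t0\nSWARM_25751\t1\t2\n' > "$table"

# Exact check: -F treats patterns as fixed strings, -x requires the whole
# line to match, and cut restricts the search to the first column.
exact_hits=$(cut -f 1 "$table" | grep -c -F -x -e 'SWARM_8316' -e 'SWARM_2575' || true)

# Substring check, shown only for contrast: it also counts quoted or
# longer identifiers that merely contain the pattern.
loose_hits=$(grep -c -e 'SWARM_8316' -e 'SWARM_2575' "$table" || true)

echo "exact: $exact_hits, substring: $loose_hits"
rm -f "$table"
```

On this toy table the substring search reports two hits while the exact search reports none, because one identifier is quoted and the other is longer than the pattern.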
Hi again! I can confirm that both SWARM_8316 and SWARM_2575 are present in the otu_counts_refined.tsv file.
My fasta headers are in the format ">SWARM_8316", so there are no spaces in the headers.
I've attached the files (tsv and fasta): mumu-test.zip
The issue comes from your OTU table otu_counts_refined.tsv.
In the fasta file and the generated matchlist mumu.matchlist, OTU identifiers are written as such: SWARM_8316, whereas in the OTU table, identifiers are quoted: "SWARM_8316". mumu sees these as two different identifiers (which is correct). When exporting your table to a tsv file, you can choose whether or not to surround strings with quotes. Alternatively, you can remove the quotes post-facto:
sed -i 's/\"//g' otu_counts_refined.tsv
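To verify that identifiers agree between the two files, one option is to list the match-list identifiers that are missing from the table, before and after stripping quotes. This is only a sketch: it assumes tab-separated files, a header line in the table, identifiers in the table's first column and in the first two columns of the match list. The toy file contents and the helper name missing_ids are made up for illustration. It writes to a new file instead of using sed -i, so the original table is left untouched.

```shell
workdir=$(mktemp -d)

# Toy inputs (made up): quoted identifiers in the table, bare ones in the match list.
printf 'OTU\ts1\n"SWARM_8316"\t5\n"SWARM_2575"\t3\n' > "$workdir/table.tsv"
printf 'SWARM_8316\tSWARM_2575\t97.0\n' > "$workdir/matchlist"

# Identifiers used in the match list (columns 1 and 2), one per line, sorted.
cut -f 1,2 "$workdir/matchlist" | tr '\t' '\n' | sort -u > "$workdir/match_ids"

# Hypothetical helper: print match-list identifiers absent from the
# first column of the table given as argument (header line skipped).
missing_ids() {
    tail -n +2 "$1" | cut -f 1 | sort -u > "$workdir/table_ids"
    comm -23 "$workdir/match_ids" "$workdir/table_ids"
}

missing_before=$(missing_ids "$workdir/table.tsv" | wc -l | tr -d ' ')

# Strip the quotes into a copy of the table, then compare again.
sed 's/"//g' "$workdir/table.tsv" > "$workdir/table_unquoted.tsv"
missing_after=$(missing_ids "$workdir/table_unquoted.tsv" | wc -l | tr -d ' ')

echo "missing before: $missing_before, after: $missing_after"
rm -rf "$workdir"
```

With the quoted toy table, both match-list identifiers are reported as missing; after removing the quotes, none are.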
If you're starting from a phyloseq object, I suggest you try the mumu wrapper mumu_pq.
Don't forget to close the issue if you consider it solved.
Amazing, thanks so much! Yes, I think I forgot to use quote=F when exporting the tsv file, but your post-facto command worked perfectly too :)
Now, regarding your last comment on the mumu wrapper: I was actually thinking of using mumu right after swarm, since it seems I can increase the number of threads so that mumu can handle larger tables. Do you think that would work? The main reason I went into R before lulu was to clean up the data and have a smaller table to curate, so if I can spare myself the back and forth, that would be great :)
> Now, regarding your last comment on the mumu wrapper, I was actually thinking to use mumu right after swarm as it seems I can increase the number of threads, so mumu can handle larger tables. Do you think that would work?
swarm and vsearch are multithreaded and can take advantage of multi-core CPUs. However, mumu is not yet multithreaded. There is a --threads option but it has no effect; mumu always uses one thread. I developed mumu because lulu was too slow for the dataset I wanted to process (10,000 samples, 100,000 OTUs). mumu is currently fast enough to process such datasets in less than an hour, so there is little incentive to make it multithreaded.
That being said, I would like to make mumu faster at parsing input files. That could bring a 2x speed-up.
Hi @frederic-mahe, thanks for MUMU, this is great and very well explained ;-) Following the manual, I built a match list using:
vsearch --usearch_global lulu_input/otu_refined.fasta --db lulu_input/otu_refined.fasta --self --id 0.84 --iddef 1 --userfields query+target+id --maxaccepts 0 --query_cov 0.9 --maxhits 10 --userout lulu/mumu.matchlist
and then ran mumu:
mumu --otu_table lulu_input/otu_counts_refined.tsv --match_list lulu/mumu.matchlist --new_otu_table lulu/OTU_table_mumu_84_0.9_1.tsv --log lulu/mumu.log --minimum_relative_cooccurence 0.90
I get many warnings such as:
As I produced the otu_refined.fasta and the otu_counts_refined table from the same phyloseq object, I don't really understand how some OTUs in the match list are not in the OTU table...