Closed NickShanyt closed 1 year ago
in this way:
python metagenomesimulation.py test/5-2636-2.ini
Unfortunately, CAMISIM has an odd requirement for its taxdumps: the zipped taxdump must contain a folder named NCBI. You would therefore have to unzip your taxdump, put its contents in a folder named NCBI, and then zip it again. I am sorry for the hassle; please report back whether that worked.
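The repackaging step described above can be sketched in Python. This is only a sketch under assumptions: it takes a directory of already extracted taxdump files and writes a gzipped tar archive (the format of the default ncbi-taxonomy_20180226.tar.gz mentioned later in this thread), placing every file under an NCBI/ folder. The function name and the paths in the usage note are hypothetical.

```python
import tarfile
from pathlib import Path


def repack_taxdump(dump_dir, out_archive, folder_name="NCBI"):
    """Pack every file in dump_dir into out_archive under folder_name/.

    Assumption: CAMISIM only needs the files to sit inside a top-level
    folder called NCBI, as described in the reply above.
    """
    dump_dir = Path(dump_dir)
    files = sorted(p for p in dump_dir.iterdir() if p.is_file())
    if not files:
        raise ValueError(f"no files found in {dump_dir}")
    with tarfile.open(out_archive, "w:gz") as tar:
        for path in files:
            # arcname sets the path stored inside the archive
            tar.add(path, arcname=f"{folder_name}/{path.name}")
    return [f"{folder_name}/{p.name}" for p in files]
```

For example, `repack_taxdump("taxdump_extracted", "my-taxonomy.tar.gz")` would return the member names written to the archive, which makes it easy to confirm that each one starts with `NCBI/`.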
Many thanks for your quick reply. After testing, it works. Indeed, this is easy to overlook: when I compared the file contents I did find the difference, an extra layer of folders, but it didn't catch my attention.
————————————————————————
And I need some help with the configuration file.
I tested different values for these parameters:
genomes_total=5 genomes_real=5 number_of_samples=5
I now know that number_of_samples determines how many samples will be simulated, and that genomes_total determines how many genomes will be selected randomly from metadata.tsv and genome_to_id.tsv, but I don't understand what genomes_real=5 would change. I tested it but found nothing different.
CAMISIM offers the possibility to simulate strains using sgEvolver from the Mauve suite. If you define genomes_real to be less than genomes_total, then CAMISIM will simulate the difference in strains. If this is of no interest to you, then you can just keep genomes_real = genomes_total.
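To make the relationship between the two parameters concrete, here is a small Python sketch that reads a config and reports how many strains would be simulated according to the explanation above. The section name community0 in the example text is an assumption, so the helper simply scans every section that defines both keys; the helper name is hypothetical and not part of CAMISIM.

```python
import configparser


def strains_to_simulate(config_text):
    """Return {section: genomes_total - genomes_real} for each section
    that defines both options."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    result = {}
    for section in parser.sections():
        if (parser.has_option(section, "genomes_total")
                and parser.has_option(section, "genomes_real")):
            total = parser.getint(section, "genomes_total")
            real = parser.getint(section, "genomes_real")
            if real > total:
                raise ValueError(
                    f"[{section}] genomes_real exceeds genomes_total")
            result[section] = total - real
    return result


# Assumed section name; only the two option names come from this thread.
example = """
[community0]
genomes_total = 30
genomes_real = 20
"""
print(strains_to_simulate(example))  # {'community0': 10}
```

With genomes_real equal to genomes_total the difference is 0, which matches the case where no strains are simulated.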
I'm not sure if I understand correctly. For example, if I define genomes_real=20 and genomes_total=30, does it mean that the program will simulate 10 (30-20=10) genomes based on the 20 provided genomes?
Also, when I set the parameters as mentioned above, both parameters in the config.ini file in the out folder are set to the same value, which is 30.
Should they perhaps be used in conjunction with other parameters?
Yes, in theory that is what should happen, and I am confused that it did not work. The numbers in the output config.ini are not necessarily wrong, though: after the genomes have been simulated they are treated as regular input genomes, so if you want to re-run your simulation, it would be correct to use 30 for genomes_real, because the newly simulated genomes count as "real" for the next run. You would have to check the CAMISIM log to see whether genomes were simulated, and/or check whether simulated genomes appear in the metadata.tsv/genome_to_id.tsv and in the genomes folder CAMISIM creates.
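One way to do the second check is to compare the genome IDs in the input genome_to_id.tsv with those in the one CAMISIM writes to the output folder. The sketch below assumes a tab-separated file with the genome ID in the first column and no header line; the function names are hypothetical.

```python
import csv


def read_genome_ids(tsv_path):
    """Return the set of non-empty values in the first column of a TSV file."""
    ids = set()
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and row[0].strip():
                ids.add(row[0].strip())
    return ids


def new_genome_ids(input_tsv, output_tsv):
    """List IDs present in the output file but absent from the input file."""
    return sorted(read_genome_ids(output_tsv) - read_genome_ids(input_tsv))
```

An empty result would suggest that every genome in the output was taken directly from the input, so no strains were simulated.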
I checked the genomes folder, and the sequence files in there were all selected from my metadata.
Ignoring this for now, I now have a new problem that arises when I add some new sequences as community2 ......
Oh wow, this whole thread is really a deep dive into functionalities of CAMISIM which are not used very often. It seems like the option to not anonymize the data set (you probably set this option to False?) is incompatible with creating multiple communities. I fear that if you want to simulate multiple communities and have non-anonymized reads, you will have to set the abundances of your second community manually. Alternatively it should work when turning anonymization on.
Sorry I'm a little late in replying. The one I mentioned wasn't actually a problem with CAMISIM itself, I located and fixed it myself. The problem was due to the presence of a "-" in the sequence ID, and Samtools was reporting an error for this problem.
Yeah, unfortunately there are some problems with special characters in sequence IDs. I am glad you got that figured out.
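For anyone hitting the same issue, a minimal Python sketch for cleaning sequence IDs in a FASTA file before running the simulation might look like this. Only the "-" character is reported as problematic in this thread; treating everything outside letters, digits, "_" and "." as unsafe is an assumption, and the function name is hypothetical.

```python
import re

# Assumption: anything outside this set may cause trouble downstream.
UNSAFE = re.compile(r"[^A-Za-z0-9_.]")


def sanitize_fasta_ids(lines, replacement="_"):
    """Yield FASTA lines with unsafe characters in each sequence ID replaced.

    The ID is the header text up to the first space; any description
    after that space is left untouched.
    """
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            seq_id, sep, description = header.partition(" ")
            clean = UNSAFE.sub(replacement, seq_id)
            yield f">{clean}{sep}{description}\n"
        else:
            yield line
```

Note that renaming can make two IDs collide (for example "a-1" and "a_1"), so it is worth checking that the cleaned IDs are still unique, and any other file that refers to the old sequence IDs needs the same change.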
Hi, I downloaded a newer version of the taxdump, actually the latest one, updated on 20230711.
When I ran it with the default version in "tool/ncbi-taxonomy_20180226.tar.gz", it reported this:
I thought the default version might not fit the newer taxids in my data,
so I wanted to update the taxdump. However, it then reported this:
I checked the newer version of the taxdump, and the file "taxidlineage.dmp" does exist.
I'm not sure why.