CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

"ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz" Unsupported newer version. #160

Closed NickShanyt closed 1 year ago

NickShanyt commented 1 year ago

Hi, I downloaded a newer version of the taxdump, in fact the latest one, updated on 20230711.

When I ran CAMISIM with the default version, "tool/ncbi-taxonomy_20180226.tar.gz", it reported this:

image

I thought the default taxdump might not cover the newer taxids in my data.

So I tried to update the taxdump; however, it then reported this: image

I checked the newer version of the taxdump, and the file "taxidlineage.dmp" does exist.

I'm not sure why this happens.

NickShanyt commented 1 year ago

I ran it this way:

python metagenomesimulation.py test/5-2636-2.ini

AlphaSquad commented 1 year ago

Unfortunately, CAMISIM has an odd requirement for its taxdumps: the zipped taxdump must contain a folder named NCBI. You would therefore have to unzip your taxdump, put its contents in a folder named NCBI, and then zip it again. I am sorry for the hassle; please report back if that worked.
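
The repackaging step described above could be scripted roughly as follows. This is only a sketch under assumptions: it treats the taxdump as a gzip-compressed tar archive (as the .tar.gz file name suggests), it places every regular file directly inside a top-level NCBI folder, and the function name and the file names in the usage comment are hypothetical.

```python
import tarfile
from pathlib import PurePosixPath


def repack_taxdump(src, dst, folder="NCBI"):
    """Copy the regular files of a .tar.gz taxdump into a new .tar.gz under `folder`/."""
    with tarfile.open(src, "r:gz") as tin, tarfile.open(dst, "w:gz") as tout:
        for member in tin.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            # Keep only the base name so each file lands directly inside `folder`.
            info = tarfile.TarInfo(name=f"{folder}/{PurePosixPath(member.name).name}")
            info.size = member.size
            info.mtime = member.mtime
            info.mode = member.mode
            tout.addfile(info, tin.extractfile(member))
    return dst


# Hypothetical usage; both file names are placeholders:
# repack_taxdump("new_taxdump.tar.gz", "ncbi-taxonomy_repacked.tar.gz")
```

The ncbi_taxdump entry in the configuration file would then point at the repacked archive.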

NickShanyt commented 1 year ago

Many thanks for your quick reply. After testing, it works. Indeed, this is not easy to notice. When I compared the file contents I did find a difference, namely an extra layer of folders, but it didn't catch my attention.

Separately, I need some help with the configuration file. I tested different values for the parameters genomes_total=5, genomes_real=5, and number_of_samples=5.

Now I know that number_of_samples decides how many samples will be simulated, and genomes_total decides how many genomes will be selected randomly from metadata.tsv and genome_to_id.tsv. But I don't understand what genomes_real=5 changes. I tested it and found no difference.

AlphaSquad commented 1 year ago

CAMISIM offers the possibility to simulate strains using sgEvolver from the Mauve suite. If you set genomes_real lower than genomes_total, CAMISIM will simulate the difference as strains. If this is of no interest to you, you can just keep genomes_real = genomes_total.
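
To make the relationship between the two parameters concrete, here is a small sketch that reads them from a config fragment and computes the difference. The section name "community0", the example values, and the helper name are assumptions made for illustration; only the parameter names genomes_total and genomes_real come from this thread.

```python
import configparser

# Hypothetical config fragment; the section name "community0" is an assumption.
EXAMPLE_INI = """
[community0]
genomes_total = 30
genomes_real = 20
"""


def strains_to_simulate(ini_text, section="community0"):
    """Return genomes_total - genomes_real for one community section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    total = parser.getint(section, "genomes_total")
    real = parser.getint(section, "genomes_real")
    if real > total:
        # Assumed constraint for this sketch: real genomes cannot outnumber the total.
        raise ValueError("genomes_real must not exceed genomes_total")
    return total - real


print(strains_to_simulate(EXAMPLE_INI))  # 10
```

With genomes_real equal to genomes_total the helper returns 0, which corresponds to running without strain simulation.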

NickShanyt commented 1 year ago

I'm not sure if I understand correctly. For example, if I define genomes_real=20 and genomes_total=30, does it mean that the program will simulate 10 (30-20=10) genomes based on the 20 provided genomes? Also, when I set the parameters as mentioned above, both parameters in the config.ini file in the out folder are set to the same value, which is 30.

Should they be used in conjunction with other parameters?

AlphaSquad commented 1 year ago

Yes, in theory that is what should happen, so I am confused that it did not work. The numbers in the output config.ini are not necessarily wrong, though: after the genomes have been simulated, they are treated as regular input genomes. If you want to re-run your simulation, it would therefore be correct to use 30 for genomes_real, because the newly simulated genomes count as "real" for the next run. To find out whether genomes were simulated, you would have to check the CAMISIM log and/or whether simulated genomes appear in the metadata.tsv/genome_to_id.tsv and the genomes folder that CAMISIM creates.
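
One way to run the check suggested above is to compare the genome IDs in the input table with those in the table CAMISIM writes to its output folder. The sketch below assumes that genome_to_id.tsv is a tab-separated file without a header and with the genome ID in the first column; the helper names and the paths are hypothetical.

```python
import csv


def read_genome_ids(path):
    """Return the set of values in the first column of a tab-separated file."""
    with open(path, newline="") as handle:
        return {row[0] for row in csv.reader(handle, delimiter="\t") if row and row[0]}


def new_genome_ids(input_tsv, output_tsv):
    """Return the IDs listed in the output table but absent from the input table."""
    return sorted(read_genome_ids(output_tsv) - read_genome_ids(input_tsv))


# Hypothetical usage; both paths are placeholders:
# print(new_genome_ids("genome_to_id.tsv", "out/genome_to_id.tsv"))
```

An empty result would suggest that no additional genomes were written to the output table, which matches the situation where only the provided genomes appear in the genomes folder.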

NickShanyt commented 1 year ago

I checked the genomes folder, and the sequence files in there are all selected from my metadata.

Setting this aside for now, I have a new problem that arises when I add some new sequences as community2 ...... image

AlphaSquad commented 1 year ago

Oh wow, this whole thread is really a deep dive into functionalities of CAMISIM which are not used very often. It seems like the option to not anonymize the data set (you probably set this option to False?) is incompatible with creating multiple communities. I fear that if you want to simulate multiple communities and have non-anonymized reads, you will have to set the abundances of your second community manually. Alternatively it should work when turning anonymization on.

NickShanyt commented 1 year ago

Sorry, I'm a little late in replying. The issue I mentioned wasn't actually a problem with CAMISIM itself; I located and fixed it myself. It was caused by a "-" in the sequence ID, for which Samtools was reporting an error.

AlphaSquad commented 1 year ago

Yeah, unfortunately there are some problems with special characters in sequence IDs. I am glad you got that figured out.
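
For anyone who runs into the same sequence ID problem, a pre-processing step along these lines may help. It is a sketch, not part of CAMISIM: only "-" was reported as problematic in this thread, so the set of characters treated as safe (letters, digits, "_" and ".") is an assumption, as is the function name.

```python
import re


def sanitize_fasta_ids(lines, replacement="_"):
    """Yield FASTA lines with unsafe characters in each sequence ID replaced."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\n")
            # The ID is the text up to the first space; the description is kept as-is.
            seq_id, sep, description = header.partition(" ")
            seq_id = re.sub(r"[^A-Za-z0-9_.]", replacement, seq_id)
            yield f">{seq_id}{sep}{description}\n"
        else:
            yield line  # sequence lines pass through unchanged


# Hypothetical usage; the file names are placeholders:
# with open("genome.fasta") as src, open("genome.clean.fasta", "w") as dst:
#     dst.writelines(sanitize_fasta_ids(src))
```

Renaming can make two different IDs collide, and any file that refers to the old IDs would need the same change, so it is worth checking the result before using it as input.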