CAMI-challenge / CAMISIM

CAMISIM: Simulating metagenomes and microbial communities
https://data.cami-challenge.org/participate
Apache License 2.0

Questions related to metadata #43

Closed kmin940 closed 6 years ago

kmin940 commented 6 years ago

Hi, I'm trying to simulate a dataset using CAMISIM, but since I am relatively new to metagenomics, I am unclear about some of the terms. My questions are:

1) I want to use the genome with NCBI accession GCF_000006825.1, so I visited ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/006/825/GCF_000006825.1_ASM682v1 From this site, do I only need to download GCF_000006825.1_ASM682v1_genomic.fna.gz (genome sequence) and GCF_000006825.1_ASM682v1_genomic.gff.gz (genome annotation), and add their paths to id_to_genome and id_to_gff respectively?

2) I'm not sure what the four metadata fields mean, i.e. Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category. What is the genome ID for GCF_000006825.1, and where can I find it? What is tOTU for GCF_000006825.1, and where can I find it? I found from https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=272843 that the taxonomy identifier for GCF_000006825.1 is 272843. What should I put for tnovelty_category?

It would also be great if you could provide another concrete example genome with the corresponding genome_ID\tOTU\tNCBI_ID\tnovelty_category values. Thank you very much for your help.

AlphaSquad commented 6 years ago

Hi, let me try to answer your questions:

  1. The only thing you really need to download is the genome sequence (the *.fna.gz file). The gff file is only required if you want to artificially add strains to your data set; if you don't want to do that, you can leave the id_to_gff field empty. The id_to_genome_file field expects a tab-separated file containing a genome ID and the path to that genome; if you only want to use that single genome, the file would, for example, look like this: GCF_000006825\t/path/to/GCF000006825.fna The '\t' in this string, as well as in the metadata header you wrote, stands for a tab character: these are tab-separated files (.tsv), in which the individual fields are separated by a single tab.
  2. The genome ID is just a name you give to your genome. You can choose GCF_000006825, for example; you just need to make sure that the IDs are consistent between the metadata and the genome_to_id files. An OTU (Operational Taxonomic Unit) is a label used to group related genomes. If you have two genomes which are closely related to each other, they should get the same OTU. This can be any string (or number). If you don't care about genomic relationships, you can give every genome its own OTU (e.g. its genome ID) or give every genome the same OTU (e.g. 0). The novelty category describes how novel your genome is, for the case where you do not use published genomes for simulating metagenomes: if you are using a genome that was classified as belonging to a genus which has not been described before, the novelty category would be "new_genus". Since you are using database genomes, I suggest setting this to "known_strain" for every genome. The NCBI ID you found looks correct to me. All the other fields have to be set by you, so the file might look like this:
    genome_ID\tOTU\tNCBI_ID\tnovelty_category
    GCF_000006825\t0\t272843\tknown_strain
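
A minimal Python sketch of writing the two files described above with real tab characters, in case an editor would otherwise insert spaces. The function names, the output file names, and the temporary output directory are illustrative assumptions and not part of CAMISIM; the FASTA path is the placeholder from the example above.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, rows):
    """Write rows to path with exactly one tab between fields."""
    with open(path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def write_inputs(out_dir, genome_id, fasta_path, ncbi_id):
    """Write a genome-ID-to-path file and a metadata file (hypothetical helper)."""
    out_dir = Path(out_dir)
    genome_map = out_dir / "genome_to_id.tsv"
    metadata = out_dir / "metadata.tsv"
    # One line per genome: <genome ID><TAB><path to FASTA file>
    write_tsv(genome_map, [[genome_id, fasta_path]])
    # Header row plus one row per genome; the genome ID must match the one above.
    write_tsv(metadata, [
        ["genome_ID", "OTU", "NCBI_ID", "novelty_category"],
        [genome_id, "0", str(ncbi_id), "known_strain"],
    ])
    return genome_map, metadata


# Example values taken from the answer above; the directory is a throwaway one.
genome_map, metadata = write_inputs(
    tempfile.mkdtemp(), "GCF_000006825", "/path/to/GCF000006825.fna", 272843
)
print(metadata.read_text())
```

The resulting paths would then go into the id_to_genome_file and metadata fields of the configuration file.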

I hope this helps. If there is anything more, feel free to comment or write a mail.

kmin940 commented 6 years ago

Thank you very much for your detailed answer. I've tried to run metagenome simulation, but got an error. Here is my configuration file located at /home/mathed/CAMI:

    seed=1
    phase=0
    max_processors=4
    dataset_id=sample_1
    output_directory=/home/mathed/CAMI
    temp_directory=/home/mathed
    gsa=False
    pooled_gsa=False
    anonymous=True
    compress=1
    readsim=/home/mathed/CAMISIM/tools/art_illumina-2.3.6/art_illumina
    error_profiles=/home/mathed/CAMISIM/tools/art_illumina-2.3.6/profiles
    samtools=/home/mathed/CAMISIM/tools/samtools-1.3/samtools
    profile=mbarc
    size=1
    type=art
    fragments_size_mean=100
    fragment_size_standard_deviation=5
    ncbi_taxdump=/home/mathed/CAMISIM/tools/ncbi-taxonomy_20170222.tar.gz
    strain_simulation_template=/home/mathed/CAMISIM/scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
    number_of_samples=1
    metadata=/home/mathed/CAMI/metadata
    id_to_genome_file=/home/mathed/CAMI/genome_to_id.tsv
    id_to_gff_file=/home/mathed/CAMI/id_to_gff.tsv
    genomes_total=9
    genomes_real=5
    max_strains_per_otu=5
    ratio=1
    mode=replicates
    log_mu=1
    log_sigma=2
    gauss_mu=1
    gauss_sigma=2
    view=True

This is my genome_to_id file with path /home/mathed/CAMI/genome_to_id.tsv:

    GCF_000006825 /home/mathed/Strains/GCF_000006825.1_ASM682v1_genomic.fna
    GCF_000009545 /home/mathed/Strains/GCF_000009545.1_ASM954v1_genomic.fna
    GCF_000255915 /home/mathed/Strains/GCF_000255915.1_ASM25591v1_genomic.fna
    GCF_000259545 /home/mathed/Strains/GCF_000259545.1_ASM25954v1_genomic.fna
    GCF_000975325 /home/mathed/Strains/GCF_000975325.1_ASM97532v1_genomic.fna

This is my id_to_gff_file with path /home/mathed/CAMI/id_to_gff.tsv:

    GCF_000006825 /home/mathed/Strains/GCF_000006825.1_ASM682v1_genomic.gff
    GCF_000009545 /home/mathed/Strains/GCF_000009545.1_ASM954v1_genomic.gff
    GCF_000255915 /home/mathed/Strains/GCF_000255915.1_ASM25591v1_genomic.gff
    GCF_000259545 /home/mathed/Strains/GCF_000259545.1_ASM25954v1_genomic.gff
    GCF_000975325 /home/mathed/Strains/GCF_000975325.1_ASM97532v1_genomic.gff

This is my metadata with path /home/mathed/CAMI/metadata:

    genome_ID OTU NCBI_ID novelty_category
    GCF_000006825 747 272843 known_strain
    GCF_000009545 747 218495 known_strain
    GCF_000255915 747 1132496 known_strain
    GCF_000259545 1349 584721 known_strain
    GCF_000975325 1349 1450183 known_strain

And this is the command and error message:

    $ python /home/mathed/CAMISIM/metagenomesimulation.py /home/mathed/CAMI/configuration
    ERROR: /home/mathed/CAMI/configuration

What have I done wrong?

2. Also, can I generate paired-end reads? If so, how?
3. Can I get access to the coverage that the read simulator generates for each strain in each sample?
4. What does community mean? What is the difference between a community and a sample?
5. In the configuration file, I have dataset_id=sample_1 and number_of_samples=1. What should I put for dataset_id if I have number_of_samples=2? Do I put dataset_id=sample_1,sample_2?

Thank you very much.
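
The thread does not establish what caused this error. As a rough pre-flight check of the points raised in the earlier answer (fields separated by tabs, genome IDs consistent between the metadata and genome_to_id files), something like the following Python sketch could help rule out the input files. The function names are hypothetical and not part of CAMISIM, and the sketch assumes the metadata header is exactly the four columns shown earlier.

```python
import csv
import os

EXPECTED_HEADER = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def read_tsv(path):
    """Return the non-empty rows of a file, split on tab characters only."""
    with open(path, newline="") as handle:
        return [row for row in csv.reader(handle, delimiter="\t") if row]


def check_inputs(metadata_path, genome_map_path):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    metadata = read_tsv(metadata_path)
    genome_map = read_tsv(genome_map_path)
    if not metadata:
        return ["metadata file is empty"]

    header, records = metadata[0], metadata[1:]
    if header != EXPECTED_HEADER:
        problems.append(f"metadata header is {header!r} (spaces instead of tabs?)")
    for row in records:
        if len(row) != 4:
            problems.append(f"metadata row has {len(row)} field(s): {row!r}")

    for row in genome_map:
        if len(row) != 2:
            problems.append(f"genome_to_id row has {len(row)} field(s): {row!r}")
        elif not os.path.isfile(row[1]):
            problems.append(f"genome file not found: {row[1]}")

    # IDs must match between the two files.
    metadata_ids = {row[0] for row in records}
    genome_ids = {row[0] for row in genome_map}
    if metadata_ids != genome_ids:
        problems.append(f"IDs differ between files: {sorted(metadata_ids ^ genome_ids)}")
    return problems
```

Calling `check_inputs("/home/mathed/CAMI/metadata", "/home/mathed/CAMI/genome_to_id.tsv")` would list any such problems. An empty result only means these particular checks passed, not that the configuration itself is valid.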

kmin940 commented 6 years ago

Thank you very much for your detailed answer. I have several more questions; I'm emailing you so that I can send image files. I have also posted them in the issue.

AlphaSquad commented 6 years ago

Hi, unfortunately this is quite hard to read (Markdown does not handle tabs very well). Since these questions are also not really "code-related", could you send them to me by email (adrian.fritz at helmholtz-hzi.de) so we can discuss them in detail there?

kmin940 commented 6 years ago

I've sent the mail. Can you take a look at it? Thank you.