Understanding the output of simulator.py

huzuner commented 3 years ago

Hello,

I have a problem with the output generated by simulator.py for genome option. The script results in two fastq outputs which are so called aligned and unaligned fastq files (also the associated error profiles). I have read the manual and also your paper and I could not find a sufficient clarification and interpretation of results in terms of when to use which. I have two questions regarding this. I think it would make more sense for me to use the aligned reads because I think that one has the error model applied on it. In the following, when I set the number of reads with -n option, the number of reads is only satisfied with unaligned output. For example, when I set it to 100000, unaligned fastq file has ~98000 reads (which is OK) but the aligned has ~1700. Is this an expected finding or is there something wrong with this? I have tried different values for different samples and the number of reads for aligned fastq is never close to what I expect. I haven't seen anyone mentioning this issue among your issues and I cannot think of why I have this number of reads with the aligned reads. And am I right to use the aligned output instead of the unaligned one?

Thank you very much in advance, Hamdiye

huzuner commented 3 years ago

Just an update about the situation: So it looks like this problem only persists with bacterial read simulation and not the human read simulation. I think it is because I was using -dna_type to linear for bacterial read simulation (although I know that I should select circular). The reason I was doing this was because I was getting the error Do not choose circular if there is more than one chromosome in the genome!. I think all my problem is caused by this. However I have no idea how to fix this error. Could you please suggest a solution to my problem? Any help would be appreciated. Thanks a lot in advance!

Best, Hamdiye

kmnip commented 3 years ago

The proportion of unaligned reads (i.e. 98000/100000 = 98%) seems rather high in my opinion.

Did you use the pre-trained model for simulation? Or did you characterize your own nanopore sequencing data?

The unaligned reads are meant for representing garbage reads from the basecallers. The unaligned and aligned reads used to be outputted within the same file. We decided to split them into separate files to give more flexibility for the users. Depending on what you do with the simulated reads, you can ignore the unaligned reads.

huzuner commented 3 years ago

There is no problem with the number of reads produced for unaligned reads, although I still don't know what to do with them. I think I need to use aligned reads (because I think that file is meant for the actual goal of the generation of nanopore long read samples). However, how is the number of reads so low (~1700)?

I am training my own models with the commands:

python {params.nanosim}/read_analysis.py genome -i {input.read} -rg {input.r} --num_threads {threads} -o results/nanosim_train/{bac_ref}/{bac_ref}

I am simulating reads with the commands:

simulator.py genome -rg ../Simulation/resources/bac_refs/GCF_000622225.1_ASM62222v1_genomic.fna -c ../Simulation/results/nanosim_train/GCF_000622225.1_ASM62222v1/GCF_000622225.1_ASM62222v1 -b guppy --num_threads 2 --fastq -o test2 -dna_type linear -n 100000

Changing the-dna_type to "circular" results in the error message Do not choose circular if there is more than one chromosome in the genome! . I surely know that I should select the circular but the program does not just allow me to do this. And when I select the linear, there is no error, instead it gives a smaller number of reads for aligned reads than expected.

SaberHQ commented 3 years ago

Hello @huzuner ,

As @kmnip clearly explained and as you correctly mentioned, you should usually use the "aligned" reads for your downstream analysis. The "aligned" reads contain the correct error rates, read length distribution, and also follows the expression levels (in case of transcriptome simulation).

The first step in characterization phase is to align reads into reference genome. NanoSim then uses that alignment file to model the error rate, read length distribution, percent of reads being aligned and bunch of other metrics. As you noted, the percent of aligned reads are so low in your case and it means that most of the reads are not being aligned to the reference genome you used in the training step.

It might be due to several reasons. one might be the circular/linear reference genome selection. Perhaps the trick @molleraj used in the following issue might help you. By concatenating all amplicons into one sequence, the simulation was successful. Please try that approach and let us know if that helps.

https://github.com/bcgsc/NanoSim/issues/112#issuecomment-805194843

I have a question for you. Are you using the same reference genome in both characterization and simulation steps? Not saying that you have to, but I am just trying to understand the issue in your case. How many chromosomes are there in your genome?

We are here to help you get the most out of NanoSim pipeline. Let us know if you have more questions.

cheny19 commented 3 years ago

Hi @huzuner,

The aligned vs unaligned proportion seems odd to me. Could you show me the content in training_reads_alignment_rate? This file tells the alignment ratio of the training reads.

As for the linear / circular issue, I guess it's because your reference genome fasta file contains multiple entries for both the genome and plasmids. I'd suggest you use only the entry for genome for your simulation, if you do not need reads from plasmids. If you need both, you can simulate them separately.

Hope that helps.

Chen

huzuner commented 3 years ago

Hello,

Thank you for your comments.

There was definitely a problem with the characterization step that I do for my bacterial genomes of choice. I have generated models for each of my genomes but the simulation with these models were unfortunately unsuccessful because I have never gotten valid fastq files that contain desired number of reads. About the linear / circular issue, as @cheny19 said, I have used only one circular chromosome and thrown the rest of the plasmids for both the characterization and simulation for each genome. Although I did not have any errors, the resulting reads were not again what I wanted. And to answer to @SaberHQ's question, yes, all that time I was using the same reference genome in both characterization and simulation steps.

However, interestingly, when I wanted to try the model that I have created with E.coli genome as for the training model for the simulation of my other bacterial genomes, it worked! I could generate fastq files with E.coli as the training model and the reference fasta for my bacteria of interest (but of course they should have contained only one circular chromosome for each bacteria). Kraken2 confirmed these simulated reads and it quantified and identified almost exact amount of reads in my samples. It was very interesting. Perhaps you can tell me how it resulted in this way. As a side note, I have been using an E.coli sample as input read for the training. Maybe this would explain how I got a successful model. However, I haven't seen any information on README or the paper to state to avoid using input reads coming from a different organism in the training.

cheny19 commented 3 years ago

I see where the problem is. You need to train the model with ONT reads and reference genome. When we talk about reference genome, we mean the source genome where the reads are supposed to come from. This is because NanoSim relies on alignment to determine many features of the reads (read lengths, alignment ratio, etc). So if you use a different genome as reference, of course the number of aligned reads is very small. When you simulate, however, you should just use any reference genomes as you wish the reads to be from.

A side note is, it seems you want to simulate multiple bacterial genomes. You can also try our metagenome mode, which can simulate circular genomes when there are multiple entries in one FASTA file.

huzuner commented 3 years ago

So, in the end does it mean that bacterial genome simulation using training data belonging to another organism (which is E. coli in this case) is completely fine to simulate samples? Do you think that this is completely fine?

About the metagenome mode, I have multiple bacterial genomes but it is important for my design to exactly know the number of reads that I have in my samples because in the following I will mix them and then quantify them using tools like kraken2 and so on.

kmnip commented 3 years ago

Training with a different organism of interest is fine because the characterization stage in NanoSim is meant for learning the features of your sequencing platform. For example, if your reads were from E. coli, then you must use a E.coli reference genome during training. After that, you can use this trained model to simulate reads for another organism.

For metagenome simulation, you can specify the number of reads and abundance levels with the -a option. See the README on "sample abundance file for metagenome_simulation": https://github.com/bcgsc/NanoSim

huzuner commented 3 years ago

Thanks a lot for the detailed explanation!

Best, Hamdiye

bcgsc / NanoSim

Understanding the output of simulator.py #113