LooseLab / Icarust

A fully featured MinKNOW simulator for testing read until experiments.
Mozilla Public License 2.0

Inverting the simulated genomes #25

Open SimiliSerpent opened 3 months ago

SimiliSerpent commented 3 months ago

Hi Rory,

I am simulating SARS-CoV-2 diluted in a bacterial environment. My configuration file looks as follows:

output_path = "$SIM_DIR/pod5_files"
target_yield = $TARGET_YIELD
pore_type = "R10"
nucleotide_type = "DNA"

[parameters]
sample_name = "test"
experiment_name = "sim_$SIMULATION_ID"
flowcell_name = "FAQ1234"
experiment_duration_set = 10240000
device_id = "Bantersaurus"
position = "FenceSitter"
sample_rate = 5000

[[sample]]
name = "NC_045512"
input_genome = "$SIM_DIR/ref/${SIM_VIRUS_REF}.fasta"
mean_read_length = $SIM_VIRUS_LEN
weight = $SIM_VIRUS_W
amplicon = false

[[sample]]
name = "U00096_3"
input_genome = "$SIM_DIR/ref/${SIM_NOISE_REF}.fasta"
mean_read_length = $SIM_NOISE_LEN
weight = $SIM_NOISE_W
amplicon = false

For instance, let's say the weight is 1 for the virus and 150 for the bacterium. Sometimes, however, Icarust inverts the two weights. I notice it because I use Readfish to selectively filter out all DNA that is not SARS-CoV-2, and in those runs almost no reads are filtered. When I check after the run, I find that Icarust did indeed generate only 1/151 bacterial reads and 150/151 viral reads.

If I restart the simulation without changing anything, everything works fine! So it is not that big a deal (I have to monitor the start of each simulation and restart if necessary). But it is a bit worrying and definitely unexpected behavior. It happens randomly, and I have seen the issue in different simulation environments (different lab clusters).
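
For anyone who wants to automate the check described above, here is a minimal sketch. It assumes the per-sample read counts are already available (for example from mapping the first reads of a run) and that there are exactly two samples; the function names `fractions` and `weights_look_swapped` are hypothetical and are not part of Icarust or Readfish.

```python
def fractions(values):
    """Return each entry's share of the total, e.g. weight / sum of weights."""
    total = sum(values.values())
    if total == 0:
        raise ValueError("cannot compute fractions of an empty total")
    return {name: value / total for name, value in values.items()}


def weights_look_swapped(weights, read_counts):
    """Return True if the observed read fractions are closer to the
    swapped weights than to the configured ones (two samples only)."""
    if len(weights) != 2 or set(weights) != set(read_counts):
        raise ValueError("expected the same two sample names in both inputs")
    first, second = weights
    expected = fractions(weights)
    observed = fractions(read_counts)
    distance_configured = abs(observed[first] - expected[first])
    distance_swapped = abs(observed[first] - expected[second])
    return distance_swapped < distance_configured


# Weights from the example above: 1 for the virus, 150 for the bacterium.
weights = {"NC_045512": 1, "U00096_3": 150}

# Hypothetical read counts, made up for illustration only.
normal_run = {"NC_045512": 10, "U00096_3": 1500}
inverted_run = {"NC_045512": 1500, "U00096_3": 10}

print(fractions(weights))
print(weights_look_swapped(weights, normal_run))
print(weights_look_swapped(weights, inverted_run))
```

With weights of 1 and 150, the expected fractions are 1/151 for the virus and 150/151 for the bacterium, so a run whose viral fraction is near 150/151 would be flagged early and could be restarted.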

Do you have any clues why that is? Does chance intervene at some point in the choice of the weights?

I hope you are doing well, and thank you for your help. Sincerely, Ben

Adoni5 commented 3 months ago

Hey @SimiliSerpent, sorry for the slow reply, I've just been on holiday for two weeks! This shouldn't be happening, no doubt. It definitely seems like they're being switched in the code. I assume they are hardcoded in the actual TOML file?

Rory

SimiliSerpent commented 1 month ago

No worries, I was busy watching the Olympics and Paralympics! Well, it is my turn to say sorry for the delay. Yes, the simulated genomes are hardcoded in the TOML file. I will look deeper into the source code if I get a chance, and post here again if I find anything. If any other Icarust user encounters the same issue, please let me know!

Best, Ben