arshajii commented 7 years ago

Is it possible to allow smaller simulations (i.e. smaller -x)? At the moment, I receive the message

The value of -x should be set between 400 and 800

I have tried using -o (with -x 1 and -x 5), but the program seems to hang:

...
Tue Jan 10 13:06:35 2017: DWGSIM round 0 thread 3 end
Tue Jan 10 13:06:35 2017: cat sim.dwgsim.0.3.12.fastq >> sim.dwgsim.0.12.fastq
[dwgsim_core] 187500
[dwgsim_core] Complete!
Tue Jan 10 13:06:53 2017: DWGSIM round 1 thread 1 end
[dwgsim_core] 187500
[dwgsim_core] Complete!
Tue Jan 10 13:06:53 2017: DWGSIM round 1 thread 2 end
[dwgsim_core] 187500
[dwgsim_core] Complete!
Tue Jan 10 13:06:58 2017: DWGSIM round 1 thread 0 end
Tue Jan 10 13:06:58 2017: cat sim.dwgsim.1.1.12.fastq >> sim.dwgsim.1.12.fastq
Tue Jan 10 13:06:58 2017: cat sim.dwgsim.1.2.12.fastq >> sim.dwgsim.1.12.fastq
Tue Jan 10 13:06:58 2017: cat sim.dwgsim.1.3.12.fastq >> sim.dwgsim.1.12.fastq
Tue Jan 10 13:06:58 2017: Simulate reads start
Tue Jan 10 13:06:58 2017: Load barcodes start
Tue Jan 10 13:07:00 2017: Load barcodes end
Tue Jan 10 13:07:00 2017: readPairsPerMolecule: 0
Tue Jan 10 13:07:00 2017: Simulating on haplotype: 0
Tue Jan 10 13:07:00 2017: Load read positions haplotype 0
Tue Jan 10 13:07:09 2017: 0 reads failed being loaded.
Tue Jan 10 13:07:09 2017: Exporting sim.0.fp
Tue Jan 10 13:08:35 2017: Exported sim.0.fp
Tue Jan 10 13:08:35 2017: readsCountDown: 500000   (stuck here)

My reference is hg19.

aquaskyline commented 7 years ago

Hi, could you please list all the parameters you used? Thanks.

arshajii commented 7 years ago

I used just -r hg19.fa -p sim -x 1 -o.

aquaskyline commented 7 years ago

I see. That happens because you simulated too few reads, so the simulator keeps sweeping through the 3G-slot array searching for usable reads, as required by a formula based on three other parameters: -f, -t and -m. With this few reads, please reduce -f, -t and -m to their lowest possible values.

arshajii commented 7 years ago

Thanks for your help. I tried -x 1 -m 4 -t 3 -o and it seems to have worked. Just to clarify, the first 16bp of the R1 FASTQs are the barcode, right?

arshajii commented 7 years ago

The ~~two~~ one issue I'm facing now is:

There seems to have been only one read pair generated for each barcode (unless the first 16bp of the first mate is not the barcode).
~~The coverage seems skewed; for instance, >50% of reads are from chr1.~~ (edit: I realized reads from the same chromosome are not actually grouped in the outputted FASTQs)

aquaskyline commented 7 years ago

Re: "first 16bp of the R1 FASTQs are the barcode, right?" Correct.
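To check how many read pairs each barcode received, one option is to tally the first 16 bases of every R1 sequence line. The following is only a minimal sketch, not part of LRSIM: the function names and the in-memory demo records are hypothetical, and it assumes a plain four-line-per-record FASTQ in which each R1 record stands for one read pair.

```python
from collections import Counter
from itertools import islice

BARCODE_LEN = 16  # per the thread: the first 16 bp of each R1 read are the barcode


def read_pairs_per_barcode(fastq_lines, barcode_len=BARCODE_LEN):
    """Count R1 records per barcode (assumes 4 lines per FASTQ record)."""
    counts = Counter()
    # Sequence lines are the 2nd line of each record: indices 1, 5, 9, ...
    for seq in islice(fastq_lines, 1, None, 4):
        counts[seq.strip()[:barcode_len]] += 1
    return counts


def summarize(counts):
    """Map k -> number of barcodes that have exactly k read pairs."""
    return Counter(counts.values())


# Hypothetical demo records; real input would be the lines of an R1 FASTQ file.
demo_records = [
    "@r1/1", "AAAACCCCGGGGTTTT" + "ACGTAC", "+", "I" * 22,
    "@r2/1", "AAAACCCCGGGGTTTT" + "TTGACA", "+", "I" * 22,
    "@r3/1", "TTTTGGGGCCCCAAAA" + "GGATCC", "+", "I" * 22,
]
demo_counts = read_pairs_per_barcode(demo_records)
demo_summary = summarize(demo_counts)
print(dict(demo_counts))
print(dict(demo_summary))
```

For a real run, pass an open file handle instead of the demo list (for gzipped output, `gzip.open(path, "rt")` from the standard library). A summary dominated by `1` would match the "one read pair per barcode" observation above.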

aquaskyline commented 7 years ago

You have only generated 1M reads, which means the simulator has to sweep about 3000 bp on average to reach the next available read randomly generated from the human genome. You can change the code from

#define CHK_PREV_SLOT_LIMIT (10*AMP_ON_SLOTS)

to

#define CHK_PREV_SLOT_LIMIT (4000*AMP_ON_SLOTS)

in order to satisfy your parameters; the side effect is that the program will run much slower.

Note that with the default value of 600M read pairs for a human genome, which corresponds to ~40-fold coverage, the original value of 10 is safe and efficient.
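The arithmetic behind the sweep distance can be sketched as follows. This is a back-of-the-envelope illustration only: it assumes a genome of roughly 3 billion positions (the "3G slots" mentioned earlier), read start positions spread uniformly, and one start position per read pair. The function name is hypothetical, and the sketch does not model how LRSIM actually applies CHK_PREV_SLOT_LIMIT.

```python
GENOME_SIZE_BP = 3_000_000_000  # assumption: ~3G slots for a human genome


def mean_spacing_bp(genome_size_bp, n_reads):
    """Average distance between consecutive read starts under uniform placement."""
    return genome_size_bp / n_reads


small_run = mean_spacing_bp(GENOME_SIZE_BP, 1_000_000)      # -x 1
default_run = mean_spacing_bp(GENOME_SIZE_BP, 600_000_000)  # default 600M

print(small_run)    # 3000.0
print(default_run)  # 5.0
```

Under these assumptions the small run leaves reads about 600 times sparser than the default, which fits the advice that a search limit tuned for the default run is too short for a -x 1 simulation.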

aquaskyline commented 7 years ago

A suggestion for testing the simulator without working on huge genomes is to use a smaller genome such as S. cerevisiae or Arabidopsis. 10x's Supernova assembler takes 2 hours to assemble 40x of Arabidopsis data.

arshajii commented 7 years ago

Changing the CHK_PREV_SLOT_LIMIT parameter seems to have worked. Thanks again for your help!

aquaskyline / LRSIM

Allow smaller simulations #2
