RAHenriksen / NGSNGS

NGSNGS: Next generation simulator for next generation sequencing data
46 stars 4 forks source link

Option -c and -r #33

Closed nuzla closed 1 year ago

nuzla commented 1 year ago

I noticed some issue when using the option -c instead of -r

The below command produces reads with option -r

ngsngs -i Sample_3.fasta -r 600000 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read

    -> ngsngs version: 9a6331b (htslib: 1.18-8-g6806b2d) build(Aug 16 2023 05:03:57)
    -> Mycommmand: ngsngs -i Sample_3.fasta -r 600000 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read 
    -> The read cycle length is either provided (-cl): 0 or inferred from the quality profile dimension (-q1): 0
    -> Number of contigs/scaffolds/chromosomes in file: 'Sample_3.fasta': 1
    -> Seed used: 1692175
    -> Number of sampling threads used (-t): 1 and number of compression threads (-t2): 1
    -> Number of simulated reads: 600000 or coverage: 0.000000
    -> Default PCR duplicate value 1
    -> Number of nref 1 in file: 'Sample_3.fasta'
    -> Allocated memory for 1 chromosomes/contigs/scaffolds from input reference genome
    -> Chromosome name first 20:30000000-32000000 and length 2004715 and full length 2004715
    -> File output name is Sample_3.read_R1.fq.gz
    -> Thread 0 produced 600000 reads with a current total of 600000
    -> Number of reads generated by thread 0 is 600000 
    [ALL done] cpu-time used =  49.30 sec
    [ALL done] walltime used =  49.00 sec

But with -c option command produces empty read files.

ngsngs -i Sample_3.fasta -c 30 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read

    -> ngsngs version: 9a6331b (htslib: 1.18-8-g6806b2d) build(Aug 16 2023 05:03:57)
    -> Mycommmand: ngsngs -i Sample_3.fasta -c 30 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read 
    -> The read cycle length is either provided (-cl): 0 or inferred from the quality profile dimension (-q1): 0
    -> Number of contigs/scaffolds/chromosomes in file: 'Sample_3.fasta': 1
    -> Seed used: 1692175
    -> Number of sampling threads used (-t): 1 and number of compression threads (-t2): 1
    -> Number of simulated reads: 0 or coverage: 30.000000
    -> Default PCR duplicate value 1
    -> Number of nref 1 in file: 'Sample_3.fasta'
    -> Allocated memory for 1 chromosomes/contigs/scaffolds from input reference genome
    -> Chromosome name first 20:30000000-32000000 and length 2004715 and full length 2004715
    -> File output name is Sample_3.read_R1.fq.gz
    -> Number of reads generated by thread 0 is 0 
    [ALL done] cpu-time used =  0.02 sec
    [ALL done] walltime used =  0.00 sec

Head of my seqeunce file

head Sample_3.fasta
>20:30000000-32000000
GATATTTGGAGCGCTTTGAGGCCTATGGTGGAAAAGGAAACACCTTCACAAAAAAACTAG
AGCAGAAGCATTCTCAGAAACGTCTTTGTGATGTGTGCATTCAACTCACAGAGTTGAACC
TTTCTTTGATAGAGCAGTTTTGAAACACTCTTTTTGTAGAATCTGCAGTTGGATATTTGG
AGCGCTTTGATGCCTATGGTGGAAAAGGAAATATCCGCACATAAAAACTAGACAGCAGCA
TTCTCAGAAACTTGTTTGTGTTGTGTGCATTCAACTCACAGAGTTGAGCTTTCCTTTGAT
TGAGCAGTTTTGAAAAAGTCTTTTTGCAAAATCTGCAAGTGGATATTTGGGGCGGTTTGA
GGCCTATGGTGTAAAAGGAAATATCTTCACATAAAAACTAGACAGAAGCATTCTCTGAAA
CTTCTTTGTGATGTGTGAATTCAACTCACAGAGTTGAACCTTTCTTTTGTAGAGCAGTTT
TGAAACTCTTTTTGTAGAATCTGTAAGTAGATATTTGGAGCGCTTTGAGGCTTATGGTGG

Am I missing some parameters?

RAHenriksen commented 1 year ago

Dear nuzla,

Thanks for your comment. No you're using the correct parameters, it was a small bug which has been fixed in the most recent commit (d6640c3). So now if I am running the following commands i do get sequencing reads, with small examples below.

./ngsngs -i Test_Examples/Mycobacterium_1leprae.fa.gz -c 0.5 -l 100 -seq SE -qs 30 -f fq.gz -o MycoBactBamSEOut

@T0_RID42_S1_NZ_CP029543.1:477642-477741_length:100_mod0000 F0 R1 TCAGCGGCACTAAAGCCAGTGCGGTCATTCACCCTGCTTAGCTAACTCCCTGTACCAAACAACAAACACCTGAAACAAAGTTTTGAGAACGCCCTATGAG + ????????????????????????????????????????????????????????????????????????????????????????????????????

./ngsngs -i Test_Examples/Mycobacterium_leprae.fa.gz -r 100 -l 100 -seq SE -qs 30 -f fq.gz -o MycoBactBamSEOut

@T0_RID46_S0_NZ_CP029543.1:2807446-2807545_length:100_mod0000 F0 R1 TCGTAAGCCAACAGTAGCAGAGCGACCACCAACGCCACCACGGCCGAGGTGAACGCCCAGTCGGAGTAGCACGCCAGGCCGATGTTGACGTGCAGAGTGT + ????????????????????????????????????????????????????????????????????????????????????????????????????

./ngsngs -i Test_Examples/Mycobacterium_leprae.fa.gz -c 0.5 -l 100 -seq PE -qs 30 -f fq.gz -o MycoBactBamSEOut

@T0_RID24_S1_NZ_CP029543.1:688855-688954_length:100_mod0000 F0 R1 CGCGATAGGACCGAAACTCCCGGAGTTGTGGGGCGGGTCGGCCGATCTAGCGGGCAGCAACAACACCACCATAAAAGACGTTGATTCCTTTGGGCCGCCG + ????????????????????????????????????????????????????????????????????????????????????????????????????

@T0_RID24_S1_NZ_CP029543.1:688855-688954_length:100_mod0000 F0 R2 CGGCGGCCCAAAGGAATCAACGTCTTTTATGGTGGTGTTGTTGCTGCCCGCTAGATCGGCCGACCCGCCCCACAACTCCGGGAGTTTCGGTCCTATCGCG + ????????????????????????????????????????????????????????????????????????????????????????????????????

./ngsngs -i Test_Examples/Mycobacterium_leprae.fa.gz -r 100 -l 100 -seq PE -qs 30 -f fq.gz -o MycoBactBamSEOut

@T0_RID10_S0_NZ_CP029543.1:601002-601101_length:100_mod0001 F0 R1 GCGCTGATCTCGCAGGCGACCGGGCACCGGTTCGAGGCACTGTCGGCGATGTCGGCGGGTGTCAAAGACGTTCGAGCTGTTTTGAAGATTGCACGTAGTG + ????????????????????????????????????????????????????????????????????????????????????????????????????

@T0_RID10_S0_NZ_CP029543.1:601002-601101_length:100_mod0001 F0 R2 CACTACGTGCAATCTTCAAAACAGCTCGAACGTCTTTGACACCCGCCGACTACGCCGACAGTGCCTCGAACCGGTGCCCGGTCGCCTGCGAGATCAGCGC + ????????????????????????????????????????????????????????????????????????????????????????????????????

Let me know if you encounter other issues.

Best, Rasmus

nuzla commented 1 year ago

Thanks for the update. Now it works.

Could you give me some more clarification? I need 30x coverage with 100-length reads. Can you verify these read counts?

Are the -c 30 -l 100 options correct for my requirement?

Paired reads -> 300707

ngsngs -i Sample_3.fasta -c 30 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read

    -> ngsngs version: d6640c3 (htslib: 1.18-8-g6806b2d) build(Aug 17 2023 22:06:12)
    -> Mycommmand: ngsngs -i Sample_3.fasta -c 30 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read 
    -> The read cycle length is either provided (-cl): 0 or inferred from the quality profile dimension (-q1): 0
    -> Number of contigs/scaffolds/chromosomes in file: 'Sample_3.fasta': 1
    -> Seed used: 1692266
    -> Number of sampling threads used (-t): 1 and number of compression threads (-t2): 1
    -> Number of simulated reads: 300707 or coverage: 30.000000
    -> Default PCR duplicate value 1
    -> Number of nref 1 in file: 'Sample_3.fasta'
    -> Allocated memory for 1 chromosomes/contigs/scaffolds from input reference genome
    -> Chromosome name first 20:30000000-32000000 and length 2004715 and full length 2004715
    -> File output name is Sample_3.read_R1.fq.gz
    -> Thread 0 produced 300707 reads with a current total of 300707
    -> Number of reads generated by thread 0 is 300707 
    [ALL done] cpu-time used =  21.42 sec
    [ALL done] walltime used =  21.00 sec

Single Ended -> 601414

ngsngs -i Sample_3.fasta -c 30 -l 100 -seq SE -qs 30 -f fq.gz -o Sample_3.read

    -> ngsngs version: d6640c3 (htslib: 1.18-8-g6806b2d) build(Aug 17 2023 22:06:12)
    -> Mycommmand: ngsngs -i Sample_3.fasta -c 30 -l 100 -seq SE -qs 30 -f fq.gz -o Sample_3.read 
    -> The read cycle length is either provided (-cl): 0 or inferred from the quality profile dimension (-q1): 0
    -> Number of contigs/scaffolds/chromosomes in file: 'Sample_3.fasta': 1
    -> Seed used: 1692267
    -> Number of sampling threads used (-t): 1 and number of compression threads (-t2): 1
    -> Number of simulated reads: 601414 or coverage: 30.000000
    -> Default PCR duplicate value 1
    -> Number of nref 1 in file: 'Sample_3.fasta'
    -> Allocated memory for 1 chromosomes/contigs/scaffolds from input reference genome
    -> Chromosome name first 20:30000000-32000000 and length 2004715 and full length 2004715
    -> File output name is Sample_3.read.fq.gz
    -> Thread 0 produced 601414 reads with a current total of 601414
    -> Number of reads generated by thread 0 is 601414 
    [ALL done] cpu-time used =  21.97 sec
    [ALL done] walltime used =  22.00 sec
RAHenriksen commented 1 year ago

Yes your parameters are correct.

When providing the desired coverage -c, the number of reads to be simulated in order to fulfill this coverage is estimated based on the reference genome length and mean read length. So with -l the mean fragment length is likewise fixed. When using the distribution -ld or file -lf the mean length will be calculated.

So for your example with a reference genome length of 2004715 and coverage of 30 you would need to coverage 200471530 nucleotides. So with a desired read length of 100 we would need (200471530)/100 = 601414.5 reads to cover this number of nucleotides.

Then for PE reads, each file will contain half. One way to verify this could be to directly save it as a sam file and determine the coverage. or align the fastq file and do the same.

./ngsngs -i Test_Examples/Mycobacterium_leprae.fa.gz -c 10 -l 100 -seq PE -qs 30 -f sam -o MycoBactBamPEOut samtools sort MycoBactBamPEOut.sam -o MycoBactBamPEOutSort.sam

samtools coverage MycoBactBamPEOutSort.sam

rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq

NZ_CP029543.1 1 3187112 318710 3186875 99.9926 9.99974 30 60

Where the meandepth in this case is 9.99974 given the input coverage of 10.

When simulating fastq file, you might experience upon alignment the coverage might not perfectly suit the input anymore due to unaligned reads or reads you might filter out before proceeding with your analysis.

nuzla commented 1 year ago

Thanks for the clarification.

<<Then for PE reads, each file will contain half. >>

However, in the below case, each output read file has 600000 reads. Shouldn't it be 300000?

ngsngs -i Sample_3.fasta -r 600000 -l 100 -seq PE -qs 30 -f fq.gz -o Sample_3.read
less Sample_3.read_1_R1.fq.gz | grep -c R1
600000

less Sample_3.read_1_R2.fq.gz | grep -c R2
600000
RAHenriksen commented 1 year ago

We made the choice to say when using -r the number of reads would either indicate 1 single-end read or 1 proper paired read, thus the R1 and R2 files will contain the same number of sequences to signify read 1 and read 2 in the proper pair.

Whereas we choose for the -c to be more in relation to how many times a given nucleotide would be covered.

I will look in the description and further elaborate.

/Rasmus