bcgsc / NanoSim

Nanopore sequence read simulator

Change read length distribution simulated reads #39

Closed FiniDG closed 5 years ago

FiniDG commented 6 years ago

I don't think it is currently possible, but I want to change some parameters of my simulated reads to test different factors. One of the parameters I really want to control is the read length (max, average, min). I tried changing some parameters in the ecdf files (_aligned_reads_ecdf, _aligned_length_ecdf and _unaligned_length_ecdf) by editing the bins and the percentages after them, but that doesn't seem to work. Changing only _aligned_reads_ecdf (what I tried first) doesn't seem to do anything at all.

Do you know how I could change the read lengths of the reads I am going to simulate? I assume the trained error rates are still sufficient for generating the reads? I guess they do not change with read length (that is only library preparation)?

harisankar991 commented 5 years ago

Any leads on this?

FiniDG commented 5 years ago

Nope, no answer from the developer yet. I tried playing with the ecdf files, but it is hard to understand how the three of them correlate with each other. You could technically train the set multiple times by changing the reads in the sam file, but that takes way too much time and you are limited by the read lengths available in the sam file.

cheny19 commented 5 years ago

You can control the max and min length of your simulated reads with the options --max_len and --min_len. In doing this, the length distribution of your reads will change as well. And yes, the error rates should be independent of read length, because the error rate is 1 - identical bases / total bases.

FiniDG commented 5 years ago

But that way I cannot make the reads longer. By default --max_len is infinity, so how long the reads actually get depends on the training data. I would like to be able to define my "own" distribution, to simulate a different library prep or "future" data.

cheny19 commented 5 years ago

I see what you are trying to do now. The ecdf file you need to edit is _aligned_length_ecdf. NanoSim starts from this file to pick the length of the aligned part, and then adds the unaligned parts to its 5' and 3' ends. The ecdf file has two columns: bins and cumulative probabilities. For example, a first row of 0-100 0.005 means there is a 0.005 chance for the aligned part of a read to fall between 0 and 100 nt. A second row of 100-200 0.01 means that if the random draw lands between 0.005 and 0.01, the length will fall between 100 and 200 nt, interpolated linearly.

In your case, if you want a larger max_len than the training data, you need to add more bins or simply increase the max length in the last row. If you add more bins, you need to lower the probabilities of the earlier bins to make room for the newly added ones. Since it is a cumulative distribution file, the probability of the last bin should always be 1. Also remember to change the max length in the header (first) row.
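As an aside, a minimal sketch of that inverse-CDF sampling idea may help readers following along; this is not NanoSim's actual code, and the final bin here is made up just to complete the example:

    import random

    # Each entry: (bin_start, bin_end, cumulative_probability), mirroring the
    # two-column layout of the _aligned_length_ecdf file described above.
    ecdf_bins = [
        (0, 100, 0.005),
        (100, 200, 0.01),
        (200, 130690, 1.0),   # hypothetical final bin; the last cumulative value must be 1
    ]

    def sample_aligned_length(bins):
        """Draw a random probability, find the bin whose cumulative value covers it,
        then interpolate linearly inside that bin."""
        p = random.random()
        prev_cum = 0.0
        for start, end, cum in bins:
            if p <= cum:
                frac = (p - prev_cum) / (cum - prev_cum)
                return int(start + frac * (end - start))
            prev_cum = cum
        return bins[-1][1]  # fall back to the maximum length

    print(sample_aligned_length(ecdf_bins))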

FiniDG commented 5 years ago

Thanks for the reply. Your explanation matches what I figured out by looking at the file myself; however, changing it to something ridiculous (like 0-100 0.99 and 100-10000 1.00) did not seem to work. (I wanted to visually check whether the changes were actually applied, hence the extreme values.) I'll have another try tomorrow and let you know how it goes. Maybe I should use a program to check the length distribution of the simulated reads themselves, rather than using weird/unrealistic bins.

Don't I need to edit the other two ecdf files with bins as well?

cheny19 commented 5 years ago

What do you mean by "did not seem to work"? Do the simulated reads have lengths outside the range you specified, or is the distribution messed up? I have never tried changing my ecdf files, so I'll also run some tests and see how it actually behaves.

If you are simulating perfect reads with the --perfect option, then you should use _aligned_reads_ecdf instead, because perfect reads are aligned end to end without the head/tail unaligned parts. I suggest you start with simulating perfect reads, since errors on reads will also alter their length distributions. Once you see your altered distribution reflected in your simulated reads, you can remove the --perfect option and simulate reads with errors.

FiniDG commented 5 years ago

I cannot get it to work as you described. I tried two different things:

  1. changing "_aligned_length_ecdf"
    bin 0-130690
    0-50    0.9900000000000000
    50-10000    0.9990000000000000
    10000-130700    1.0000000000000000

With normal parameters (it's about a 1 Mb region): python simulator.py linear -r /path/to/genome_rearranged.fasta -c /path/to/training/ -o /path/to/output/ -n 3475 resulted in:

sum = 35303931, n = 3475, ave = 10159.40, largest = 133123
N50 = 27938, n = 266
N60 = 13517, n = 448
N70 = 8496, n = 807
N80 = 6724, n = 1270
N90 = 4334, n = 1905
N100 = 45, n = 3475
  2. changing "_aligned_reads_ecdf"
    bin 0-130690
    0-50    0.9900000000000000
    50-10000    0.9990000000000000
    10000-130700    1.0000000000000000

With the --perfect parameter: python simulator.py linear -r /path/to/genome_rearranged.fasta -c /path/to/training/ -o /path/to/output/ -n 3475 --perfect

resulted in:

sum = 39454686, n = 3475, ave = 11353.87, largest = 130670
N50 = 55582, n = 216
N60 = 15535, n = 326
N70 = 8719, n = 742
N80 = 7139, n = 1240
N90 = 4894, n = 1894
N100 = 51, n = 3475

So it just ignores the numbers. Changing all ecdf files to the same bins as described above results in an error:

  File "simulator.py", line 716, in <module>
    main()
  File "simulator.py", line 708, in main
    read_profile(number, model_prefix, perfect, max_readlength, min_readlength)
  File "simulator.py", line 166, in read_profile
    unaligned_dict = read_ecdf(u_profile)
  File "simulator.py", line 77, in read_ecdf
    ratio = [float(x) for x in new[0].split('-')]
ValueError: could not convert string to float:

"_unaligned_length_ecdg" was changed to this (because tot top row seems important?):

Aligned / Unaligned ratio:  4.85685302953
bin 0-130690
0-50    0.9900000000000000
50-10000    0.9990000000000000
10000-130700    1.0000000000000000

Edit: Changing all ecdf files except "_unaligned_length_ecdf" gives no error, but still gives weird results that ignore the numbers (this was done with the --perfect parameter again):

sum = 40839607, n = 3475, ave = 11752.40, largest = 129894
N50 = 58911, n = 212
N60 = 17994, n = 319
N70 = 8737, n = 737
N80 = 7244, n = 1248
N90 = 5118, n = 1910
N100 = 53, n = 3475
cheny19 commented 5 years ago

It might be more useful if you could plot the length distribution of your simulated reads, rather than looking at the N50, N60,... numbers. But I'll do some tests and let you know.
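For anyone who wants to do that check, a small sketch follows; the input filename is a placeholder for whatever FASTA NanoSim wrote, and matplotlib is assumed to be available:

    import matplotlib.pyplot as plt

    # Collect read lengths from a FASTA file (placeholder name) and plot a histogram,
    # which is easier to judge than N50/N60 summary numbers.
    lengths = []
    current = 0
    with open("simulated_reads.fasta") as fasta:
        for line in fasta:
            if line.startswith(">"):
                if current:
                    lengths.append(current)
                current = 0
            else:
                current += len(line.strip())
        if current:
            lengths.append(current)

    plt.hist(lengths, bins=100)
    plt.xlabel("read length (nt)")
    plt.ylabel("count")
    plt.savefig("read_length_hist.png")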

FiniDG commented 5 years ago

Any leads yet? (Maybe I can also help if you have found something already.)

cheny19 commented 5 years ago

Not yet, I'll keep you updated when I have some insights.

HenrivdGeest commented 5 years ago

Thanks cheny19 and FiniDG-HAN for the kick-off. I tried making it work, and for me it seems to work. I changed two files:

geest@pc:[/tmp/nanosim]: cat ./training_mod/_aligned_reads_ecdf
bin 0-500000
0-50    0.1
50-100  0.2
100-50000   0.5
50000-500000    1
geest@pc:[/tmp/nanosim]: cat ./training_mod/_aligned_length_ecdf
bin 0-500000
0-50    0.1
50-100  0.2
100-50000   0.5
50000-500000    1

With the following simulator command, python simulator.py linear --min_len 50000 --max_len 150000 -r reference.fasta -c ./training_mod/ -n 10, I get much longer reads. The resulting simulated reads.fasta contains:

Number_of_sequences:    10
Total_sequence_length:  996830
Average_sequence_length:    99683
Std._dev._sequence_length:  19566
Minimum_sequence_length:    51241
Maximum_sequence_length:    127886

This works both with and without the --perfect parameter. However, it seems to take much longer to generate these long reads; I am not sure whether that is because it is just that complicated, or because the internal generator creates reads outside the boundaries which are discarded internally, causing much longer runtimes.
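To make that speculation concrete, here is a purely hypothetical sketch (none of these names come from NanoSim) of how a rejection loop against --min_len/--max_len would inflate runtime when the bounds cover only a small tail of the trained distribution:

    import random

    # Hypothetical rejection sampling: redraw until the length falls inside the bounds.
    # If the bounds cover only a tiny fraction of the distribution, most draws are wasted.
    def draw_length_with_bounds(draw_length, min_len, max_len):
        while True:
            length = draw_length()
            if min_len <= length <= max_len:
                return length

    # Stand-in length distribution (exponential, mean ~8 kb) just for illustration.
    lengths = [draw_length_with_bounds(lambda: random.expovariate(1 / 8000), 50000, 150000)
               for _ in range(10)]
    print(lengths)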

For me the questions left are: What does the "_first_match.hist" file do? The top row says 0-50000, whilst the last line is "920-930 1.0". Does that represent the probability of where the match starts, so somewhere in the first 1000 bp for sure?

For the file _match_markov_model I have no idea.

The file _unaligned_length_ecdf does make some sense, as it specifies the unaligned lengths for the simulated reads. However, I am not sure whether the simulated reads will show soft-clipped parts when aligned to the reference. Edit: I checked alignments of the simulated reads, and soft-clipping at the tails is only minor. I do see a few reads being unmapped, but those read names already contain *unaligned. The last file required is the _ht_ratio file. I am not certain that the longer simulated reads are unaffected by the unchanged _ht_ratio, _unaligned_length_ecdf, _match_markov_model and _first_match.hist files. cheny19, can you say something on this?

FiniDG commented 5 years ago

Thank you @HenrivdGeest I will have a look and let you know.

EDIT: At first I thought that what you did was no different from what I had tried, except for setting --min_len and --max_len. I could not get my reads longer than the longest read in the .sam file, but I have now found out that this was because I didn't increase the largest bin.

I still don't know why my bins did not work (99% of the reads should have been 0-50 bp), but using @HenrivdGeest's method the bins in the ecdf files DO seem to work; he just went the other way around (creating only very large reads, instead of small ones, to check the data). If I apply his method, I get the desired read lengths, even if I leave out --min_len and --max_len. So this suggests that creating only short reads does not work, maybe because it wants a minimal coverage of 1 or something?

ecdf files:

bin 0-500000
0-50    0.1
50-100  0.2
100-50000   0.5
50000-500000    1

Results from this query (read length histogram attached): python simulator.py linear --min_len 200000 --max_len 500000 -r /path/to/reference1MB.fasta -o /path/to/output/ -c /path/to/training/ -n 100

Results from this query (read length histogram attached): python simulator.py linear -r /path/to/reference1MB.fasta -o /path/to/output/ -c /path/to/training/ -n 100

So it does seem to follow the bin files now? (Around 30% in the 50-100 range, 20% in the 100-50000 range and 50% in the 50000-500000 range.)

I do have to run further tests to see how the data look if I use a more reasonable distribution and generate more than 100 reads. However, I am now concerned about a few things. In general, the training is made from the .sam file, whose reads had a maximum length of 237,800. What will happen to all the training parameters beyond this length? For example, _unaligned_length_ecdf does not seem to change anything (as @HenrivdGeest said), and maybe there are other files that regulate certain properties of reads in specific length ranges. So what will happen to reads larger than the range of the original .sam file used for training (in this case larger than 237,800)?

cheny19 commented 5 years ago

Hi @FiniDG-HAN, sorry for the late reply. If you want reads longer than the range in the training set, you have to change both ecdf files (aligned and unaligned).

I've been testing the read length distributions lately and we decided to use kernel density functions to simulate read lengths instead of the ECDF (bin) files. If you want control over the read lengths, you can use --median_len, --sd_len, --max_len, and --min_len. When you set --median_len and --sd_len, NanoSim will generate read lengths following a log-normal distribution for both aligned and unaligned reads. Please download the latest release and give it a shot. I'd love to hear your feedback.
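For readers trying to reason about the new parameters, here is a small sketch of one plausible way --median_len and --sd_len map onto a log-normal draw (for a log-normal, the median equals exp(mu), so mu = ln(median), and --sd_len is the sigma of the underlying normal); NanoSim's actual implementation may differ:

    import numpy as np

    median_len = 12867   # --median_len
    sd_len = 1.0         # --sd_len: sigma in log space, not the SD of the lengths themselves
    min_len, max_len = 50, 77947

    rng = np.random.default_rng(42)
    lengths = rng.lognormal(mean=np.log(median_len), sigma=sd_len, size=100_000)
    lengths = lengths[(lengths >= min_len) & (lengths <= max_len)]  # clip to the bounds
    print(f"median ~ {np.median(lengths):.0f} nt, mean ~ {lengths.mean():.0f} nt")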

FiniDG commented 5 years ago

Thanks @cheny19 for your reply! It looks promising to use in combination with --min_len and --max_len. However, in the coming weeks I won't have any time to check this new version, because of the holidays and other priorities. I will certainly have a look around mid-January and let you know how it works out.

I will let you decide whether this issue should be closed now or later.

Thanks for the feedback and support over the last few weeks!

FiniDG commented 5 years ago

Hello @cheny19, I had time to look into the new version, but I encountered some problems. I made the training folder successfully, using the same files as before. The simulation stage is where it goes wrong with the data I want to simulate. I was hoping you could help me further.

I used the --median_len, --sd_len, --max_len and --min_len parameters, derived from real data, to simulate Nanopore data with NanoSim: python simulator.py linear -r /home/user/sequence.fasta -c /home/user/training/ -o /home/user/READS/ -n 197459 --median_len 12867 --sd_len 8957 --max_len 77947 --min_len 50

I get this error:

Traceback (most recent call last):
  File "simulator.py", line 737, in <module>
    main()
  File "simulator.py", line 728, in main
    max_readlength, min_readlength, median_readlength, sd_readlength)
  File "simulator.py", line 356, in simulation
    ref = int(ref_l[j])
OverflowError: cannot convert float infinity to integer

So I guess something is wrong with my parameters, but they are derived from real long-read sequencing data, so I don't really understand where I went wrong. I suppose the parameters cause an infinite value to be generated somewhere? Some reads were still simulated, but in later steps of my workflow it generates errors because there are too few reads.

Thanks again for the help!

Edit: I will try some other parameters myself, to see if that is really the problem.

Here is also the .log file (which I just discovered):

2019-01-08 15:50:20: Read error profile
2019-01-08 15:50:20: Read KDF of unaligned reads
2019-01-08 15:50:20: Read KDF of aligned reads
2019-01-08 15:50:20: Simulating read length with log-normal distribution
2019-01-08 15:50:20: Read in reference genome
.
2019-01-08 15:50:21: Start simulation of random reads
2019-01-08 16:48:28: Start simulation of aligned reads

Edit2: I tried several other parameters now, including

-n 38944 --median_len 64688 --sd_len 48010 --max_len 250000 --min_len 50
-n 150000 --median_len 5368 --sd_len 11085 --max_len 269087 --min_len 5

I keep getting the same error after several hours, and the log files always stop at the "Start simulation of aligned reads" stage.

cheny19 commented 5 years ago

Hi @FiniDG-HAN ,

The other parameters are OK, except for --sd_len. It is the standard deviation of the log-normal distribution (in log space), so it usually ranges between 0 and 1. Giving it such a large number will result in infinite read lengths, because the distribution is log-normal. I'd suggest trying 1 first and seeing how it goes; then you can tweak it to your needs.
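A quick arithmetic sketch of why such a large --sd_len produces infinite lengths, assuming the standard log-normal parameterization (length = exp(mu + sigma * z)):

    import math
    import numpy as np

    median_len, sd_len = 12867, 8957
    mu = math.log(median_len)          # ~9.46
    z = 0.5                            # an unremarkable standard-normal draw

    # float64 overflows to inf once the exponent exceeds ~709, so sd_len = 8957
    # makes almost every draw infinite, which then breaks the int() conversion
    # seen in the traceback above.
    print(np.exp(mu + sd_len * z))     # inf
    print(np.exp(mu + 1.0 * z))        # ~21000 nt with sd_len = 1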

FiniDG commented 5 years ago

Hello @cheny19, sorry for all the (maybe stupid) questions. I really appreciate the support.

I tried setting --sd_len to 1, but after 4 days it was still running. Since Thursday the reads totalled 57.7 Gb (already much more than I expected with these settings), but today (Monday) they are still at 57.7 Gb and the .log file still shows "Start simulation of aligned reads". So after no changes for 3 days I interrupted simulator.py, with this result:

python simulator.py linear -r /home/user/sequence.fasta -c /home/user/training/ -o /home/user/reads/ -n 197459 --median_len 12867 --sd_len 1 --max_len 77947 --min_len 50

^CTraceback (most recent call last):
  File "simulator.py", line 737, in <module>
    main()
  File "simulator.py", line 728, in main
    max_readlength, min_readlength, median_readlength, sd_readlength)
  File "simulator.py", line 375, in simulation
    new_read, new_read_name = extract_read(dna_type, middle_ref)
  File "simulator.py", line 438, in extract_read
    ref_pos = random.randint(0, genome_len)
  File "/usr/lib64/python2.7/random.py", line 241, in randint
    return self.randrange(a, b+1)
KeyboardInterrupt

I know I have to change the settings to lower values (57.7 Gb was way too high), but I was still wondering why the program seemed to stall after Thursday. Is the ratio of the values still unrealistic? I also tried to find out what the SD of a log-normal distribution is, because I cannot find anything on the internet saying it should be between 0 and 1. Taking the log of my earlier SD value (8957) gives 3.95.

cheny19 commented 5 years ago

Could you check your simulated reads and see how many reads are there already? It seems you were trying to simulate nearly 200k reads, so it would take quite some time. This is because the random number generator for the log-normal distribution runs slower than drawing random numbers from a kernel density function. Also, please check the release page; I have sample --median_len and --sd_len values there for one chemistry. You might start from there.
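As a sketch, one quick way to check progress (the filename is a placeholder for the FASTA that NanoSim has been writing):

    # Count how many reads have been written so far by counting FASTA headers.
    with open("simulated_reads.fasta") as fasta:
        n_reads = sum(1 for line in fasta if line.startswith(">"))
    print(n_reads)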

FiniDG commented 5 years ago

Hello @cheny19 ,

Sorry to bother you again, but I still do not quite follow the new parameters of this program. I trimmed down the parameters a lot, but I still get unexpected results.

parameters used: python simulator.py linear -r /home/user/sequence.fasta -c /home/user/training/ -o /home/user/reads/ -n 20000 --median_len 12867 --sd_len 0.3 --max_len 77947 --min_len 50

result:

sum = 18506444123, n = 20000, ave = 925322.21, largest = 36903986
N50 = 36889304, n = 251
N60 = 36888476, n = 301
N70 = 36887689, n = 352
N80 = 36886568, n = 402
N90 = 36885523, n = 452
N100 = 3515, n = 20000

and

General summary:        
Mean read length:               925,322.2
Median read length:              12,143.5
Number of reads:                 20,000.0
Read length N50:             36,889,304.0
Total bases:             18,506,444,123.0

(Read length histograms attached: raw and log-transformed.)

So I think I am doing something wrong, but I do not know what or why. The N50 seems really high to me, as it is almost as high as the maximum read length. I also still do not understand the 0-1 sd value; I believe this might be the problem. I think this is the last puzzle piece I need. I really appreciate the help!

After this I will just use the default values that come from the training and stop trying to make my own parameters work; maybe they are just incompatible. Using the training without changing parameters works very well, so I will just continue with that.

Thanks again for all the help along the way!

cheny19 commented 5 years ago

Hi @FiniDG-HAN, I tried your command and it works fine for me. I don't see any weird lengths larger than the preset --max_len. But I did fix a small bug that could generate negative lengths, so could you download the latest version and give it a shot? Sorry about the inconvenience.