when only 100 telo reads, results are unstable

byb121 commented 5 years ago

Telomerecat log:

 Generating TELBAM from: /home/yaobo/temp_data/slice_of_PD43997c.sample.dupmarked.bam
    - TELBAM generation started 2019-11-22 15:52:11
    - This run will output the following files:
        + slice_of_PD43997c.sample.dupmarked_telbam.bam
    * Total:443678 telbam:104 | Time: 5
    - Read pairing in progress: 100.00% complete
    - Total reads processed: 443678
    - TELBAM generation finished 2019-11-22 15:53:03

 Collecting meta-data for all samples | 2019-11-22 15:53:03
    - slice_of_PD43997c.sample.dupmarked.bam | 2019-11-22 15:53:03

 Commencing length estimation | 2019-11-22 15:53:22
    - slice_of_PD43997c.sample.dupmarked.bam | 2019-11-22 15:53:22

 Length estimation results written to the following file:
    ./py2_2_small_test_telo.csv

-------------------------------------------------------------------

Telomerecat results

$ cat ~/temp_data/py2_small_test_telo.csv
Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
slice_of_PD43997c.sample.dupmarked.bam,18,5,2,0.201,334.0,98.459,151,151,3,3,1414.4

$ cat ~/temp_data/py2_2_small_test_telo.csv
Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
slice_of_PD43997c.sample.dupmarked.bam,18,5,2,0.201,334.0,98.459,151,151,3,3,1632.4

$ cat ~/temp_data/py3_small_test_telo.csv
Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
slice_of_PD43997c.sample.dupmarked.bam,18,5,2,0.201,334.32,98.459,151,151,3,3,1404.7

$ cat ~/temp_data/py3_2_small_test_telo.csv
Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
slice_of_PD43997c.sample.dupmarked.bam,18,5,2,0.201,334.32,98.459,151,151,3,3,1401.6

byb121 commented 5 years ago

This is probably still the case when there are more reads.

byb121 commented 5 years ago

This is due to an error model used in the code, and it seems to be acceptable to some users. Leaving it for now.

mfoll commented 4 years ago

I have a similar issue with many more reads (WGS at 30X coverage): the estimated length varies over a wide range (7kb to 13kb) across runs of telomerecat. I tried increasing the -N option:

  -N INT, --simulator_runs INT
                        The amount of times to run the length simulator.
                        A higher number better captures the uncertainty 
                        produced by the insert length
                        distribution [Deafult 10]

but it doesn't seem to be more stable (the last two lines are with -N 10000):

Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
60820188484223.bam,27738,3752,3109,1.837,357.0,128.664,151,151,643,643,9619.0
60820188484223.bam,28489,3676,2808,1.837,357.0,128.664,151,151,868,868,7215.91
60820188484223.bam,28025,3713,2990,1.837,357.0,128.664,151,151,723,723,8505.18
60820188484223.bam,27848,3704,3094,1.837,357.0,128.664,151,151,610,610,10052.293
60820188484223.bam,28068,3721,2932,1.837,357.0,128.664,151,151,789,789,7840.707
60820188484223.bam,28203,3768,2888,1.837,357.0,128.664,151,151,880,880,7088.456
60820188484223.bam,27440,3712,3269,1.837,357.0,128.664,151,151,443,443,13771.118
60820188484223.bam,27267,3820,3319,1.837,357.0,128.664,151,151,501,501,12035.8988
60820188484223.bam,28109,3714,2951,1.837,357.0,128.664,151,151,763,763,8109.6292
60820188484223.bam,27687,3701,3172,1.837,357.0,128.664,151,151,529,529,11550.6246

Any recommendation?
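
For reference, the spread across the ten rows above can be quantified with a short script. This is only a sketch over the numbers quoted in this comment; the names RUNS_CSV, summarize and pearson are made up for illustration and are not part of telomerecat. It also checks two patterns that are visible in the rows: F2a equals F2 - F4 in every run, and Length drops as F2a rises.

```python
import csv
import io
import statistics

# The ten result rows quoted above: same BAM, ten separate telomerecat runs.
RUNS_CSV = """\
Sample,F1,F2,F4,Psi,Insert_mean,Insert_sd,Read_length,Initial_read_length,F2a,F2a_c,Length
60820188484223.bam,27738,3752,3109,1.837,357.0,128.664,151,151,643,643,9619.0
60820188484223.bam,28489,3676,2808,1.837,357.0,128.664,151,151,868,868,7215.91
60820188484223.bam,28025,3713,2990,1.837,357.0,128.664,151,151,723,723,8505.18
60820188484223.bam,27848,3704,3094,1.837,357.0,128.664,151,151,610,610,10052.293
60820188484223.bam,28068,3721,2932,1.837,357.0,128.664,151,151,789,789,7840.707
60820188484223.bam,28203,3768,2888,1.837,357.0,128.664,151,151,880,880,7088.456
60820188484223.bam,27440,3712,3269,1.837,357.0,128.664,151,151,443,443,13771.118
60820188484223.bam,27267,3820,3319,1.837,357.0,128.664,151,151,501,501,12035.8988
60820188484223.bam,28109,3714,2951,1.837,357.0,128.664,151,151,763,763,8109.6292
60820188484223.bam,27687,3701,3172,1.837,357.0,128.664,151,151,529,529,11550.6246
"""


def summarize(values):
    """Return basic spread statistics for a list of numbers."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return {
        "n": len(values),
        "mean": mean,
        "sd": sd,
        "cv": sd / mean,  # coefficient of variation
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient, written out to avoid version-specific helpers."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


rows = list(csv.DictReader(io.StringIO(RUNS_CSV)))
lengths = [float(r["Length"]) for r in rows]
f2a = [int(r["F2a"]) for r in rows]

# In these rows the F2a column matches F2 - F4 exactly (an observation
# about this output, not a statement about telomerecat internals).
f2a_matches = all(int(r["F2"]) - int(r["F4"]) == int(r["F2a"]) for r in rows)

summary = summarize(lengths)
r_f2a_length = pearson(f2a, lengths)

print("Length across runs:", {k: round(v, 3) for k, v in summary.items()})
print("F2a == F2 - F4 in every row:", f2a_matches)
print("Pearson r (F2a vs Length):", round(r_f2a_length, 3))
```

On these rows Length spans roughly 7.1 kb to 13.8 kb, and the F1, F2 and F4 counts themselves change from run to run. If that holds in general, the variation arises before the length simulator runs, which might be why raising -N had little effect; that is a guess from this output, not something confirmed against the code.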

eliu98 commented 1 year ago

Were you able to solve this problem?
