In the Example Workflows for LUMPY, the recommendation for calculating insert size statistics is:
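For reference, that recommended command looks roughly like the following; the BAM name, read group, and read length are placeholder values from the example, not anything specific to my data:

    # LUMPY example workflow: sample the insert size distribution per library
    samtools view -r readgroup1 sample.bam \
        | tail -n+100000 \
        | scripts/pairend_distro.py \
            -r 101 \
            -X 4 \
            -N 10000 \
            -o sample.lib1.histo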
This takes reads starting with the 100,000th (tail -n+100000) and builds the distribution from the first 10,000 (-N 10000) of those that meet the criteria for inclusion in pairend_distro.py.

For LUMPY Express, the insert size stats are calculated here with:
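The relevant pipeline inside the lumpyexpress script is roughly the following (paraphrased; in the actual script the BAM, read length, and output histogram are shell variables rather than the literal values shown here):

    # LUMPY Express: sample the insert size distribution from the first 1,000,000 alignments
    samtools view sample.bam \
        | gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }' \
        | scripts/pairend_distro.py \
            -r 101 \
            -X 4 \
            -N 1000000 \
            -o sample.lib1.histo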
This takes reads starting with the 1st up to the 1,000,000th (gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }') and builds the distribution from up to 1,000,000 (-N 1000000) of those that meet the criteria for inclusion in pairend_distro.py.

Each of these solutions seems to reflect a different goal in generating the distribution. The LUMPY recommendation makes an explicit effort to skip the first 100,000 reads when calculating the distribution, while the LUMPY Express implementation uses those 100,000. The LUMPY Express implementation also suggests that up to 1,000,000 reads are needed to build the distribution, 100x as many as the LUMPY recommendation uses. Is either of these goals (or are both) important for achieving a representative distribution? Neither set of reads will be well distributed across the genome, since the BAMs have already been sorted at this point.
In my own quick testing, the distributions were more similar when using 1,000,000 reads than 10,000. It's also worth noting that (at least in my testing) gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }' was much slower than head -n 1000000, which achieves the same goal of capturing only the first 1,000,000 lines.
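As a sketch, the substitution I mean is simply this (same placeholder names and pairend_distro.py parameters as above):

    # head terminates after 1,000,000 lines instead of reading the whole stream
    samtools view sample.bam \
        | head -n 1000000 \
        | scripts/pairend_distro.py \
            -r 101 \
            -X 4 \
            -N 1000000 \
            -o sample.lib1.histo

The speed difference makes sense: the gawk version still reads every remaining alignment (it just redirects them to /dev/null), while head exits after 1,000,000 lines and closes the pipe, so samtools view stops early via SIGPIPE. The only caveat is that the early SIGPIPE can show up as a non-zero exit status if the pipeline is run under set -o pipefail.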