arq5x / lumpy-sv

lumpy: a general probabilistic framework for structural variant discovery

Insert size calculations differ between LUMPY Express and recommendations for calling LUMPY directly #235

Open avakel opened 6 years ago

avakel commented 6 years ago

In the Example Workflows for LUMPY, the recommendation for calculating insert size statistics is:

samtools view -r readgroup1 sample.bam \
    | tail -n+100000 \
    | scripts/pairend_distro.py \
    -r 101 \
    -X 4 \
    -N 10000 \
    -o sample.lib1.histo

This starts at the 100,000th read (tail -n+100000) and builds the distribution from the first 10,000 of the remaining reads (-N 10000) that meet the criteria for inclusion in pairend_distro.py.
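For reference, a minimal sketch of what that sampled stream is used for looks roughly like the following. This is not the actual pairend_distro.py (which, as I understand it, also trims the tail at the -X cutoff and writes the histogram to -o); the flag filtering here is my own approximation:

#!/usr/bin/env python3
# Rough approximation (not pairend_distro.py itself) of what the sampled
# SAM stream is used for: take TLEN from the first N usable read-1
# records of proper pairs and summarize the insert size distribution.
import sys
import statistics

N = 10000          # mirrors -N 10000 in the documented command
sizes = []

for line in sys.stdin:
    if line.startswith("@"):                      # skip SAM header lines
        continue
    fields = line.rstrip("\n").split("\t")
    flag, tlen = int(fields[1]), int(fields[8])   # FLAG and TLEN columns
    # keep primary, properly paired, first-in-pair reads with a positive TLEN
    if flag & 0x900 or not flag & 0x2 or not flag & 0x40 or tlen <= 0:
        continue
    sizes.append(tlen)
    if len(sizes) >= N:
        break

if not sizes:
    sys.exit("no usable pairs found in the sampled reads")
print("mean:%.2f stdev:%.2f" % (statistics.mean(sizes), statistics.pstdev(sizes)))

So whichever reads reach the script first are the ones that define the library's mean and standard deviation, which is why where the sample is drawn from (and how many reads it contains) matters.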

For LUMPY Express, the insert size stats are calculated here with:

$PYTHON $BAMGROUPREADS --fix_flags -i $FULL_BAM -r ${LIB_RG_LIST[$j]} \
        | $SAMBLASTER --acceptDupMarks --excludeDups --addMateTags --maxSplitCount $MAX_SPLIT_COUNT --minNonOverlap $MIN_NON_OVERLAP \
            --splitterFile $TEMP_DIR/spl_pipe --discordantFile $TEMP_DIR/disc_pipe \
        | $SAMT view -S /dev/stdin \
        | gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }' \
        | $PYTHON $PAIREND_DISTRO -r $READ_LENGTH -X 4 -N 1000000 -o ${TEMP_DIR}/$OUTBASE.sample$(($i+1)).lib$(($j+1)).x4.histo \
        > ${TEMP_DIR}/$OUTBASE.sample$(($i+1)).lib$(($j+1)).insert.stats &

This takes reads starting with the 1st up to the 1,000,000th (gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }') and builds the distribution from up to 1,000,000 of them (-N 1000000) that meet the criteria for inclusion in pairend_distro.py.

These two approaches seem to reflect different goals in generating the distribution. The LUMPY recommendations make an explicit effort to skip the first ~100,000 reads before calculating the distribution, while the LUMPY Express implementation includes them. The LUMPY Express implementation also suggests that up to 1,000,000 reads are needed to build the distribution, 100x more than the LUMPY recommendations use. Is either or both of these choices important for achieving a representative distribution? Neither sample will be well distributed across the genome, since the bams have already been sorted at this point.
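To make the comparison concrete, here is a rough sketch of how two of the resulting .histo outputs could be compared (this assumes the .histo format is one size<TAB>frequency pair per line; compare_histos.py and the file names below are just placeholders):

#!/usr/bin/env python3
# compare_histos.py: report the mean of each histogram and the total
# variation distance between them (0 = identical, 1 = completely disjoint).
# Assumes each line of a .histo file is "insert_size<TAB>frequency".
import sys

def load_histo(path):
    dist = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip():
                continue
            size, freq = line.split()
            dist[int(size)] = float(freq)
    total = sum(dist.values())
    return {size: freq / total for size, freq in dist.items()}   # re-normalize

a = load_histo(sys.argv[1])
b = load_histo(sys.argv[2])

mean_a = sum(size * freq for size, freq in a.items())
mean_b = sum(size * freq for size, freq in b.items())
tv = 0.5 * sum(abs(a.get(s, 0.0) - b.get(s, 0.0)) for s in set(a) | set(b))

print("mean A: %.2f  mean B: %.2f  TV distance: %.4f" % (mean_a, mean_b, tv))

e.g. python compare_histos.py sample.lib1.10k.histo sample.lib1.1M.histo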

In my own quick testing, the distributions were more similar when using 1,000,000 reads than 10,000. It's also worth noting that (at least in my testing) gawk '{ if (NR<=1000000) print > "/dev/stdout" ; else print > "/dev/null" }' was much slower than head -n 1000000, which achieves the same goal of passing through only the first 1,000,000 lines.
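If the per-line gawk redirection is the bottleneck, the same line cap can also be expressed in Python; a minimal sketch (cap_lines.py is a placeholder name) that, like head, exits as soon as the limit is reached rather than draining the rest of the stream the way the gawk command does:

#!/usr/bin/env python3
# cap_lines.py: pass through only the first LIMIT lines of stdin.
# Like head -n LIMIT (and unlike the gawk filter), this exits once the
# limit is hit, so upstream commands will see the pipe close early.
import sys
from itertools import islice

LIMIT = 1000000   # mirrors NR<=1000000 in the gawk filter

for line in islice(sys.stdin, LIMIT):
    sys.stdout.write(line)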